
How to use the scrapy framework to loop through Jingdong data and then import it into Mysql

零到壹度 · Original · 2018-03-30 10:20:23

This article shares how to use the Scrapy framework to loop through JD.com product data and then import it into MySQL. I hope it serves as a useful reference.

JD.com has an anti-crawling mechanism, so I set a browser User-Agent header to disguise the crawler as an ordinary browser.
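As a quick illustration, this is the pattern the spider below uses to attach the User-Agent to a Scrapy request (a minimal sketch with a hypothetical demo spider name; the UA string is just a common Chrome one):

import scrapy
from scrapy.http import Request

class UASpider(scrapy.Spider):
    name = "ua_demo"  # hypothetical name, for illustration only
    header = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"}

    def start_requests(self):
        # The headers argument makes the request look like it comes from Chrome
        yield Request("https://list.jd.com/list.html?cat=9987,653,655&page=1",
                      headers=self.header, callback=self.parse)

    def parse(self, response):
        print(response.status)  # 200 if the disguise worked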

The data crawled is the mobile-phone listing information from JD Mall, starting from this URL: https://list.jd.com/list.html?cat=9987,653,655&page=1

There are about 9,000 records in total; products that do not appear in the list pages are not included.

Problems encountered:

1. It is best to wrap the request in its own method (use_proxy) with a User-Agent set. I originally wrote the fetching code directly inside parse and hit a "not enough values to unpack" error. I couldn't tell which statement raised it, so I added a print after each line of code and traced the problem to urlopen(). After repeated attempts and searching online I still couldn't find the cause, so I solved it by moving the call into a separate method. In hindsight, it was probably because the parse method was already handling the response.
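The working pattern, condensed (a sketch of the idea with a hypothetical helper name; the full use_proxy method appears in the code below):

import urllib.request

def fetch(url):
    # Build the request with a browser User-Agent before calling urlopen()
    req = urllib.request.Request(url)
    req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36")
    # Read and decode the body, ignoring characters that fail to decode
    return urllib.request.urlopen(req).read().decode("utf-8", "ignore")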

2. Before importing the data into MySQL, I first tried writing it to a file. During the import I noticed the file size kept flickering between 0 KB and 1 KB but never grew, which meant each write was overwriting the previous one. At first I thought I had put fh.close() in the wrong place, but then it suddenly occurred to me that

fh = open("D:/pythonlianxi/result/4.txt", "w")

is wrong: 'w' should be changed to 'a' so that writes append instead of overwriting.
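The difference in one sketch (hypothetical path):

# 'w' truncates the file every time it is opened, so earlier results are lost:
# fh = open("D:/pythonlianxi/result/4.txt", "w")
# 'a' appends, so results accumulate across writes:
fh = open("D:/pythonlianxi/result/4.txt", "a")
fh.write("one record\n")
fh.close()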

3. Importing into the database. The main problem was Chinese character encoding. First open MySQL and run show variables like '%char%'; to check the database's character-set settings, then use the matching encoding in your code. In my case the database uses utf8; gbk did not work. Also, don't forget charset='utf8' when connecting to MySQL.
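The same check can be run from Python (a minimal sketch, assuming the same local connection parameters used below):

import pymysql

conn = pymysql.connect(host="127.0.0.1", user="root", passwd="root", db="jingdong", charset="utf8")
cursor = conn.cursor()
cursor.execute("show variables like '%char%'")
for name, value in cursor.fetchall():
    print(name, value)  # character_set_database should report utf8
conn.close()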

For example, the connection call with the matching charset looks like this:

conn = pymysql.connect(host="127.0.0.1", user="root", passwd="root", db="jingdong", charset="utf8")
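The INSERT statement in the spider below expects a jd table with title, price, and person columns. The article does not show the schema, so this is an assumed minimal version consistent with that statement:

import pymysql

conn = pymysql.connect(host="127.0.0.1", user="root", passwd="root", db="jingdong", charset="utf8")
cursor = conn.cursor()
# Assumed column types: the spider stores title as text, price as a float, person as a string
cursor.execute("""
    create table if not exists jd(
        title varchar(255),
        price decimal(10,2),
        person varchar(50)
    ) default charset=utf8
""")
conn.commit()
conn.close()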

The full spider code:

import scrapy
from scrapy.http import Request
from jingdong.items import JingdongItem
import re
import urllib.error
import urllib.request
import pymysql

class JdSpider(scrapy.Spider):
    name = 'jd'
    allowed_domains = ['jd.com']
    # start_urls = ['http://jd.com/']
    header = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"}
    # fh = open("D:/pythonlianxi/result/4.txt", "w")

    def start_requests(self):
        # Request the first list page with a browser User-Agent to get past the anti-crawling check
        return [Request("https://list.jd.com/list.html?cat=9987,653,655&page=1",
                        callback=self.parse, headers=self.header, meta={"cookiejar": 1})]

    def use_proxy(self, proxy_addr, url):
        # Fetch a URL through an HTTP proxy with a browser User-Agent and return the decoded body
        try:
            req = urllib.request.Request(url)
            req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36")
            proxy = urllib.request.ProxyHandler({"http": proxy_addr})
            opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
            urllib.request.install_opener(opener)
            data = urllib.request.urlopen(req).read().decode("utf-8", "ignore")
            return data
        except urllib.error.URLError as e:
            if hasattr(e, "code"):
                print(e.code)
            if hasattr(e, "reason"):
                print(e.reason)
        except Exception as e:
            print(str(e))

    def parse(self, response):
        item = JingdongItem()
        proxy_addr = "61.135.217.7:80"
        try:
            # Titles and SKU ids scraped from the current list page
            item["title"] = response.xpath("//p[@class='p-name']/a[@target='_blank']/em/text()").extract()
            item["pricesku"] = response.xpath("//li[@class='gl-item']/p/@data-sku").extract()

            # Queue the remaining list pages; each one comes back through this same parse method
            for j in range(2, 166):
                url = "https://list.jd.com/list.html?cat=9987,653,655&page=" + str(j)
                print(j)
                # yield item
                yield Request(url)

            pricepat = '"p":"(.*?)"'
            personpat = '"CommentCountStr":"(.*?)",'
            # fh = open("D:/pythonlianxi/result/5.txt", "a")
            conn = pymysql.connect(host="127.0.0.1", user="root", passwd="root", db="jingdong", charset="utf8")

            for i in range(0, len(item["pricesku"])):
                # Price and comment count come from JD's JSON endpoints, keyed by SKU id
                priceurl = "https://p.3.cn/prices/mgets?&ext=11000000&pin=&type=1&area=1_72_4137_0&skuIds=" + item["pricesku"][i]
                personurl = "https://club.jd.com/comment/productCommentSummaries.action?referenceIds=" + item["pricesku"][i]
                pricedata = self.use_proxy(proxy_addr, priceurl)
                price = re.compile(pricepat).findall(pricedata)
                persondata = self.use_proxy(proxy_addr, personurl)
                person = re.compile(personpat).findall(persondata)

                title = item["title"][i]
                print(title)
                price1 = float(price[0])
                person1 = person[0]
                # fh.write(title + "\n" + str(price1) + "\n" + person1 + "\n")
                cursor = conn.cursor()
                sql = "insert into jd(title,price,person) values(%s,%s,%s);"
                params = (title, price1, person1)
                cursor.execute(sql, params)
                conn.commit()

            # fh.close()
            conn.close()
            return item
        except Exception as e:
            print(str(e))
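The spider imports JingdongItem from jingdong.items. items.py is not shown in the article; a minimal version consistent with the two fields used above would be:

import scrapy

class JingdongItem(scrapy.Item):
    title = scrapy.Field()     # product names extracted from each list page
    pricesku = scrapy.Field()  # SKU ids used to build the price and comment API URLs

With that in place, and assuming the Scrapy project itself is named jingdong, the spider is started with:

scrapy crawl jd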

I believe you have picked this up quickly. What are you waiting for? Go and practice it.


