How to use Scrapy to crawl JD merchants’ product data
Scrapy is a powerful Python web-crawling framework that makes it easy to write code that extracts data from web pages. This article shows how to use Scrapy to crawl JD merchants' product data.
Preparation
Before we start writing code, we need to make some preparations.
1. Install Scrapy
We need to install Scrapy locally. If you have not installed Scrapy yet, you can enter the following command in the command line:
pip install Scrapy
2. Create Scrapy Project
Open the terminal and enter the following command:
scrapy startproject JDspider
This line of command will create a Scrapy project named JDspider in the current folder.
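With a recent Scrapy release, the generated project typically has a layout along these lines (the exact files can vary slightly between versions):

```
JDspider/
    scrapy.cfg            # deployment configuration
    JDspider/
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # our spiders will live here
            __init__.py
```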
3. Create Spider
In Scrapy, Spider is the core component for crawling data. We need to create a Spider to obtain the product data of JD merchants. Enter the following command on the command line:
cd JDspider
scrapy genspider JD jd.com
Here we use the scrapy genspider command to generate a Spider named JD, with jd.com as its allowed domain. The generated code is located in the JDspider/spiders/JD.py file. Now we need to edit this file to complete the crawler.
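For reference, the file that genspider produces is just a skeleton along these lines (the exact template varies slightly between Scrapy versions):

```python
import scrapy


class JdSpider(scrapy.Spider):
    name = "JD"
    allowed_domains = ["jd.com"]
    start_urls = ["https://jd.com"]

    def parse(self, response):
        pass
```

In the following steps we replace the start URL and fill in parse() with the actual extraction logic.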
Analyze the target website
Before writing the code, we need to analyze the target website first. Let’s take https://mall.jd.com/index-1000000127.html as an example.
Open the Chrome browser, press the F12 key to open the developer tools, and then click the Network tab. After entering the URL of the target website, we can see the request and response information of the target website.
We can find that the page loads its product list with AJAX. Filtering the Network tab by XHR, we can see the URL of the request, which returns data in JSON format.
We can directly access this URL to obtain product information.
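Before wiring this into the spider, it helps to confirm we can pull the fields out of such a JSON payload. The snippet below parses a hypothetical payload whose structure and field names (data, productList, name, price, storeName, totalSellCount) mirror the response described in this article; the real keys on JD's endpoint may differ, so adjust them to what you actually see in the developer tools.

```python
import json

# A hypothetical payload mirroring the structure described above;
# the real field names on JD's endpoint may differ.
raw = '''
{
    "data": {
        "productList": [
            {"name": "Sample Item", "price": "19.90",
             "storeName": "Sample Store", "totalSellCount": 1024}
        ]
    }
}
'''

# Drill down to the product list and read the fields we care about.
products = json.loads(raw)["data"]["productList"]
for product in products:
    print(product["name"], product["price"], product["storeName"])
```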
Get product data
Now that we know where the product data comes from, we can add code to the Spider to fetch it.
First open the JDspider/spiders/JD.py file and find the definition of the Spider class. We need to modify this class, defining its name, allowed domains, and start URLs.
class JdSpider(scrapy.Spider):
    name = "JD"
    allowed_domains = ["jd.com"]
    start_urls = [
        "https://pro.jd.com/mall/active/3W9j276jGAAFpgx5vds5msKg82gX/index.html"
    ]
Next, fetch the data. In Scrapy, the parse() method handles each response. We use the json module to parse the returned JSON data and extract the required fields (remember to add import json at the top of the file). Here we get each product's title, price, store address, and sales count.
def parse(self, response):
    products = json.loads(response.body)['data']['productList']
    for product in products:
        title = product['name']
        price = product['price']
        address = product['storeName']
        count = product['totalSellCount']
        yield {
            'title': title,
            'price': price,
            'address': address,
            'count': count,
        }
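AJAX endpoints change without notice, so direct key lookups like the ones above will raise KeyError as soon as a field is renamed or missing. A more defensive variant is sketched below; the key names mirror those used in parse() and are assumptions about JD's JSON, not confirmed field names.

```python
def extract_product(product):
    """Pull the fields we need from one product dict,
    tolerating missing keys by falling back to defaults.

    The key names ('name', 'price', 'storeName',
    'totalSellCount') are assumptions about JD's payload;
    adjust them to match the real response.
    """
    return {
        'title': product.get('name', ''),
        'price': product.get('price', ''),
        'address': product.get('storeName', ''),
        'count': product.get('totalSellCount', 0),
    }


# A record with missing fields no longer crashes the spider.
print(extract_product({"name": "Demo"}))
```

Inside parse(), the loop body would then shrink to `yield extract_product(product)`.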
Now the data-fetching logic is complete. We can run this spider and write the results to a file. Enter the following command in the terminal to start it:
scrapy crawl JD -o products.json
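With -o, Scrapy's feed exporter serializes every yielded item; for a .json feed the result is a JSON array of the dicts produced by parse(). The snippet below simulates such a file with illustrative data and reads it back with the stdlib, which is a quick way to sanity-check a crawl's output. Note that re-running with -o appends to an existing file, which can corrupt JSON output; newer Scrapy versions offer -O to overwrite instead.

```python
import json

# Simulate the kind of file `scrapy crawl JD -o products.json`
# writes: a JSON array of the dicts yielded by parse().
# The record below is illustrative data, not a real crawl result.
sample = [{"title": "Sample Item", "price": "19.90",
           "address": "Sample Store", "count": 1024}]
with open("products.json", "w", encoding="utf-8") as f:
    json.dump(sample, f, ensure_ascii=False)

# Load the feed back and inspect it.
with open("products.json", encoding="utf-8") as f:
    items = json.load(f)
print(len(items), items[0]["title"])
```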
This is a simple example that only demonstrates how to use Scrapy to crawl JD merchants' product data. Real-world crawls usually need more complex processing, and Scrapy provides many powerful tools and modules for that.