Simply speaking, the Internet is a large network composed of sites and network devices. We access a site through a browser; the site returns HTML, JS, and CSS code to the browser, which parses and renders that code into the rich, colorful web pages we see.
What is a crawler?
If we compare the Internet to a large spider web, data is stored at each node of the web, and a crawler is a small spider that crawls along the web to capture its prey (the data). A crawler is a program that sends requests to a website, obtains its resources, and then analyzes and extracts useful data. From a technical perspective, it uses a program to simulate a browser requesting a site, downloads the HTML code, JSON data, or binary data (pictures, videos) the site returns, extracts the data it needs, and stores it for later use.
Basic process of a crawler
How users obtain network data:
Method 1: browser submits a request ---> download the web page code ---> parse it into a page
Method 2: simulate a browser sending a request (to get the web page code) ---> extract the useful data ---> save it in a database or file
A crawler only needs to carry out method 2; a minimal end-to-end sketch follows.
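The sketch below walks through method 2 in miniature, assuming the third-party requests and BeautifulSoup libraries are installed; https://example.com is a stand-in URL, and the extracted field (the page title) is purely illustrative.

```python
# Minimal sketch of method 2: request a page like a browser,
# extract one piece of data, and save it to a file.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com")      # simulate the browser's request
soup = BeautifulSoup(response.text, "html.parser")  # parse the returned HTML
title = soup.title.string                           # extract the useful data
with open("result.txt", "w", encoding="utf-8") as f:
    f.write(title)                                  # save it
```

Each of these steps is broken down below.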
Initiate a request
Use an HTTP library to initiate a request to the target site, that is, send a Request.
A Request contains: request headers, request body, etc.
A limitation of the requests module: it cannot execute JS or CSS code, so content that a browser would render via JS will be missing from the response.
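A sketch of initiating a Request with the requests library; the User-Agent string here is an illustrative value, often set so the site treats the program like a real browser.

```python
import requests

# Request headers: a browser-like User-Agent is a common (illustrative) choice
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

# Send the Request; requests fetches the raw response only --
# it does not execute JS or apply CSS
response = requests.get("https://example.com", headers=headers, timeout=10)
```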
Get response content
If the server responds normally, you get a Response.
A Response may include HTML, JSON, pictures, videos, etc.
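Continuing the sketch, the Response object exposes the status code, the response headers, and the body in both decoded-text and raw-bytes form:

```python
import requests

response = requests.get("https://example.com", timeout=10)
if response.status_code == 200:                  # server responded normally
    print(response.headers.get("Content-Type"))  # e.g. text/html; charset=UTF-8
    html = response.text                         # decoded text: HTML, a JSON string, ...
    raw = response.content                       # raw bytes: pictures, videos, ...
```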
Parsing content
Parsing HTML data: regular expressions (the re module), or third-party parsing libraries such as BeautifulSoup and pyquery
Parsing JSON data: the json module
Parsing binary data: write it to a file in wb mode
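Sketches of the three parsing cases just listed; the URLs are placeholders, so a real run needs endpoints that actually return HTML, JSON, and an image respectively.

```python
import json
import re

import requests
from bs4 import BeautifulSoup

# HTML: regular expressions (re) or a parser such as BeautifulSoup
html = requests.get("https://example.com").text
links = re.findall(r'href="(.*?)"', html)                # re module
title = BeautifulSoup(html, "html.parser").title.string  # BeautifulSoup

# JSON: the json module (placeholder endpoint)
data = json.loads(requests.get("https://example.com/api").text)

# Binary: write the raw bytes to a file in wb mode (placeholder image URL)
img = requests.get("https://example.com/logo.png").content
with open("logo.png", "wb") as f:
    f.write(img)
```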
Save data
Database (MySQL, MongoDB, Redis)
File
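A sketch of both storage options. For the database case, sqlite3 (in Python's standard library) stands in for MySQL, MongoDB, or Redis, which each need their own driver package (e.g. pymysql, pymongo, redis); the rows and table here are made up for illustration.

```python
import csv
import sqlite3

rows = [("Example Domain", "https://example.com")]  # stand-in extracted data

# Option 1: save to a file
with open("data.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

# Option 2: save to a database (sqlite3 as a stand-in)
conn = sqlite3.connect("crawl.db")
conn.execute("CREATE TABLE IF NOT EXISTS pages (title TEXT, url TEXT)")
conn.executemany("INSERT INTO pages VALUES (?, ?)", rows)
conn.commit()
conn.close()
```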