What does Python's crawler mean?
A Python crawler is a web crawler (also called a web spider or web robot) written in Python: a program or script that automatically fetches information from the World Wide Web according to certain rules. Less common names include ants, automatic indexers, emulators, and worms. In plain terms, a crawler is a program that retrieves the data you want from web pages, that is, it captures that data automatically.
A web crawler, also called a web spider, is a web robot that automatically browses the World Wide Web, usually in order to build web indexes.
Web search engines and some other sites use crawler software to update their own content or their indexes of other sites. Web crawlers can save copies of the pages they visit so that a search engine can later index them for users to search.
Crawling consumes resources on the target system, and many sites do not permit crawlers by default. When visiting large numbers of pages, a crawler therefore needs to consider scheduling, load, and "politeness". Public sites that do not want to be crawled can tell crawlers to stay away using mechanisms such as a robots.txt file, which can ask the robot to index only part of the site or none of it.
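As a minimal sketch of how a polite crawler can honor robots.txt (the target site https://example.com and the user-agent name are made up for illustration), Python's standard-library urllib.robotparser can be used to check permissions before fetching a page:

```
from urllib import robotparser

# Hypothetical target site, used only for illustration.
BASE_URL = "https://example.com"

rp = robotparser.RobotFileParser()
rp.set_url(BASE_URL + "/robots.txt")
rp.read()  # download and parse the site's robots.txt

url = BASE_URL + "/some/page.html"
if rp.can_fetch("MyCrawler/1.0", url):
    print("Allowed to fetch:", url)
else:
    print("robots.txt disallows:", url)
```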
There are so many pages on the Internet that even the largest crawler systems cannot index them all. In the early days of the World Wide Web, before 2000, search engines therefore often returned few relevant results; today's search engines have improved greatly in this regard and can deliver high-quality results almost instantly.
Crawlers can also be used to validate hyperlinks and HTML code, and for web scraping.
Python crawler
Python crawler architecture
A Python crawler architecture mainly consists of five parts: a scheduler, a URL manager, a web page downloader, a web page parser, and the application (the valuable data that has been crawled).
Scheduler: plays the role of a computer's CPU; it is mainly responsible for coordinating the URL manager, the downloader, and the parser.
URL manager: keeps track of the URLs waiting to be crawled and the URLs that have already been crawled, which prevents crawling the same URL repeatedly or getting stuck in loops. A URL manager is typically implemented in one of three ways: in memory, in a database, or in a cache database.
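A minimal in-memory URL manager might look like the sketch below; the class and method names are illustrative, not part of any standard library:

```
class UrlManager:
    """Tracks URLs waiting to be crawled and URLs already crawled."""

    def __init__(self):
        self.new_urls = set()   # URLs not yet fetched
        self.old_urls = set()   # URLs already fetched

    def add_new_url(self, url):
        # Ignore URLs we have already seen, to avoid duplicates and loops.
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def has_new_url(self):
        return len(self.new_urls) > 0

    def get_new_url(self):
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url
```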
Web page downloader: given a URL, downloads the page and converts it into a string. Common downloaders are urllib2 (Python 2's official basic module; its functionality lives in urllib.request in Python 3), which can handle pages that require login, proxies, or cookies, and requests (a third-party package).
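A minimal downloader sketch using Python 3's standard-library urllib.request (the example URL is hypothetical; requests could be swapped in for more convenience):

```
from urllib import request

def download(url, timeout=10):
    """Fetch a page and return its contents as a string, or None on failure."""
    try:
        with request.urlopen(url, timeout=timeout) as resp:
            if resp.status != 200:
                return None
            return resp.read().decode("utf-8", errors="replace")
    except Exception:
        return None

html = download("https://example.com/")  # hypothetical URL
```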
Web page parser: extracts the useful information we want from the page string; a page can also be parsed through its DOM tree. Common parsers include regular expressions (intuitively, the page is treated as one string and valuable information is pulled out by pattern matching, which becomes very difficult when the document is complex), html.parser (bundled with Python), BeautifulSoup (a third-party package that can use either the built-in html.parser or lxml as its underlying parser, and is more convenient than either used alone), and lxml (a third-party package that can parse XML as well as HTML). html.parser, BeautifulSoup, and lxml all parse the page as a DOM tree.
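A minimal parsing sketch with BeautifulSoup (a third-party package installed separately; the HTML snippet here is made up for illustration):

```
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = """
<html><body>
  <a href="/page1.html" class="item">First article</a>
  <a href="/page2.html" class="item">Second article</a>
</body></html>
"""

# Parse the string into a DOM tree using Python's built-in html.parser.
soup = BeautifulSoup(html, "html.parser")

# Extract each link's target and text.
for a in soup.find_all("a", class_="item"):
    print(a["href"], a.get_text())
```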
Application: the program built from the useful data extracted from the web pages.
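Tying the five parts together, a minimal crawl loop might look like the following sketch. It reuses the illustrative UrlManager and download helpers sketched above, and the start URL is hypothetical:

```
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=10):
    """A tiny scheduler: coordinates the URL manager, downloader, and parser."""
    manager = UrlManager()          # URL manager sketched earlier
    manager.add_new_url(start_url)
    collected = []                  # the "application": data we keep

    while manager.has_new_url() and len(manager.old_urls) < max_pages:
        url = manager.get_new_url()
        html = download(url)        # downloader sketched earlier
        if html is None:
            continue
        soup = BeautifulSoup(html, "html.parser")
        title = soup.title.get_text() if soup.title else ""
        collected.append((url, title))
        # Feed newly discovered links back to the URL manager.
        for a in soup.find_all("a", href=True):
            manager.add_new_url(urljoin(url, a["href"]))
    return collected

# data = crawl("https://example.com/")  # hypothetical start URL
```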
What can a crawler do?
You can use a crawler to grab pictures, videos, and any other data you want. As long as you can reach the data through a browser, you can obtain it with a crawler.
What is the essence of a crawler?
It simulates a browser opening a web page and extracts the part of the page's data that we want.
The process of the browser opening the web page:
After you enter an address in the browser, the server host is located through DNS and a request is sent to it. The server processes the request and returns the results to the browser, including HTML, JS, CSS, and other files. The browser then parses these files and renders the result the user finally sees.
So what the user sees in the browser is built from HTML code. A crawler obtains the resources we want by analyzing and filtering that HTML code.
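As a small sketch of this idea (the URL is hypothetical and the User-Agent string merely imitates a browser), a crawler "simulates the browser" by sending the same kind of HTTP request a browser would, then filtering the returned HTML for the resources it wants:

```
from urllib import request
import re

# Hypothetical page; the User-Agent header imitates an ordinary browser.
req = request.Request(
    "https://example.com/gallery",
    headers={"User-Agent": "Mozilla/5.0 (compatible; demo-crawler)"},
)
with request.urlopen(req, timeout=10) as resp:
    html = resp.read().decode("utf-8", errors="replace")

# Filter the HTML for the resources we want, e.g. image URLs.
image_urls = re.findall(r'<img[^>]+src="([^"]+)"', html)
print(image_urls)
```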