A web crawler is a program that automatically fetches web content; it is a key component of a search engine, downloading pages from the World Wide Web for indexing. Crawlers are generally divided into general-purpose (traditional) crawlers and focused crawlers.
Learning to write crawlers is a step-by-step process. For a complete beginner it can be roughly divided into three stages. The first stage is getting started: mastering the necessary basics. The second stage is imitation: following other people's crawler code and understanding every line of it. The third stage is doing it yourself: at this point you start to have your own ideas for solving problems and can design a crawler system independently.
The technologies involved in crawling include, but are not limited to: proficiency in a programming language (Python, in this article), knowledge of HTML, the basics of the HTTP/HTTPS protocol, regular expressions, databases, common packet-capture tools, and crawler frameworks. Large-scale crawling additionally requires an understanding of distributed systems, message queues, common data structures and algorithms, caching, and even applications of machine learning; a large system depends on many technologies working together. Crawling only obtains the data; the real value lies in analyzing and mining it, so the work naturally extends into data analysis, data mining, and other fields that support business decisions. In short, there is plenty for a crawler engineer to do.
So do you have to learn all of the above before writing your first crawler? Of course not; learning is a lifelong process. As long as you can write Python code, you can start crawling right away. It is like learning to drive: once you can operate the car, you can get on the road. Of course, writing code is much safer than driving.
To write a crawler in Python, you first need to know Python itself: the basic syntax, how to use functions and classes, and the common methods of built-in data structures such as list and dict. That is the basic entry requirement. Next you need to understand HTML. An HTML document is a tree structure, and a 30-minute introductory HTML tutorial on the Internet is enough. Then comes HTTP. A crawler is essentially a process of downloading data from a remote server via network requests, and the technology behind those requests is the HTTP protocol. For an entry-level crawler you only need to understand the basic principles of HTTP; the full specification would fill more than one book, and the deeper material can be read later, combining theory with practice.
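To make that concrete, here is a minimal sketch of the request/response exchange behind every crawl, using only Python's standard library; the target host is just a placeholder, not a site from the article.

```python
# A minimal look at the HTTP exchange a crawler performs, standard library only.
import http.client

conn = http.client.HTTPSConnection("example.com", timeout=10)  # placeholder host
conn.request("GET", "/", headers={"User-Agent": "learning-crawler/0.1"})
resp = conn.getresponse()

print(resp.status, resp.reason)        # e.g. 200 OK
print(resp.getheader("Content-Type"))  # e.g. text/html; charset=UTF-8
html = resp.read().decode("utf-8")     # the raw HTML document, as text
conn.close()
```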
A network request library is an implementation of the HTTP protocol. For example, the well-known Requests library simulates a browser sending HTTP requests. Once you understand HTTP, you can study the network-related modules, such as Python's built-in urllib, urllib2 (merged into urllib in Python 3), httplib, and Cookie handling. You can also skip these and learn Requests directly, provided you are familiar with the basics of HTTP. A book worth recommending here is "HTTP Illustrated". Most of the data you crawl will be HTML text, with some in XML or JSON format, and each type needs the right tool: JSON can be handled with Python's built-in json module; HTML can be parsed with libraries such as BeautifulSoup and lxml; and XML can be handled with third-party libraries such as untangle and xmltodict.
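As a rough illustration (not code from the article), the sketch below fetches a page with Requests and handles the HTML and JSON cases; the URLs are placeholders.

```python
# Sketch: fetching pages with Requests and handling common response formats.
import json

import requests
from bs4 import BeautifulSoup   # pip install requests beautifulsoup4

# HTML: parse the document tree and pull out a piece of it
resp = requests.get("https://example.com/", timeout=10)          # placeholder URL
soup = BeautifulSoup(resp.text, "html.parser")
print(soup.title.string if soup.title else "no <title> found")

# JSON: many APIs return it; decode the text into Python objects
api = requests.get("https://example.com/api/items", timeout=10)  # placeholder URL
data = json.loads(api.text)     # or simply api.json()

# XML: third-party helpers such as xmltodict turn it into plain dicts, e.g.
# import xmltodict; doc = xmltodict.parse(xml_string)
```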
For an entry-level crawler, regular expressions are not strictly necessary; you can learn them when you actually need them. For example, after you have crawled some data and need to clean it, you may find that plain string methods cannot handle it at all. That is the moment to look into regular expressions, which often give you twice the result for half the effort. Python's re module is used to work with them. A few recommended tutorials: the 30-minute introductory tutorial on regular expressions, the Python regular expression guide, and the complete guide to regular expressions.
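A small sketch of the kind of cleanup where re earns its keep; the sample string and patterns are invented for illustration.

```python
# Sketch: cleaning scraped text with the re module.
import re

raw = "Price:  ¥1,299.00   (in stock)  contact: shop@example.com"  # made-up sample

# Pull out the number, then strip the thousands separator
price = re.search(r"(\d[\d,]*\.?\d*)", raw).group(1).replace(",", "")
print(price)                                  # 1299.00

# Collapse runs of whitespace left over from the HTML
clean = re.sub(r"\s+", " ", raw).strip()

# Extract e-mail addresses from the text
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", raw)
```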
After the data has been cleaned, it needs to be stored persistently. You can use file storage, such as CSV files, or a database: SQLite for something simple, MySQL for something more professional, or the distributed document database MongoDB. All of these are very Python-friendly and have ready-made library support, so connecting to and operating, say, a MySQL database from Python is straightforward.
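Here is a minimal sketch of persistent storage using the built-in sqlite3 module; the file name and table layout are assumptions, and MySQL or MongoDB drivers follow the same pattern.

```python
# Sketch: persisting cleaned records with the built-in sqlite3 module.
import sqlite3

conn = sqlite3.connect("crawl.db")            # assumed file name
conn.execute(
    "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)"
)

records = [("https://example.com/a", "Page A"),   # placeholder rows
           ("https://example.com/b", "Page B")]
conn.executemany(
    "INSERT OR REPLACE INTO pages (url, title) VALUES (?, ?)", records
)
conn.commit()

for url, title in conn.execute("SELECT url, title FROM pages"):
    print(url, title)
conn.close()
```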
Once you can go from fetching data to cleaning it to storing it, you have covered the basics. What comes next is a test of your real skill. Many websites deploy anti-crawler strategies and try every means to stop you from obtaining data by automated means: all kinds of strange CAPTCHAs, limits on request frequency, IP bans, even encrypted data; in short, anything that raises the cost of getting the data. At this point you need broader knowledge: a deeper understanding of HTTP, common encryption and decryption algorithms, cookies, HTTP proxies, and the various HTTP headers. Crawlers and anti-crawler measures are locked in a constant arms race, and there is no single established solution; how you respond depends on your experience and the knowledge you have accumulated, which is not something a 21-day introductory tutorial can give you.
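As one possible illustration of the simpler countermeasures, the sketch below uses Requests with browser-like headers, a cookie-preserving session, throttling between requests, and an optional proxy; the proxy address and URLs are placeholders, not working endpoints.

```python
# Sketch: basic politeness and anti-blocking measures with Requests.
import time

import requests

session = requests.Session()                  # keeps cookies across requests
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "https://example.com/",        # placeholder referer
})

proxies = {"https": "http://127.0.0.1:8888"}  # placeholder proxy address

for url in ["https://example.com/page/1", "https://example.com/page/2"]:
    resp = session.get(url, proxies=proxies, timeout=10)
    print(url, resp.status_code)
    time.sleep(2)                             # throttle so you don't hammer the site
```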
For a large-scale crawl, we usually start from one URL, parse the links out of each page we fetch, and add them to the set of URLs waiting to be crawled. A queue or priority queue is used to decide which sites are crawled first and which later, and after each page you must choose whether the next link is picked in depth-first or breadth-first order. Every network request involves a DNS lookup (resolving the domain name to an IP address), so to avoid repeated lookups the resolved IPs should be cached. With so many URLs, how do you know which ones have already been crawled and which have not? The simple answer is a dictionary or set of crawled URLs, but with a huge number of URLs that structure occupies a lot of memory, and you should consider a Bloom filter instead. Crawling URLs one at a time in a single thread is pitifully inefficient; to improve throughput, use multiple threads, multiple processes, coroutines, or a distributed setup.
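The sketch below shows, in miniature, one way such a breadth-first frontier with de-duplication might look; the seed URL is a placeholder, and at real scale the in-memory set would give way to a Bloom filter or an external store, with the fetching parallelized.

```python
# Sketch: a breadth-first crawl frontier with simple URL de-duplication.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def extract_links(base_url, html):
    # Resolve every <a href> against the page's own URL
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]

def crawl(seed, max_pages=50):
    frontier = deque([seed])      # FIFO queue -> breadth-first order
    seen = {seed}                 # URLs already queued or crawled (Bloom filter at scale)
    while frontier and max_pages > 0:
        url = frontier.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue              # skip unreachable pages
        max_pages -= 1
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)

crawl("https://example.com/")     # placeholder seed URL
```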
There are plenty of crawler tutorials online, and the principles are basically the same; they just target different websites. You can follow them to practice things like simulating a login or an automatic check-in, or scraping movie and book data from Douban. Through continuous practice, going from hitting a problem to solving it, you gain something that reading alone cannot give you.