A web crawler, also known as a web robot and, in the FOAF community, more often called a web chaser, is a program or script that automatically fetches information from the World Wide Web according to a set of rules. Crawlers are used mainly by search engines: a crawler reads the content and links of a website, builds a full-text index of them in a database, and then moves on to another website. A traditional crawler starts from the URLs of one or more initial pages, extracts the URLs found on those pages, and then keeps extracting new URLs from each page it visits and adding them to a queue until the system's stopping conditions are met.
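The queue-driven loop described above can be sketched in a few lines of Python. This is only a minimal illustration built on the standard library, not a production crawler. The names crawl, LinkExtractor, fetch and max_pages, and the example.test pages, are hypothetical and chosen for this sketch. The fetch argument is any function that returns a page's HTML, so the demo reads from an in-memory dictionary instead of the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl starting from seed_urls.

    fetch is any callable that takes a URL and returns HTML text,
    or None if the page cannot be retrieved (an assumption of this sketch).
    Returns the URLs that were visited, in visit order.
    """
    queue = deque(seed_urls)   # URLs waiting to be visited
    seen = set(seed_urls)      # URLs already queued, to avoid repeats
    visited = []

    # Stopping conditions: the queue is empty or the page budget is used up.
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment part.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# A tiny in-memory "website" so the sketch runs without network access.
PAGES = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/c#top">C</a>',
    "http://example.test/b": '<a href="/">home</a>',
    "http://example.test/c": "no links here",
}

order = crawl(["http://example.test/"], PAGES.get)
print(order)
```

To crawl real pages, fetch could be replaced by a function that downloads the URL, for example with the requests library. A real crawler would also need to respect robots.txt and limit its request rate, which this sketch leaves out.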
Preparation before studying
1. A love of learning and an unyielding heart
2. A computer with a keyboard (any operating system will do; I use OS X, so the examples are based on it)
3. Some basic knowledge of HTML. You don't need to be proficient; a little understanding is enough.
4. Basic knowledge of Python syntax
The specific learning route
The route is broadly divided into three major areas:
1. Simple targeted script crawlers (requests, bs4, re)
2. Large-scale framework crawlers (mainly the Scrapy framework)
3. Browser-simulation crawlers (Mechanize and Selenium)
Specific steps:
1. Installation and use of Beautiful Soup
The requests library; setting up the Beautiful Soup crawler environment; Beautiful Soup parsers; using regular expressions with the re library. bs4 crawler practice: fetching content from Baidu Tieba, Shuangseqiu lottery winning numbers, Qidian novel information, movie information, and the YinYueTai list.
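As a rough idea of how requests, bs4 and re fit together in this first stage, here is a minimal sketch. The HTML snippet, the class names threads and replies, and the function names fetch and parse_threads are made up for illustration and do not correspond to any real site. The parsing part runs offline on the sample string; fetch shows how a page would be downloaded but is not called here.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical page fragment standing in for a forum thread list.
SAMPLE_HTML = """
<html><body>
  <ul class="threads">
    <li><a href="/p/1001">First post</a> <span class="replies">12 replies</span></li>
    <li><a href="/p/1002">Second post</a> <span class="replies">3 replies</span></li>
  </ul>
</body></html>
"""


def fetch(url):
    """Download a page with requests (not called in the offline demo below)."""
    import requests  # imported here so the parsing demo needs only bs4

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def parse_threads(html):
    """Extract title, link and reply count for each thread in the list."""
    soup = BeautifulSoup(html, "html.parser")
    threads = []
    for item in soup.find("ul", class_="threads").find_all("li"):
        link = item.find("a")
        replies_text = item.find("span", class_="replies").get_text()
        # bs4 locates the element; re pulls the number out of its text.
        match = re.search(r"\d+", replies_text)
        threads.append({
            "title": link.get_text(strip=True),
            "href": link["href"],
            "replies": int(match.group()) if match else 0,
        })
    return threads


threads = parse_threads(SAMPLE_HTML)
for thread in threads:
    print(thread)
```

The split is the usual one: requests fetches the page, Beautiful Soup navigates the document structure, and regular expressions handle the small pieces of text that have no tag of their own.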
2. The Scrapy crawler framework
Installing Scrapy; XPath and CSS selectors in Scrapy. Scrapy crawler practice: today's film and television listings, weather forecasts, fetching proxies, and Qiushibaike. Crawler-related offense and defense (proxy pools).
3. Browser simulation crawler
Installing and using the Mechanize module; using Mechanize to fetch YinYueTai announcements. Installing and using the Selenium module; choosing PhantomJS as the browser. Selenium & PhantomJS practice: fetching proxies and a comic crawler.