Q&A:
1. Why was the Encyclopedia of Embarrassing Things crawler unavailable for a period of time?
Answer: Some time ago, the Encyclopedia of Embarrassing Things added a check on the request header, which blocked the crawler. The code has to simulate a browser's header to get through. The code has since been updated and works normally again.
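As a rough illustration of simulating the header, here is a minimal sketch using Python 3's urllib.request. The User-Agent string, the helper names build_request and fetch_page, and the UTF-8 decoding are assumptions made for this example, not the original code.

```python
import urllib.request

# Assumed browser-like User-Agent; any realistic value should behave similarly.
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def build_request(url):
    """Return a Request object that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_page(url, timeout=10):
    """Download a page and decode it as UTF-8 (the encoding is an assumption)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Calling fetch_page("http://www.qiushibaike.com/hot/page/1") would then send the request with the simulated header instead of urllib's default one.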
2. Why do you need to create a separate thread?
Answer: The basic process is as follows: the crawler starts a new thread in the background and crawls two pages of the Encyclopedia of Embarrassing Things up front. Whenever fewer than two pages remain in stock, it crawls another page. When the user presses Enter, the program only reads the latest content from this stock instead of going online, so browsing is smoother. You could also do the loading in the main thread, but then the user would face long waits while each page is crawled.
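The background-thread idea can be sketched roughly as follows. This version uses a bounded queue.Queue as the stock, which is one possible way to do it and not necessarily how the original code works. The function name start_prefetcher and the fetch_page parameter (any callable that takes a page number and returns that page's content) are hypothetical.

```python
import queue
import threading


def start_prefetcher(fetch_page, stock_size=2):
    """Start a daemon thread that keeps up to `stock_size` pages in stock."""
    stock = queue.Queue(maxsize=stock_size)

    def worker():
        page = 1
        while True:
            content = fetch_page(page)
            # put() blocks while the stock is full, so the worker idles
            # until the reader has taken a page out.
            stock.put(content)
            page += 1

    # A daemon thread does not keep the program alive after the main thread exits.
    threading.Thread(target=worker, daemon=True).start()
    return stock


if __name__ == "__main__":
    # Stand-in fetch function so the sketch runs without network access.
    stock = start_prefetcher(lambda n: "content of page %d" % n)
    for _ in range(3):
        print(stock.get())  # returns at once when a page is already in stock
```

With a real download function in place of the stand-in, the main loop would call stock.get() each time the user presses Enter, and the network wait would mostly happen in the background.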
Project content:
A web crawler for Encyclopedia of Embarrassing Things written in Python.
Usage:
Create a new Bug.py file, copy the code into it, and double-click to run it.
Program functions:
Browse the Encyclopedia of Embarrassing Things in the command prompt.
Principle explanation:
First, open the hot page of the Encyclopedia of Embarrassing Things: http://www.qiushibaike.com/hot/page/1
As you can see, the number after page/ in the link is the page number. Keep this in mind for writing the crawler later.
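In code, the page number can simply be appended to the fixed part of the link. A small sketch, where the names BASE_URL and page_url are made up for this example:

```python
# Fixed part of the link, taken from the URL shown above.
BASE_URL = "http://www.qiushibaike.com/hot/page/"


def page_url(page):
    """Build the link for a given page number."""
    return BASE_URL + str(page)


print(page_url(1))  # http://www.qiushibaike.com/hot/page/1
```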
Then, right-click to view the page source code:
Looking at the source, each story is wrapped in a div whose class is content and whose title is the posting time, so we only need a regular expression to pick it out.
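A minimal sketch of such a regular expression is shown below. It assumes the markup looks like <div class="content" title="..."> followed by the text and a closing </div>; the real page may differ, and the sample HTML, its timestamps, and the names ITEM_RE and extract_items are invented for illustration.

```python
import re

# Assumed markup: <div class="content" title="POSTING_TIME">TEXT</div>
# re.S lets "." match newlines, since a story may span several lines.
ITEM_RE = re.compile(r'<div\s+class="content"\s+title="(.*?)">(.*?)</div>', re.S)


def extract_items(html):
    """Return (posting_time, text) pairs for every matching div."""
    return [(when, text.strip()) for when, text in ITEM_RE.findall(html)]


# Made-up sample in the assumed format, not real site data.
sample = '''
<div class="content" title="2013-05-15 12:00:00">
First story
</div>
<div class="author">someone</div>
<div class="content" title="2013-05-15 12:05:00">Second story</div>
'''

for when, text in extract_items(sample):
    print(when, text)
```

The non-greedy (.*?) groups stop at the first closing quote and the first </div>, so each match stays within a single story.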
Once the principle is clear, the rest comes down to writing the regular expression. You can refer to this blog post:
http://blog.csdn.net/wxg694175346/article/details/8929576