If you want to learn web crawling quickly, the most worthwhile language to learn is Python. Python covers many application scenarios, such as rapid web development, crawlers, and automated operations and maintenance. With it you can build a simple website, an automatic posting script, an email sending and receiving script, or a simple verification-code recognition script.
Much of crawler development also involves reusable patterns. Today I will summarize 8 essential skills that can save you time and effort later and help you finish tasks efficiently.
GET method
import urllib2

url = "http://www.baidu.com"
response = urllib2.urlopen(url)
print response.read()
POST method
import urllib
import urllib2

url = "http://abcde.com"
form = {'name': 'abc', 'password': '1234'}
form_data = urllib.urlencode(form)
request = urllib2.Request(url, form_data)
response = urllib2.urlopen(request)
print response.read()
When developing crawlers, we often run into situations where our IP gets blocked. In that case we need to use a proxy IP. The urllib2 package provides a ProxyHandler class, which lets us set up a proxy for accessing web pages, as shown in the following code snippet:
import urllib2

proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8087'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.baidu.com')
print response.read()
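If a single proxy is not enough (for example, it gets blocked as well), a common extension is to pick a proxy at random from a pool on each request. Below is a minimal sketch of that idea; the proxy addresses are placeholders and would in practice come from a list you maintain or a proxy service:

import random
import urllib2

# Placeholder proxy pool; replace with real, working proxy addresses
proxies = ['127.0.0.1:8087', '127.0.0.1:8088', '127.0.0.1:8089']

# Build an opener around a randomly chosen proxy for this request
proxy = urllib2.ProxyHandler({'http': random.choice(proxies)})
opener = urllib2.build_opener(proxy)
response = opener.open('http://www.baidu.com')
print response.read()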
Cookies are data (usually encrypted) that some websites store on the user's local machine in order to identify the user and track sessions. Python provides the cookielib module for handling cookies. Its main job is to provide objects that can store cookies, so that it can be used together with the urllib2 module to access Internet resources.
Code snippet:
import urllib2
import cookielib

cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()
The key is CookieJar(), which manages HTTP cookie values, stores the cookies generated by HTTP requests, and adds cookies to outgoing HTTP requests. The cookies are kept entirely in memory, so they are lost once the CookieJar instance is garbage-collected; none of this needs to be managed manually.
Add cookies manually:
cookie = "PHPSESSID=91rurfqm2329bopnosfu4fvmu7; kmsign=55d2c12c9b1e3; KMUID=b6Ejc1XSwPq9o756AxnBAg=" request.add_header("Cookie", cookie)
Some websites dislike crawler visits and reject all requests from crawlers. As a result, HTTP Error 403: Forbidden often occurs when using urllib2 to access such sites directly.
Pay special attention to certain request headers, such as User-Agent, because the server will check them. This can be handled by modifying the headers on the HTTP request, as in the following code snippet:
import urllib2

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
request = urllib2.Request(
    url = 'http://my.oschina.net/jhao104/blog?catalog=3463517',
    headers = headers
)
print urllib2.urlopen(request).read()
The most powerful tool for page parsing is, of course, regular expressions. The patterns differ from site to site and user to user, so there is no need to explain them in detail here.
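As a rough illustration only, extracting links with the re module might look like the sketch below. The pattern is deliberately naive and would need tuning for any real page:

import re
import urllib2

html = urllib2.urlopen('http://www.baidu.com').read()

# Naive pattern: grabs href values from anchor tags; real pages often need
# more careful patterns or a proper parser
links = re.findall(r'<a[^>]+href="([^"]+)"', html)
for link in links:
    print link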
Next come the parsing libraries; the two most commonly used are lxml and BeautifulSoup.
My evaluation of these two libraries: both are HTML/XML processing libraries. BeautifulSoup is implemented in pure Python, so it is slower, but it has practical features, for example, getting the source code of an HTML node from a search result. lxml is implemented in C, is efficient, and supports XPath.
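A minimal sketch of both approaches, assuming the third-party packages (bs4 and lxml) are installed; adjust the imports to whatever versions you actually have:

import urllib2
from bs4 import BeautifulSoup        # assumes bs4 is installed
from lxml import html as lxml_html   # assumes lxml is installed

page = urllib2.urlopen('http://www.baidu.com').read()

# BeautifulSoup: pure Python, convenient search API
soup = BeautifulSoup(page, 'html.parser')
print soup.title.string

# lxml: C-backed, fast, supports XPath
tree = lxml_html.fromstring(page)
print tree.xpath('//title/text()')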
For some simple verification codes, basic recognition is possible; I have only ever done simple verification-code recognition myself. Some nearly inhuman verification codes, such as those on 12306, can be handled through manual entry on a coding platform, which of course costs money.
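For the simple-recognition case, one common approach is OCR. Below is a minimal sketch, assuming the pytesseract package and the Tesseract engine are installed and the verification code is a clean, undistorted image; the file name is a placeholder:

from PIL import Image       # assumes Pillow is installed
import pytesseract          # assumes pytesseract + Tesseract are installed

# Works only for clean captchas; real ones usually need preprocessing
# (grayscale, thresholding, noise removal) before OCR gives anything usable
img = Image.open('captcha.png').convert('L')  # convert to grayscale
text = pytesseract.image_to_string(img)
print text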
Have you ever encountered web pages that stay garbled no matter how you transcode them? Haha, that means you do not know that many web services can send compressed data, which can reduce the amount of data transmitted over the network by more than 60%. This is especially true for XML web services, since XML data compresses extremely well.
But generally the server will not send compressed data unless you tell it that you can handle compressed data.
So you need to modify the code like this:
import urllib2

request = urllib2.Request('http://xxxx.com')
request.add_header('Accept-encoding', 'gzip')
opener = urllib2.build_opener()
f = opener.open(request)
Then it’s time to decompress the data:
import StringIO
import gzip

compresseddata = f.read()
compressedstream = StringIO.StringIO(compresseddata)
gzipper = gzip.GzipFile(fileobj=compressedstream)
print gzipper.read()
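One detail worth noting: not every server will actually honor the Accept-encoding header, so it is safer to check the Content-Encoding of the response before decompressing. A minimal sketch along those lines, reusing the placeholder URL from above:

import StringIO
import gzip
import urllib2

request = urllib2.Request('http://xxxx.com')
request.add_header('Accept-encoding', 'gzip')
f = urllib2.urlopen(request)

data = f.read()
# Only decompress if the server actually sent gzip-encoded data
if f.info().get('Content-Encoding') == 'gzip':
    data = gzip.GzipFile(fileobj=StringIO.StringIO(data)).read()
print data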
When a single thread is too slow, multi-threading is needed. Here is a simple thread-pool template; the program just prints 1-10, but you can see that it runs concurrently.
Although Python's multi-threading is often dismissed as being of limited use (the GIL prevents true CPU-bound parallelism), for crawlers that spend most of their time waiting on the network it can still improve efficiency to a noticeable extent.
from threading import Thread
from Queue import Queue
from time import sleep

# q is the task queue
# NUM is the total number of concurrent threads
# JOBS is the number of tasks
q = Queue()
NUM = 2
JOBS = 10

# The actual handler, responsible for processing a single task
def do_something_using(arguments):
    print arguments

# Worker: keeps pulling tasks from the queue and processing them
def working():
    while True:
        arguments = q.get()
        do_something_using(arguments)
        sleep(1)
        q.task_done()

# Start NUM worker threads waiting on the queue
for i in range(NUM):
    t = Thread(target=working)
    t.setDaemon(True)
    t.start()

# Put JOBS tasks into the queue
for i in range(JOBS):
    q.put(i)

# Wait for all JOBS to finish
q.join()