I gave a brief introduction to urllib2 earlier; here are some details of how to use it.
1. Proxy settings
By default, urllib2 uses the environment variable http_proxy to set its HTTP proxy.
If you want to control the proxy explicitly in your program, unaffected by environment variables, you can use a ProxyHandler.
Create a new file, test14.py, to implement a simple proxy demo:
import urllib2

enable_proxy = True
proxy_handler = urllib2.ProxyHandler({"http": 'http://some-proxy.com:8080'})
null_proxy_handler = urllib2.ProxyHandler({})

if enable_proxy:
    opener = urllib2.build_opener(proxy_handler)
else:
    opener = urllib2.build_opener(null_proxy_handler)

urllib2.install_opener(opener)
One detail to note here is that urllib2.install_opener() sets the global opener of urllib2.
This is convenient for later use, but it does not allow finer-grained control, for example when you want to use two different proxy settings within the same program.
A better approach is not to change the global settings with install_opener, but simply to call the opener's own open() method instead of the global urlopen().
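For instance, here is a minimal sketch of that approach, assuming two placeholder proxy addresses; each opener keeps its own proxy settings and the global urlopen() is left untouched:

import urllib2

# Two independent openers, each with its own proxy (the addresses are placeholders)
opener_a = urllib2.build_opener(urllib2.ProxyHandler({"http": 'http://proxy-a.example.com:8080'}))
opener_b = urllib2.build_opener(urllib2.ProxyHandler({"http": 'http://proxy-b.example.com:8080'}))

# Call open() on each opener directly; no global state is modified
response_a = opener_a.open('http://www.example.com')
response_b = opener_b.open('http://www.example.com')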
2. Timeout setting
In old versions of Python (before Python 2.6), the urllib2 API did not expose a timeout setting. To set a timeout value, you could only change the global timeout of the socket module.
import urllib2
import socket

socket.setdefaulttimeout(10)  # time out after 10 seconds
urllib2.socket.setdefaulttimeout(10)  # another way to do the same
After Python 2.6, the timeout can be set directly through the timeout parameter of urllib2.urlopen().
import urllib2

response = urllib2.urlopen('http://www.google.com', timeout=10)
3. Add a specific Header to the HTTP Request
To add a header, you need to use the Request object:
import urllib2

request = urllib2.Request('http://www.baidu.com/')
request.add_header('User-Agent', 'fake-client')
response = urllib2.urlopen(request)
print response.read()
Pay special attention to the following headers, which the server may check:
User-Agent: some servers or proxies use this value to determine whether the request was made by a browser
Content-Type: when using a REST interface, the server checks this value to determine how the content of the HTTP body should be parsed. Common values are:
application/xml: used when calling XML-RPC, such as RESTful/SOAP services
application/json: used when calling JSON-RPC
application/x-www-form-urlencoded: used when the browser submits a web form
When using a RESTful or SOAP service provided by the server, a wrong Content-Type setting will cause the server to refuse the request.
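As an illustration, here is a minimal sketch of a JSON POST with an explicit Content-Type; the endpoint URL and payload are made-up placeholders:

import json
import urllib2

# Hypothetical endpoint and payload, for illustration only
data = json.dumps({'key': 'value'})
request = urllib2.Request('http://api.example.com/resource', data=data)
request.add_header('Content-Type', 'application/json')
response = urllib2.urlopen(request)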
4. Redirect
By default, urllib2 automatically follows redirects for HTTP 3xx return codes, without any manual configuration. To detect whether a redirect has occurred, just check whether the URL of the response matches the URL of the request.
import urllib2

my_url = 'http://www.google.cn'
response = urllib2.urlopen(my_url)
redirected = response.geturl() != my_url  # True if a redirect occurred
print redirected

my_url = 'http://rrurl.cn/b1UZuP'
response = urllib2.urlopen(my_url)
redirected = response.geturl() != my_url
print redirected
If you don't want redirects to be followed automatically, then in addition to using the lower-level httplib library, you can also customize the HTTPRedirectHandler class.
import urllib2

class RedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_301(self, req, fp, code, msg, headers):
        print "301"
    def http_error_302(self, req, fp, code, msg, headers):
        print "302"

opener = urllib2.build_opener(RedirectHandler)
opener.open('http://rrurl.cn/b1UZuP')
5. Cookie
urllib2 also handles cookies automatically. If you need the value of a particular cookie item, you can do this:
import urllib2
import cookielib

cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open('http://www.baidu.com')
for item in cookie:
    print 'Name = ' + item.name
    print 'Value = ' + item.value
After running, the values of the cookies set by the visit to Baidu will be printed.
6. Use HTTP PUT and DELETE methods
urllib2 only supports the HTTP GET and POST methods. To use HTTP PUT and DELETE, you would normally have to fall back to the lower-level httplib library. Even so, we can still make urllib2 issue a PUT or DELETE request in the following way:
import urllib2

# uri and data are assumed to be defined beforehand
request = urllib2.Request(uri, data=data)
request.get_method = lambda: 'PUT'  # or 'DELETE'
response = urllib2.urlopen(request)
7. Get the HTTP return code
For 200 OK, the HTTP return code can be obtained with the getcode() method of the response object returned by urlopen. For other return codes, however, urlopen throws an exception, and you then need to check the code attribute of the exception object:
import urllib2

try:
    response = urllib2.urlopen('http://bbs.csdn.net/why')
except urllib2.HTTPError, e:
    print e.code
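For the successful case, a minimal sketch using getcode() (any reachable URL will do):

import urllib2

response = urllib2.urlopen('http://www.baidu.com')
print response.getcode()  # prints 200 on success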
8. Debug Log
When using urllib2, you can turn on the debug log as follows, so that the contents of the packets sent and received are printed to the screen for convenient debugging. Sometimes this saves you the work of capturing packets:
import urllib2

httpHandler = urllib2.HTTPHandler(debuglevel=1)
httpsHandler = urllib2.HTTPSHandler(debuglevel=1)
opener = urllib2.build_opener(httpHandler, httpsHandler)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.google.com')
This way you can see the contents of the transmitted packets.
9. Form processing
You have to fill in a form when logging in. How do you fill in the form?
First, use a tool to capture the content of the form to be submitted.
For example, I usually use Firefox with the HttpFox plug-in to see what packets I have sent.
Taking VeryCD as an example, first find the POST request you sent and the POST form fields.
You can see that for VeryCD you need to fill in username, password, continueURI, fk and login_submit. Of these, fk is randomly generated (actually not that random; it looks like a simple encoding of the epoch time) and has to be obtained from the web page, which means you must first visit the page and use tools such as regular expressions to extract the fk field from the returned data (a sketch of this follows the code below). As its name suggests, continueURI can be anything, while login_submit is fixed, as can be seen from the page source. Then there are username and password, which are self-explanatory:
# -*- coding: utf-8 -*-
import urllib
import urllib2

postdata = urllib.urlencode({
    'username': '汪小光',
    'password': 'why888',
    'continueURI': 'http://www.verycd.com/',
    'fk': '',
    'login_submit': '登录'
})
req = urllib2.Request(
    url = 'http://secure.verycd.com/signin',
    data = postdata
)
result = urllib2.urlopen(req)
print result.read()
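As a hedged sketch of obtaining fk first: the regular expression below is a hypothetical guess at the page markup, since the actual HTML is not shown here, and would need adjusting against the real page:

import re
import urllib2

# Fetch the login page and look for an fk form field; the pattern is a guess
page = urllib2.urlopen('http://secure.verycd.com/signin').read()
match = re.search(r'name="fk"\s+value="([^"]*)"', page)
fk = match.group(1) if match else ''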
10. Disguise as a browser
Some websites dislike visits from crawlers, so they reject all requests from crawlers.
In that case we need to pretend to be a browser, which can be achieved by modifying a header in the HTTP packet:
#...
headers = {
    'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
req = urllib2.Request(
    url = 'http://secure.verycd.com/signin/*/http://www.verycd.com/',
    data = postdata,
    headers = headers
)
#...
11. Dealing with "anti-hotlinking"
Some sites have so-called anti-hotlinking settings. Put bluntly, it is actually very simple:
the server checks whether the Referer in the headers of the request you send points to its own site.
So we only need to change the Referer in the headers to that website. Take cnbeta as an example:
#...
headers = {
    'Referer':'http://www.cnbeta.com/articles'
}
#...
headers is a dict, so you can put in any header you want to create a disguise.
For example, some websites like to read X-Forwarded-For from the headers to learn the client's real IP; you can simply change X-Forwarded-For.
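For instance, a minimal sketch along the same lines (the IP address is a made-up placeholder):

#...
headers = {
    'X-Forwarded-For': '8.8.8.8'  # spoofed client IP, placeholder for illustration
}
#...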