urllib2 is a module that ships with Python 2.7. There is nothing to download: just import it and use it.
Basic use of the urllib2 library
Web page crawling means reading the network resource specified by a URL from the network stream and saving it locally. Python has many libraries that can be used to crawl web pages; let's start with urllib2.
urllib2 official documentation: https://docs.python.org/2/library/urllib2.html
urllib2 source code: https://hg.python.org/cpython/file/2.7/Lib/urllib2.py
In Python 3.x, urllib2 was replaced by urllib.request (its exception classes moved to urllib.error).
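Because of that rename, a script meant to run on both versions can wrap the import. The following is only a minimal sketch using the standard library; the fallback pattern is one common approach, not something this article prescribes.

```python
# Import Request and urlopen under the same names on Python 2.7 and Python 3.
try:
    # Python 3: the functionality of urllib2 lives in urllib.request
    from urllib.request import Request, urlopen
except ImportError:
    # Python 2.7: fall back to the original urllib2 module
    from urllib2 import Request, urlopen

# Building a Request does not touch the network;
# nothing is sent until urlopen() is called with it.
request = Request("http://www.baidu.com")
print(request.get_full_url())
```

The examples in the rest of this article keep the Python 2.7 spelling, urllib2.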
urlopen
Let’s start with a piece of code:
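    # urllib2_urlopen.py

    # Import the urllib2 library
    import urllib2

    # Send a request to the given URL; urlopen() returns a file-like object
    # that wraps the server's response
    response = urllib2.urlopen("http://www.baidu.com")

    # The file-like object supports file methods such as read(),
    # which reads the entire content and returns it as a string
    html = response.read()

    # Print the string
    print html

Run the script and the result is printed: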
Power@PowerMac ~$: python urllib2_urlopen.py
In fact, if you open the Baidu homepage in a browser, right-click and select "View Source Code", you will find that it is exactly the same as what we just printed. In other words, the 4 lines of code above have downloaded all the code of Baidu's homepage for us.
The Python code for a basic URL request really is very simple.
Request
In our first example, the parameter of urlopen() is a URL string.
But if you need to do something more complex, such as adding HTTP headers, you must create a Request instance and pass it to urlopen(); the URL to be accessed is then given as a parameter of the Request instance.
Create urllib2_request.py:

    # urllib2_request.py

    import urllib2

    # Pass the URL to Request() to construct a Request object
    request = urllib2.Request("http://www.baidu.com")

    # Pass the Request object to urlopen() to send it to the server
    # and receive the response
    response = urllib2.urlopen(request)

    html = response.read()

    print html
The output is exactly the same as in the first example.
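When creating a Request instance, besides the required url parameter, you can set two more parameters:
data (empty by default): the data submitted along with the URL (for example, the data to POST). When it is set, the HTTP request changes from GET to POST.
headers (empty by default): a dictionary containing the key-value pairs of the HTTP headers to send.
Both parameters are discussed below.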
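The effect of the data parameter on the request method can be checked without sending anything. This is a minimal sketch: the form field "wd" and the placeholder User-Agent value are assumptions made up for illustration, and the import fallback is only there so the snippet also runs on Python 3, where data must be bytes.

```python
try:
    from urllib.request import Request   # Python 3
    from urllib.parse import urlencode
except ImportError:
    from urllib2 import Request          # Python 2.7
    from urllib import urlencode

url = "http://www.baidu.com"

# No data: the request would be sent as GET
get_request = Request(url)

# Hypothetical form field, URL-encoded and converted to bytes
form = urlencode({"wd": "python"}).encode("utf-8")

# With data: the request would be sent as POST
post_request = Request(url, data=form, headers={"User-Agent": "Mozilla/5.0"})

# get_method() reports the HTTP method that urlopen() would use
print(get_request.get_method())
print(post_request.get_method())
```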
User-Agent
However, sending a request to a website directly with urllib2 like this is a bit abrupt. Every house has a door, and barging straight in as a passerby is not very polite. Moreover, some sites do not like being visited by programs (non-human visitors) and may refuse your request.
If we identify ourselves with a legitimate identity when requesting other people's websites, they are far more likely to accept the request, so we should add an identity to our code: the User-Agent header.
The browser is a recognized and accepted identity on the Internet. If we want our crawler to look more like a real user, the first step is to present itself as a recognized browser. Different browsers send different User-Agent headers with their requests. The default User-Agent header of urllib2 is Python-urllib/x.y, where x and y are the Python major and minor version numbers (for example, Python-urllib/2.7).
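The default value described above can be inspected locally. A minimal sketch, assuming the default opener created by build_opener() carries the same default headers that urlopen() uses; the import fallback is only there so it also runs on Python 3.

```python
try:
    from urllib.request import build_opener   # Python 3
except ImportError:
    from urllib2 import build_opener          # Python 2.7

opener = build_opener()

# addheaders is a list of (name, value) tuples attached to every request
# sent through this opener
default_headers = dict(opener.addheaders)

# Prints Python-urllib/x.y for the interpreter running the script
print(default_headers["User-agent"])
```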
    # urllib2_useragent.py

    import urllib2

    url = "http://www.itcast.cn"

    # User-Agent of IE 9.0, stored in ua_header
    ua_header = {"User-Agent" : "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;"}

    # Build the Request from the URL together with the headers;
    # this request will carry the IE 9.0 User-Agent
    request = urllib2.Request(url, headers = ua_header)

    # Send the request to the server
    response = urllib2.urlopen(request)

    html = response.read()
    print html
Add more Header information
Adding specific headers to an HTTP request lets you construct a complete HTTP request message.
You can add/modify a specific header by calling Request.add_header() or view existing headers by calling Request.get_header().
Add a specific header
    # urllib2_headers.py

    import urllib2

    url = "http://www.itcast.cn"

    # User-Agent of IE 9.0
    header = {"User-Agent" : "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;"}
    request = urllib2.Request(url, headers = header)

    # A specific header can also be added or modified with Request.add_header()
    request.add_header("Connection", "keep-alive")

    # Header values can be read back with Request.get_header()
    # request.get_header(header_name="Connection")

    # Pass the Request object built above (the variable is named request)
    response = urllib2.urlopen(request)

    # The response status code
    print response.code
    html = response.read()

    print html
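The add/modify behaviour of add_header() can be checked without sending the request. A minimal sketch: the short User-Agent value is a placeholder, and the import fallback is only there so the snippet also runs on Python 3.

```python
try:
    from urllib.request import Request   # Python 3
except ImportError:
    from urllib2 import Request          # Python 2.7

# Placeholder User-Agent value, for illustration only
request = Request("http://www.itcast.cn", headers={"User-Agent": "Mozilla/5.0"})

# Add a new header
request.add_header("Connection", "keep-alive")

# Calling add_header() again with the same name replaces the old value
request.add_header("Connection", "close")

print(request.has_header("Connection"))
print(request.get_header("Connection"))

# header_items() lists every header that would be sent with the request
print(sorted(request.header_items()))
```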
Randomly add/modify User-Agent
    # urllib2_add_headers.py

    import urllib2
    import random

    url = "http://www.itcast.cn"

    ua_list = [
        "Mozilla/5.0 (Windows NT 6.1; ) Apple.... ",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0)... ",
        "Mozilla/5.0 (Macintosh; U; PPC Mac OS X.... ",
        "Mozilla/5.0 (Macintosh; Intel Mac OS... "
    ]

    # Pick one User-Agent at random
    user_agent = random.choice(ua_list)

    request = urllib2.Request(url)

    # A specific header can also be added or modified with Request.add_header()
    request.add_header("User-Agent", user_agent)

    # The header name is stored with the first letter capitalized
    # and all remaining letters lowercase
    request.get_header("User-agent")

    # Pass the Request object built above (the variable is named request)
    response = urllib2.urlopen(request)

    html = response.read()
    print html
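The capitalization rule noted in the comment above matters when reading a header back. A minimal sketch: the two "ExampleBrowser" strings are made-up placeholders, and the import fallback is only there so the snippet also runs on Python 3.

```python
import random

try:
    from urllib.request import Request   # Python 3
except ImportError:
    from urllib2 import Request          # Python 2.7

# Placeholder User-Agent strings, for illustration only
ua_list = ["ExampleBrowser/1.0", "ExampleBrowser/2.0"]
user_agent = random.choice(ua_list)

request = Request("http://www.itcast.cn")
request.add_header("User-Agent", user_agent)

# add_header() stores the name as "User-agent", so this lookup succeeds
stored = request.get_header("User-agent")

# get_header() does not normalize the name it is given,
# so the original spelling is not found and None is returned
missed = request.get_header("User-Agent")

print(stored)
print(missed)
```

In other words, write the name however you like when adding a header, but use the "User-agent" form when calling get_header().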