How to install the urllib2 library in Python

步履不停 (Original)
2019-07-02 13:11:28

urllib2 is a module that ships with Python 2.7, so there is nothing to install: just import it and use it.

Basic use of the urllib2 library

Web page crawling means reading the network resource that a URL points to from the network stream and saving it locally. Python has many libraries that can be used to crawl web pages. Let's start with urllib2.

urllib2 official documentation: https://docs.python.org/2/library/urllib2.html
urllib2 source code: https://hg.python.org/cpython/file/2.7/Lib/urllib2.py

In Python 3.x, urllib2 no longer exists under that name: its contents were split across urllib.request and urllib.error.
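For code that has to run under both interpreters, one common pattern is a guarded import. The sketch below is an illustration of that pattern, not part of the original scripts; it only assumes that the names Request, urlopen and URLError are the ones you need.

```python
# Use urllib2 on Python 2.7 and fall back to its Python 3 successors.
try:
    from urllib2 import Request, urlopen, URLError   # Python 2.7
except ImportError:
    from urllib.request import Request, urlopen      # Python 3.x
    from urllib.error import URLError
```

After this import, the rest of a script can call urlopen() and Request() the same way on either version.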

urlopen

Let’s start with a piece of code:

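# urllib2_urlopen.py

# Import the urllib2 library
import urllib2

# Send a request to the given URL; the server's response comes back as a file-like object
response = urllib2.urlopen("http://www.baidu.com")

# The file-like object supports file methods such as read(), which reads the whole content and returns it as a string
html = response.read()

# Print the string
print html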

Run the script and the result is printed:

Power@PowerMac ~$: python urllib2_urlopen.py

In fact, if you open the Baidu homepage in a browser, right-click and choose "View Source Code", you will find that it matches what we just printed. In other words, the 4 lines of code above have fetched all the code of Baidu's homepage for us.

The Python code for a basic URL request really is that simple.
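A real request can fail, for example when the host is unreachable. The following is a minimal sketch of wrapping urlopen() with a timeout and error handling; the fetch() helper and its 10-second default are assumptions made for illustration, and the guarded import is only there so the block runs on both Python 2.7 and 3.x.

```python
try:
    from urllib2 import urlopen, URLError      # Python 2.7
except ImportError:
    from urllib.request import urlopen         # Python 3.x
    from urllib.error import URLError


def fetch(url, timeout=10):
    """Return the response body, or None if urlopen() raises URLError."""
    try:
        response = urlopen(url, timeout=timeout)
    except URLError as error:
        # HTTPError is a subclass of URLError, so HTTP error statuses land here too
        print("request failed: %s" % error)
        return None
    try:
        return response.read()
    finally:
        response.close()
```

Calling fetch("http://www.baidu.com") would return the page source, or None if the request could not be completed.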

Request

In our first example, the argument to urlopen() was a URL.

To perform more complex operations, such as adding HTTP headers, you must create a Request instance and pass it to urlopen(); the URL to be accessed is then passed as an argument to the Request instance.

Edit urllib2_request.py:

# urllib2_request.py

import urllib2

# Pass the url to Request() to construct and return a Request object
request = urllib2.Request("http://www.baidu.com")

# Pass the Request object to urlopen() to send it to the server and receive the response
response = urllib2.urlopen(request)

html = response.read()

print html

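The result is exactly the same.

When creating a Request instance, besides the required url parameter, two more parameters can be set:
data (empty by default): the data submitted along with the url (for example, data to POST); when it is set, the HTTP request method changes from "GET" to "POST".
headers (empty by default): a dictionary of key-value pairs for the HTTP headers to send.
Both parameters are covered below.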
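The effect of data on the request method can be checked without sending anything, using Request.get_method(). This is a small sketch; the form field "wd" and its value are hypothetical, and the guarded import is only there so the block runs on both Python 2.7 and 3.x.

```python
try:
    from urllib2 import Request               # Python 2.7
    from urllib import urlencode
except ImportError:
    from urllib.request import Request        # Python 3.x
    from urllib.parse import urlencode

url = "http://www.baidu.com"

# Without data, the request uses GET
get_request = Request(url)

# Hypothetical form field, encoded as key=value and converted to bytes
form = urlencode({"wd": "python"}).encode("utf-8")

# With data, the same URL is requested with POST
post_request = Request(url, data=form)

print(get_request.get_method())    # GET
print(post_request.get_method())   # POST
```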

User-Agent

However, sending requests to a website directly with urllib2 like this is a bit abrupt. Every house has a door, and barging straight in as a passer-by is not very polite. Moreover, some sites do not like being visited by programs (non-human visitors) and may refuse the request.

If we present a legitimate identity when requesting someone else's site, the request is far more likely to be accepted, so we should give our code an identity: the User-Agent header.

Browsers are the recognized, permitted identity on the web. If we want our crawler to look more like a real user, the first step is to have it present itself as a recognized browser. Different browsers send different User-Agent headers. The default User-Agent header of urllib2 is Python-urllib/x.y (x and y are the Python major and minor version numbers, for example Python-urllib/2.7).
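To see the default value on your own interpreter, you can inspect the headers attached to a freshly built opener. This is a sketch for illustration only; build_opener() is used here just to read the defaults, and the guarded import is there so the block runs on both Python 2.7 and 3.x.

```python
try:
    from urllib2 import build_opener           # Python 2.7
except ImportError:
    from urllib.request import build_opener    # Python 3.x

# A freshly built opener carries the default headers, including User-agent
opener = build_opener()
default_headers = dict(opener.addheaders)

# Prints Python-urllib/x.y for the interpreter running the script
print(default_headers["User-agent"])
```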

# urllib2_useragent.py

import urllib2

url = "http://www.itcast.cn"

# The IE 9.0 User-Agent, stored in ua_header
ua_header = {"User-Agent" : "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;"}

# Build the Request from the url together with the headers; the request will carry the IE 9.0 User-Agent
request = urllib2.Request(url, headers = ua_header)

# Send the request to the server
response = urllib2.urlopen(request)

html = response.read()
print html

Adding more header information

Adding specific headers to an HTTP request lets you construct a complete HTTP request message.

You can add or modify a specific header by calling Request.add_header(), and view an existing header by calling Request.get_header().

Add a specific header

# urllib2_headers.py

import urllib2

url = "http://www.itcast.cn"

# The IE 9.0 User-Agent
header = {"User-Agent" : "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;"}
request = urllib2.Request(url, headers = header)

# A specific header can also be added or modified by calling Request.add_header()
request.add_header("Connection", "keep-alive")

# Header information can also be viewed by calling Request.get_header()
# request.get_header(header_name="Connection")

# Pass the Request object built above (the original passed an undefined name, req)
response = urllib2.urlopen(request)

print response.code     # the response status code
html = response.read()

print html
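The header methods can be tried out on a Request object before anything is sent. The sketch below is illustrative only: the "Referer" lookup and the "not set" default are assumptions chosen to show a missing header, and the guarded import is there so the block runs on both Python 2.7 and 3.x.

```python
try:
    from urllib2 import Request            # Python 2.7
except ImportError:
    from urllib.request import Request     # Python 3.x

request = Request("http://www.itcast.cn")

# Add a header, then read it back
request.add_header("Connection", "keep-alive")
print(request.get_header("Connection"))           # keep-alive

# Calling add_header() again with the same name replaces the value
request.add_header("Connection", "close")
print(request.get_header("Connection"))           # close

# has_header() reports whether a header is set;
# get_header() accepts a default for headers that are missing
print(request.has_header("Connection"))           # True
print(request.get_header("Referer", "not set"))   # not set
```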

Randomly add/modify User-Agent

# urllib2_add_headers.py

import urllib2
import random

url = "http://www.itcast.cn"

ua_list = [
    "Mozilla/5.0 (Windows NT 6.1; ) Apple.... ",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0)... ",
    "Mozilla/5.0 (Macintosh; U; PPC Mac OS X.... ",
    "Mozilla/5.0 (Macintosh; Intel Mac OS... "
]

# Pick one User-Agent at random
user_agent = random.choice(ua_list)

request = urllib2.Request(url)

# A specific header can also be added or modified by calling Request.add_header()
request.add_header("User-Agent", user_agent)

# Header names are stored with the first letter capitalized and the rest lowercase
request.get_header("User-agent")

# Pass the Request object built above (the original passed an undefined name, req)
response = urllib2.urlopen(request)

html = response.read()
print html
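The capitalization rule is easy to trip over, so here is a small offline sketch of it. The build_request() helper and the two "ExampleBrowser" strings are hypothetical placeholders, not real browser User-Agent values, and the guarded import is only there so the block runs on both Python 2.7 and 3.x.

```python
import random

try:
    from urllib2 import Request            # Python 2.7
except ImportError:
    from urllib.request import Request     # Python 3.x

# Hypothetical placeholder values standing in for real User-Agent strings
UA_LIST = ["ExampleBrowser/1.0", "ExampleBrowser/2.0"]


def build_request(url, ua_list):
    """Return a Request carrying a randomly chosen User-Agent."""
    request = Request(url)
    request.add_header("User-Agent", random.choice(ua_list))
    return request


request = build_request("http://www.itcast.cn", UA_LIST)

# add_header() stores the name as "User-agent", so that is the key to look up
stored = request.get_header("User-agent")

# Looking it up with the original spelling finds nothing
missing = request.get_header("User-Agent")

print(stored in UA_LIST)   # True
print(missing)             # None
```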
