We have been sharing Linux-related material for a long time, so some readers may assume that Linux operations is all we cover. That is not the case. We also write software day to day, and when a problem we run into during development seems worth summarizing, we share that too.
Recently I have been writing a program that accesses network resources on a schedule, which means using Python to make network requests, so today's post is a brief summary of that topic.
To access resources on the network, some readers may use urllib.request, which is part of the Python standard library. We use requests instead, a third-party library built on urllib3 that offers a more convenient API. If you are using it for the first time, you need to install it. We use pip:
pip install --user requests
After a successful installation you can use it right away. Import it at the top of your Python file:
import requests
For demonstration purposes, we use the requests module to query the Python projects with the most stars on GitHub. The address is:
//m.sbmmt.com/link/62d90d223cf3e2239113a4963b191d71
To get an overall picture, first open this address in a browser and look at the content. It is text in JSON format.
Then we create a new test-resp.py file and enter the following code:
import requests
url='//m.sbmmt.com/link/62d90d223cf3e2239113a4963b191d71'
get_resp=requests.get(url)
The first line of the code above imports the requests module, the second stores the address to be accessed in the url variable, and the third calls the get function of the requests module and stores the response in the variable get_resp. The response is an object that carries the content and status of the requested resource. To print the HTTP status of the response, read its status_code attribute:
print(get_resp.status_code)
The result is as follows:
$ python test-resp.py
200
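Checking status_code by hand works for a quick test, but a program that runs on a schedule usually wants failures handled in one place. The following is a minimal sketch, assuming a hypothetical helper named fetch and a 10-second timeout chosen only for illustration. The timeout argument and raise_for_status() are part of requests.

```python
import requests


def fetch(url, timeout=10):
    """Return the response for url, or None if the request failed.

    fetch is a hypothetical helper name; the timeout value is an assumption.
    """
    try:
        resp = requests.get(url, timeout=timeout)
        # Raises requests.exceptions.HTTPError for 4xx and 5xx statuses.
        resp.raise_for_status()
    except requests.exceptions.RequestException as exc:
        # Covers connection errors, timeouts, and the HTTPError raised above.
        print("Request failed:", exc)
        return None
    return resp
```

With this helper, a caller only needs to test whether the result is None before reading status_code or text.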
Of course, you can also print the text of the response with get_resp.text, but that text is unformatted and hard to read, as shown in the figure below:
The figure shows only part of the result because the full output is too long. Parsing this text by hand would be tedious, but there is no need to. A quick look shows that the content is JSON, and the response object can decode it for us with its json() method, which returns ordinary Python objects. We can print the top-level keys of the result as follows:
print(get_resp.json().keys())
The results are as follows:
$ python test-resp.py
dict_keys(['total_count', 'incomplete_results', 'items'])
As these results show, we can treat the decoded response as a Python dictionary. For example, the first key, total_count, holds the total number of Python repositories. We can print this value as follows:
response_dict=get_resp.json()
print("Total repositories:", response_dict['total_count'])
The running result is as follows:
$ python test-resp.py
Total repositories: 9128125
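The items key can be walked in the same way as total_count is read. The sketch below assumes that each entry in items is a dictionary with name and stargazers_count fields, as in GitHub's repository search results. The helper name top_repositories and the sample data are made up for illustration and are not real results.

```python
def top_repositories(response_dict, limit=3):
    """Return (name, stars) pairs for the first `limit` entries of 'items'."""
    pairs = []
    for repo in response_dict["items"][:limit]:
        # 'name' and 'stargazers_count' are assumed field names.
        pairs.append((repo["name"], repo["stargazers_count"]))
    return pairs


# Hand-written sample shaped like the decoded response (not real data).
sample = {
    "total_count": 2,
    "incomplete_results": False,
    "items": [
        {"name": "repo-a", "stargazers_count": 120},
        {"name": "repo-b", "stargazers_count": 45},
    ],
}

for name, stars in top_repositories(sample):
    print(name, stars)
```

In a real script, the dictionary returned by get_resp.json() would be passed in place of sample.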
If the resource you read is ordinary HTML, you can use the third-party library BeautifulSoup, which handles HTML parsing well. We introduced BeautifulSoup in a previous article: Using Python's Beautiful Soup library to analyze web pages
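As a rough illustration of what that looks like, the sketch below parses a small hand-written HTML string instead of a real response, so it runs without network access. It assumes the beautifulsoup4 package is installed and uses Python's built-in html.parser backend.

```python
from bs4 import BeautifulSoup

# Hand-written HTML standing in for the text of a response (resp.text).
html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <a href="/first">First</a>
    <a href="/second">Second</a>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Text of the <title> element.
title = soup.title.string

# The href attribute of every <a> element, in document order.
links = [a.get("href") for a in soup.find_all("a")]

print(title)
print(links)
```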
Some network resources restrict how they respond to requests, for example by blocking robots (programs) or by requiring a login (a user session). To handle this, you can add headers to the request that simulate a browser, carry user session information (a token), and so on, as shown below:
headers={
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36',
    'Authorization':'41d15146-c3f3-4c0b-b48b-b5210151a9df'
}
get_resp=requests.get(url,headers=headers,params=None)
In the code above, User-Agent in the headers dictionary is the simulated browser information, and Authorization is the request token. You can also add other request headers as needed, as shown below:
header={
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.235'
}
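When the same headers are needed on every request, one option is to set them once on a requests.Session instead of passing headers= each time. The sketch below builds a request without sending it, so the headers can be inspected offline. The address https://example.com/api and the token value are placeholders, not values from this article.

```python
import requests

session = requests.Session()

# Headers set here are merged into every request made through the session.
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Authorization": "YOUR-TOKEN-HERE",  # placeholder, not a real token
})

# Prepare the request without sending it, to see what would go on the wire.
# https://example.com/api is a hypothetical address.
request = requests.Request("GET", "https://example.com/api")
prepared = session.prepare_request(request)

print(prepared.headers["User-Agent"])
print(prepared.headers["Authorization"])

# To actually send it: resp = session.get("https://example.com/api", timeout=10)
```

A session also reuses the underlying connection between requests to the same host, which helps a program that polls one server repeatedly.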
The examples above all handle GET requests. POST requests work the same way, using the post function of requests:
post_resp=requests.post(url,headers=headers,data=None,json=None)
It is used in the same way as get. The data and json arguments carry the request body, as form fields or as JSON respectively.
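The difference between data and json is easiest to see by preparing two POST requests without sending them. This is a sketch under assumptions: the address https://example.com/api and the payload are hypothetical and serve only to show how each argument is encoded.

```python
import requests

url = "https://example.com/api"  # hypothetical address
payload = {"name": "demo"}       # hypothetical payload

# data= with a dict is sent as a URL-encoded form body.
form_req = requests.Request("POST", url, data=payload).prepare()

# json= is serialized to JSON, and the Content-Type header is set to match.
json_req = requests.Request("POST", url, json=payload).prepare()

print(form_req.headers["Content-Type"], form_req.body)
print(json_req.headers["Content-Type"], json_req.body)
```

To send either request for real, pass the same arguments to requests.post, for example requests.post(url, json=payload, timeout=10).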
That is everything we wanted to share this time. Discussion is welcome.