In today's Internet era, almost any information we need can be found online, but obtaining it efficiently is often a challenge. One common approach is to collect it with a web crawler, and Python is one of the most popular languages for writing crawlers. In this article, we describe how to use Python to write a web crawler for Zhihu.

Zhihu is a well-known social question-and-answer website, and a crawler is a useful way to aggregate and summarize the information it hosts: questions, answers, user profiles, and so on. Here, we focus on how to obtain Zhihu user information.
First, we need two common Python crawler libraries: Requests and BeautifulSoup. Requests fetches the content of a web page, and BeautifulSoup parses that content so we can extract the information we need. Both libraries must be installed before use.
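If they are not yet installed, they can be added with pip (together with lxml, the parser used later in this article):

pip install requests beautifulsoup4 lxml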
Once installation is complete, we can fetch a Zhihu user's homepage with the Requests library, for example:
import requests

url = 'https://www.zhihu.com/people/zionyang/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}
response = requests.get(url, headers=headers)
html = response.text
In the code above, we use the Requests library's get method to fetch the homepage of the Zhihu user "zionyang". The headers parameter imitates a regular browser so that the request is less likely to be blocked by the site's anti-crawler mechanism.

After obtaining the page source, we can use BeautifulSoup to parse the HTML content, as shown in the following code:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
Here we set the parser to "lxml", after which we can use BeautifulSoup's powerful methods to navigate the HTML document. Below are some commonly used parsing methods.
# List the questions linked from the page
questions = soup.find_all('a', {'data-nav': 'question'})
for question in questions:
    print(question.text)

# Extract the user's display name
name = soup.find('span', {'class': 'ProfileHeader-name'}).text

# Extract the degree name from the first education entry
education = soup.select('li.ProfileEducationList-item')[0].select('div.ProfileEducationList-degreeName')[0].text
Using the methods above, we can extract various fields from a Zhihu user's profile. Note that when we access a user's homepage without logging in, we can only obtain the user's basic public information; private details such as gender are not available.
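Putting the steps together, here is a minimal sketch of a complete fetch-and-parse function. It only reads the display name; the 'ProfileHeader-name' class comes from the snippet above and depends on Zhihu's page structure at the time of writing, so treat the selector as an assumption that may need updating:

import requests
from bs4 import BeautifulSoup

def fetch_user_profile(user_url):
    # Minimal sketch; the CSS class below follows the example above and
    # may change whenever Zhihu updates its page structure.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
    }
    response = requests.get(user_url, headers=headers, timeout=10)
    response.raise_for_status()  # fail early on HTTP errors
    soup = BeautifulSoup(response.text, 'lxml')
    name_tag = soup.find('span', {'class': 'ProfileHeader-name'})
    return {'name': name_tag.text if name_tag else None}

profile = fetch_user_profile('https://www.zhihu.com/people/zionyang/')
print(profile)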
Besides the profile itself, we can also obtain data such as the user's followees, followers, and likes. A packet-capture tool such as Fiddler can reveal the URL that serves the data we need, which we can then request through the Requests library:
url = 'https://www.zhihu.com/people/zionyang/followers'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3',
    'Cookie': 'your_cookie'
}
response = requests.get(url, headers=headers)
# response.json() only works if the captured URL actually returns JSON;
# the page URL above stands in for the data endpoint found with Fiddler.
data = response.json()
Note that our own Cookie must be added to the headers parameter; otherwise we will not be able to obtain the required data.
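Once we have the JSON, we can iterate over the records. The following is a minimal sketch; the 'data', 'name', and 'follower_count' keys are assumed field names, so check them against what the packet capture actually returns:

# Minimal sketch of consuming the captured JSON.
# 'data', 'name' and 'follower_count' are assumed field names;
# verify them against the real response before relying on them.
for follower in data.get('data', []):
    print(follower.get('name'), follower.get('follower_count'))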
With the methods above, we can write our own web crawler in Python and collect large amounts of information. During crawling, remember to comply with the website's rules so as not to burden the site, and be careful to protect personal information. We hope this introduction is helpful to beginners.
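One simple courtesy measure is to pause between requests. A minimal sketch, with an arbitrary one-second delay and a hypothetical list of user homepages:

import time

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

# Hypothetical list of pages to crawl
user_urls = [
    'https://www.zhihu.com/people/zionyang/',
]

for user_url in user_urls:
    response = requests.get(user_url, headers=headers)
    # ... parse response.text with BeautifulSoup here ...
    time.sleep(1)  # wait between requests so we do not overload the site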