In the Internet era, information matters more than ever: data has become one of the cornerstones of value, and web crawlers are among the most important tools for obtaining and processing it. Python has become the language of choice for many crawler developers because it is concise, easy to learn, and efficient. In this article, we work through a practical case: crawling data from the Kuwo Music website with Python, then analyzing and processing that data.
Kuwo Music is one of the better-known music services in China, with a large catalog of music and a large user base. Taking the Kuwo Music website as an example, we introduce the concrete steps of crawling its data.
1. Page analysis
Before crawling, we first need to analyze the target site's page structure and how its data is stored. Opening the Kuwo Music website, you can see an obvious relationship between page URLs and music IDs: appending "/play_detail/" followed by a music ID to the site address opens the detail page for the corresponding song.
Opening the detail page of a song, we find plenty of valuable data: song title, artist, album, duration, play count, comment count, and so on. This information is stored in the HTML as ordinary page tags. Looking at the page source, most of the relevant information sits inside tags with the classes "songinfo" and "detailed_info clearfix".
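The crawler below takes a list of music IDs as its input, but the article does not show where those IDs come from. As a minimal sketch, assuming (without verifying) that Kuwo's list or ranking pages contain links of the form /play_detail/<id>, the IDs could be collected like this:

import re
import requests

def get_music_ids(list_url):
    # Hypothetical helper: scrape music IDs from a list/ranking page.
    # The URL pattern '/play_detail/<id>' is an assumption; adjust it
    # to whatever the actual page markup uses.
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) '
               'AppleWebKit/537.36 (KHTML, like Gecko) '
               'Chrome/58.0.3029.110 Safari/537.3'}
    response = requests.get(list_url, headers=headers)
    response.encoding = response.apparent_encoding
    # Pull every numeric ID that appears in a /play_detail/ link
    ids = re.findall(r'/play_detail/(\d+)', response.text)
    # Deduplicate while preserving order
    return list(dict.fromkeys(ids))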
2. Crawler implementation
The core of the crawler is fetching the data. We implement data crawling and data saving as two separate steps.
1. Data crawling
We define a function that receives a list of music IDs, visits the detail page for each ID, and scrapes the useful information. The specific implementation is as follows:
import requests
from bs4 import BeautifulSoup

def get_music_info(musicids):
    musicinfo = []
    for musicid in musicids:
        url = 'http://www.kuwo.cn/play_detail/' + str(musicid)
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/58.0.3029.110 Safari/537.3'}
        response = requests.get(url, headers=headers)
        response.encoding = response.apparent_encoding
        soup = BeautifulSoup(response.text, 'html.parser')
        music_title = soup.find('h1', class_='info_tit').text.strip()    # song title
        artist = soup.find('p', class_='name').text.strip()              # artist
        album = soup.find('a', class_='sname').text.strip()              # album
        duration = soup.find('span', class_='hilight').text.strip()      # duration
        play_counts = soup.find('em', class_='num').text.strip()         # play count
        comments_counts = soup.find('em', class_='sub').text.strip()     # comment count
        musicinfo.append([musicid, music_title, artist, album, duration,
                          play_counts, comments_counts])
        print('Crawling info for "{}"'.format(music_title))
    return musicinfo
The code above uses the requests library to fetch the page and the BeautifulSoup library to parse the HTML and extract the useful tags. The headers dictionary carries a spoofed Chrome User-Agent so the request looks like ordinary browser traffic and is less likely to be blocked by the server.
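A quick usage example (the IDs below are placeholders for illustration, not real Kuwo IDs):

music_ids = ['228908', '440616', '94239']  # hypothetical IDs
info = get_music_info(music_ids)
print(info[0])  # [musicid, title, artist, album, duration, plays, comments]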
2. Data Saving
We save the crawled data in CSV format. First, import the csv module:
import csv
Then we define a save function that writes the crawled music information to a local file in proper CSV format. The specific implementation is as follows:
def save_csv(save_path, data_list):
    with open(save_path, 'w', newline='') as f:
        writer = csv.writer(f)
        # Header row; these Chinese column names are referenced again
        # in the analysis section below.
        writer.writerow(['歌曲ID', '歌曲名称', '歌手', '专辑', '歌曲时长', '播放量', '评论数'])
        writer.writerows(data_list)
    print('Data saved to {}'.format(save_path))
The code above uses csv.writer() to write the music information to the file. Note that fields in a CSV file are comma-separated, and the file must be opened with newline='' to prevent extra blank lines from appearing between rows (a Windows quirk of the csv module).
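Putting the two steps together (reusing the placeholder IDs from the earlier example) produces the 'music_data.csv' file that is read back in the analysis section below:

data = get_music_info(music_ids)
save_csv('music_data.csv', data)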
3. Data Analysis
After the data has been crawled and saved, we can start analyzing and processing it. In Python, libraries such as pandas and matplotlib make data analysis and visualization straightforward.
1. Import libraries
Data analysis here mainly uses the pandas, matplotlib, and seaborn libraries, so we import them first:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
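Because the chart labels below are in Chinese, matplotlib's default font may render them as empty boxes. A common workaround on Windows, assuming the SimHei font is installed, is:

plt.rcParams['font.sans-serif'] = ['SimHei']  # use a Chinese-capable font
plt.rcParams['axes.unicode_minus'] = False    # keep the minus sign rendering correctly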
2. Read files
We can use the read_csv() function from pandas to load the saved CSV file into a DataFrame:
music_data = pd.read_csv('music_data.csv')
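One caveat before sorting: the play counts were scraped as text and may carry a "万" (ten-thousand) unit suffix, in which case string sorting would give wrong results. A hedged cleaning step, assuming the only formats are plain numbers and "12.3万"-style values, might look like this:

def parse_count(value):
    # Convert a scraped count such as '1234' or '12.3万' to an integer.
    value = str(value).strip()
    if value.endswith('万'):
        return int(float(value[:-1]) * 10000)
    return int(float(value))

music_data['播放量'] = music_data['播放量'].apply(parse_count)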
3. Data filtering and sorting
We can use pandas' sort_values() method to sort the data in descending order of play count, and the head() method to keep only the top 20 rows.
top_20_play_counts = music_data.sort_values('播放量', ascending=False).head(20)
4. Data visualization
Visualizing the data with matplotlib gives a clearer view of the relationships and trends in it. The following code draws a chart of the 20 most-played songs on Kuwo Music.
plt.figure(figsize=(20, 8))                                       # set the figure size
sns.lineplot(x='歌曲名称', y='播放量', data=top_20_play_counts)     # draw the line chart
plt.xticks(rotation=90, fontsize=14)                              # rotate and size x-axis tick labels
plt.yticks(fontsize=14)                                           # size y-axis tick labels
plt.xlabel('歌曲名称', fontsize=16)                                 # axis titles
plt.ylabel('播放量', fontsize=16)
plt.title('酷我音乐播放量排名前20的歌曲', fontsize=20)                # chart title
plt.show()                                                        # display the figure
With this chart, we get a more intuitive view of the play counts of Kuwo Music's top 20 songs.
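Since song names are categories rather than points on a continuous axis, a bar chart is arguably a better fit than a line chart for this ranking. An alternative sketch using seaborn's barplot():

plt.figure(figsize=(20, 8))
sns.barplot(x='歌曲名称', y='播放量', data=top_20_play_counts)  # one bar per song
plt.xticks(rotation=90, fontsize=14)
plt.xlabel('歌曲名称', fontsize=16)
plt.ylabel('播放量', fontsize=16)
plt.title('酷我音乐播放量排名前20的歌曲', fontsize=20)
plt.show()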
4. Summary
This article walked through a practical crawler case in Python: analyzing the target site's page structure and data storage, crawling the data with the requests and BeautifulSoup libraries, and finally analyzing and visualizing it with the pandas and matplotlib libraries. We hope it gives you a better practical understanding of how Python is applied to web crawling.