Python is well suited to data processing, and it is a good choice if you want to write a web crawler. It ships with, and can easily install, many ready-made packages, so a lot of complex work can be done simply by calling them.
1 Python gets the content of the web page (that is, the source code)
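    import urllib.request  # Python 3; in Python 2 this module was called urllib2

    url = 'http://movie.douban.com/top250?format=text'  # the page to fetch
    page = urllib.request.urlopen(url)
    contents = page.read()  # the entire content of the page, i.e. its source code (as bytes)
    print(contents)

Here url is the address of the page and contents holds the source code returned for that address. The original version of this snippet used the urllib2 package, which exists only in Python 2; in Python 3 the same urlopen function lives in urllib.request. The three lines that open, read, and print the page are enough to get the entire source code of a web page.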
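The fetch step can be made a little more defensive. The sketch below, in Python 3, is only one possible way to do it: the helper names build_request, decode_body, and fetch_html, the User-Agent string, and the 10-second timeout are illustrative assumptions and do not come from the original code. It sends a browser-like header, because some sites refuse the default urllib one, and it decodes the returned bytes into text.

```python
import urllib.request

# Hypothetical browser-like User-Agent; some sites reject the default urllib one.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example-crawler)"}


def build_request(url, headers=None):
    """Create a Request object carrying the given (or default) headers."""
    return urllib.request.Request(url, headers=headers or DEFAULT_HEADERS)


def decode_body(raw, charset=None):
    """Turn the raw bytes of a response into text, defaulting to UTF-8."""
    return raw.decode(charset or "utf-8", errors="replace")


def fetch_html(url, timeout=10):
    """Download a page and return its source code as a string."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as page:
        # Use the charset announced by the server, if there is one
        charset = page.headers.get_content_charset()
        return decode_body(page.read(), charset)
```

Calling fetch_html('http://movie.douban.com/top250?format=text') would then return the page source as a string rather than as raw bytes.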
2 Extract the desired content from the web page (first fetch the page source, then analyze it to find the relevant tags, and finally extract the content inside those tags)
Take the Douban movie ranking as an example.
The goal is to get the name, rating, number of reviews, and link of every movie on the current page.
    # coding: utf-8
    '''
    @author: jsjxy
    '''
    import urllib.request  # Python 3 replacement for the Python 2 module urllib2

    from bs4 import BeautifulSoup

    page = urllib.request.urlopen('http://movie.douban.com/top250?format=text')
    contents = page.read()
    # print(contents)
    soup = BeautifulSoup(contents, "html.parser")
    print("Douban Movie TOP250" + "\n" + " Title  Rating  Number of reviews  Link ")
    # Each movie on the page sits in its own <div class="info"> block
    for tag in soup.find_all('div', class_='info'):
        # print(tag)
        m_name = tag.find('span', class_='title').get_text()
        m_rating_score = float(tag.find('span', class_='rating_num').get_text())
        m_people = tag.find('div', class_="star")
        m_span = m_people.find_all('span')
        # The fourth <span> inside the "star" div holds the number of reviews
        m_peoplecount = m_span[3].contents[0]
        m_url = tag.find('a').get('href')
        print(m_name + " " + str(m_rating_score) + " " + m_peoplecount + " " + m_url)
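The extraction step can also be tried without touching the network. The sketch below runs the same selectors against a small, made-up HTML fragment; SAMPLE_HTML and the function name parse_movies are hypothetical and only imitate the tag layout that the selectors above expect, not the real Douban page. Because find() returns None when a tag is missing, the sketch skips incomplete entries instead of failing with an AttributeError.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup shaped like the tags the selectors expect.
SAMPLE_HTML = """
<div class="info">
  <a href="http://example.com/movie/1"><span class="title">Example Movie</span></a>
  <div class="star">
    <span class="rating5-t"></span>
    <span class="rating_num">9.5</span>
    <span></span>
    <span>1000 reviews</span>
  </div>
</div>
<div class="info"><span class="title">Broken Entry</span></div>
"""


def parse_movies(html):
    """Return one dict per complete movie entry; skip entries missing a tag."""
    soup = BeautifulSoup(html, "html.parser")
    movies = []
    for tag in soup.find_all("div", class_="info"):
        title = tag.find("span", class_="title")
        rating = tag.find("span", class_="rating_num")
        star = tag.find("div", class_="star")
        link = tag.find("a")
        spans = star.find_all("span") if star is not None else []
        # find() returns None when a tag is absent, so check before using it
        if title is None or rating is None or link is None or len(spans) < 4:
            continue
        movies.append({
            "name": title.get_text(),
            "rating": float(rating.get_text()),
            "people": spans[3].get_text(),
            "url": link.get("href"),
        })
    return movies


print(parse_movies(SAMPLE_HTML))
```

Returning a list of dictionaries instead of printing inside the loop keeps parsing separate from output, so the same result can be printed, filtered, or saved later.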
The results are printed to the console; you can also write them to a file.
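Writing the results to a file could be sketched as follows, using the csv module from the standard library. The function name save_movies, the column names, the sample row, and the output path are all illustrative assumptions; in practice the rows would come from the scraping loop.

```python
import csv
import os
import tempfile


def save_movies(rows, path):
    """Write (name, rating, people, url) rows to a UTF-8 CSV file."""
    # newline="" lets the csv module control line endings itself
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "rating", "people", "url"])
        writer.writerows(rows)


# Hypothetical sample row standing in for real scraped data
rows = [("Example Movie", 9.5, "1000 reviews", "http://example.com/movie/1")]

# Hypothetical output location in the system's temporary directory
out_path = os.path.join(tempfile.gettempdir(), "douban_top250.csv")
save_movies(rows, out_path)
```

Opening the file with encoding="utf-8" matters here because the movie names on the page are in Chinese.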