This article introduces how to use Beautiful Soup to parse the DOM.
Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolbox that extracts the data you need by parsing the document for you. Because it is so simple, you can write a complete application without much code.
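As a quick taste of those three operations, here is a minimal sketch; the HTML snippet is made up purely for illustration:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p class='a'>one</p><p>two</p></body></html>", "html.parser")
print(soup.p)                                  # navigation: the first <p> tag
print(soup.find_all("p"))                      # search: every <p> tag in the tree
print(soup.find("p", class_="a").get_text())   # search by CSS class -> "one"
soup.p.string = "changed"                      # modification: rewrite the text in place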
Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8, so you usually don't need to think about encodings at all. The exception is when the document does not declare an encoding and Beautiful Soup cannot detect it automatically; in that case you just specify the original encoding yourself.
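A minimal sketch of that fallback is shown below; the file name page.html and the gb2312 encoding are only placeholders for illustration, but from_encoding is the bs4 parameter for this purpose:

# coding=utf-8
from bs4 import BeautifulSoup

# Usually bs4 detects the encoding on its own:
html = open("page.html").read()            # page.html is a hypothetical file
soup = BeautifulSoup(html)
print(soup.original_encoding)              # what bs4 decided the input encoding was

# If the document declares no encoding and detection fails,
# pass the original encoding explicitly:
soup = BeautifulSoup(html, from_encoding="gb2312")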
Beautiful Soup works together with parsers such as lxml and html5lib, giving you the flexibility to choose different parsing strategies or to trade that flexibility for speed.
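The parser is chosen when the soup object is constructed. A small sketch (all three parser names are real bs4 options, but lxml and html5lib must be installed separately):

from bs4 import BeautifulSoup

markup = "<p>content</p>"
soup = BeautifulSoup(markup, "html.parser")   # built-in parser, no extra install
soup = BeautifulSoup(markup, "lxml")          # faster, needs the lxml package
soup = BeautifulSoup(markup, "html5lib")      # most lenient, needs the html5lib package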
The author uses a Mac, so only the macOS installation of Beautiful Soup is covered here. Installing Python third-party libraries is very simple; the author has always used pip.
Install pip
easy_install pip
Install Beautiful Soup
pip install beautifulsoup4
Sometimes the pip install beautifulsoup4 command fails with a permissions error. In that case, prefix the command with sudo to gain the required permissions.
sudo pip install beautifulsoup4
With the preparations done, we can start using Beautiful Soup. This article only uses a few of bs4's methods (note: bs4 = beautifulsoup4 throughout). If you want to know more about bs4, check the official bs4 documentation. The project is very simple: an index.py file and an html folder containing a few .html static files.
index.py file
# coding=utf-8
import os
import sys
from bs4 import BeautifulSoup

# List that holds the file paths
paths = []

# Collect every file path under the html folder
def get_paths():
    for fpathe, dirs, fs in os.walk('html'):
        for f in fs:
            # print os.path.join(fpathe, f)
            # Join the directory and file name into a full path
            filepath = os.path.join(fpathe, f)
            # Only keep paths with the .html extension
            if os.path.splitext(f)[1] == ".html":
                paths.append(filepath)

# Read an html file, modify it, and write it back to the same file
def reset_file(path):
    # Make sure the file exists
    if not os.path.isfile(path):
        raise TypeError(path + " does not exist")
    # Read the file. bs4 automatically converts the input document to Unicode
    # and the output document to UTF-8. bs4 can also parse an html string
    # directly, e.g. BeautifulSoup('<p>content</p>')
    soup = BeautifulSoup(open(path))
    # select is a bs4 method that is as convenient as jQuery's $ selector: you
    # can look up elements by tag (e.g. p, title, ...) or by CSS class (.) and
    # id (#), much the same way you would with $.
    # Select every a tag inside the li elements under the node with id="nav";
    # the return value is a list
    nav_a = soup.select("#nav li a")
    # Modify the href attribute of the a tags
    if len(nav_a) > 1:
        nav_a[0]["href"] = "/m/"
        nav_a[1]["href"] = "/m/about_mobile/m_about.html"
    # Select every a tag inside elements with class="footer"
    footer_a = soup.select(".footer a")
    if len(footer_a) > 1:
        footer_a[1]["href"] = "/m/about_mobile/m_sjdt.html"
    content_p = soup.select(".content p")
    # Change the text content inside the first <p> tag
    if len(content_p) > 0:
        content_p[0].string = "Modified test content inside the p tag"
    # Change the system default encoding (Python 2)
    reload(sys)
    sys.setdefaultencoding('utf-8')
    # Open the file in write mode. Do not put the open() call inside the try
    # block, otherwise a failed open would raise again in the finally clause
    f = open(path, "w")
    try:
        # Write the modified document back
        f.write(soup.prettify())
    finally:
        # Close the file
        f.close()

# Entry point of the program
if __name__ == "__main__":
    get_paths()
    # Walk through all collected file paths
    for p in paths:
        reset_file(p)
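To see what the select calls above do in isolation, here is a minimal standalone sketch; the HTML snippet is made up, but the calls mirror the ones in index.py:

# coding=utf-8
from bs4 import BeautifulSoup

html = """
<div id="nav"><ul><li><a href="/old/">Home</a></li><li><a href="/old/about/">About</a></li></ul></div>
<div class="content"><p>old text</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

links = soup.select("#nav li a")                    # CSS selector, returns a list of tags
links[0]["href"] = "/m/"                            # rewrite an attribute
soup.select(".content p")[0].string = "new text"    # rewrite the text node
print(soup.prettify())                              # serialize the modified tree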