In web development we often need to crawl and parse web pages, and many languages can do this job. I like to use Python for it, because Python provides many mature modules that make web crawling easy.
However, you will run into encoding problems during crawling, so today we will look at how to determine the encoding of a web page.
Web pages on the Internet use a variety of encodings, most commonly GBK, GB2312, and UTF-8.
After we fetch a page's data, we must first determine its encoding; only then can we convert the captured content into one encoding we can work with and avoid garbled text.
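As a minimal sketch (assuming the page's encoding has already been identified as GBK; the bytes and variable names here are purely illustrative), the conversion step is just a decode followed by a re-encode:

# hypothetical raw bytes fetched from a GBK-encoded page
raw = '\xc4\xe3\xba\xc3'      # GBK bytes for a short Chinese string
text = raw.decode('gbk')      # decode to a unicode object
utf8 = text.encode('utf-8')   # re-encode in the encoding we work with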
The following two methods can be used to determine a web page's encoding:
Summary: the second method is the more accurate one. When determining a page's encoding, it is best to let a Python module analyze the content itself; parsing the charset out of the meta/header information is less reliable.
Method 1: Use the getparam method of the urllib module
# author: pythontab.com
import urllib

# info() returns the response headers; getparam() reads the charset
# declared in the Content-Type header
fopen1 = urllib.urlopen('http://www.baidu.com').info()
print fopen1.getparam('charset')  # prints the charset baidu declares
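On Python 3, where urllib.urlopen and getparam no longer exist, the same header-based check can be written like this (a minimal sketch):

from urllib.request import urlopen

resp = urlopen('http://www.baidu.com')
# get_content_charset() reads the charset from the Content-Type header,
# returning None if no charset is declared
print(resp.headers.get_content_charset())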
Method 2: Use the chardet module
# If your Python does not have the chardet module installed,
# install it first (e.g. pip install chardet)
# author: pythontab.com
import chardet
import urllib

# First fetch the raw page content
data1 = urllib.urlopen('http://www.baidu.com').read()
# Let chardet analyze the content
chardit1 = chardet.detect(data1)
print chardit1['encoding']  # prints the detected encoding of baidu's page
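Putting the two steps together, here is a minimal Python 3 sketch that fetches a page, detects its encoding with chardet, and decodes the bytes (the UTF-8 fallback for failed detection is an assumption, not part of chardet):

import chardet  # pip install chardet
from urllib.request import urlopen

raw = urlopen('http://www.baidu.com').read()  # raw bytes of the page
guess = chardet.detect(raw)                   # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
# decode to str; fall back to UTF-8 if detection returns None
text = raw.decode(guess['encoding'] or 'utf-8')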