Python is very capable at crawling web pages: with urllib or urllib2 you can fetch page content in just a few lines. But keep in mind that many websites have anti-scraping measures, so grabbing the content you want is not always that simple.
Today I will share how to make a crawler imitate a browser to get past such blocking, in both Python 2 and Python 3.
The most basic crawling:
#! /usr/bin/env python
# -*- coding=utf-8 -*-
# @Author pythontab
import urllib.request

url = "http://www.pythontab.com"
# read() returns the raw page bytes; call .decode() if you need text
html = urllib.request.urlopen(url).read()
print(html)
But... some websites cannot be fetched this way because of anti-scraping settings, so we have to change our approach.
python2 (latest stable version: 2.7)
#! /usr/bin/env python
# -*- coding=utf-8 -*-
# @Author pythontab.com
import urllib2

url = "http://pythontab.com"
req_header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Accept': 'text/html;q=0.9,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'gzip',
    'Connection': 'close',
    # Note: if the site still can't be crawled, add a 'Referer' header here
    # pointing at the target site's host (a None value would break urllib2)
}
req_timeout = 5
req = urllib2.Request(url, None, req_header)
resp = urllib2.urlopen(req, None, req_timeout)
html = resp.read()
print(html)
python3 (latest stable version: 3.3)
#! /usr/bin/env python
# -*- coding=utf-8 -*-
# @Author pythontab
import urllib.request

url = "http://www.pythontab.com"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Accept': 'text/html;q=0.9,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'gzip',
    'Connection': 'close',
    # Note: if the site still can't be crawled, add a 'Referer' header here
    # pointing at the target site's host
}
opener = urllib.request.build_opener()
# addheaders expects a list of (name, value) tuples, not a dict
opener.addheaders = list(headers.items())
data = opener.open(url).read()
print(data)
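One thing to watch out for: because the headers above advertise 'Accept-Encoding': 'gzip', the server may legitimately send back a gzip-compressed body, and read() will return the compressed bytes as-is. Below is a minimal sketch of handling this in Python 3; the decode_body and fetch helper names are my own, not part of the original article:

```python
import gzip
import urllib.request

def decode_body(raw, content_encoding):
    """Decompress the response body if the server applied gzip encoding."""
    if content_encoding == 'gzip':
        return gzip.decompress(raw)
    return raw

def fetch(url, headers, timeout=5):
    """Fetch a URL with browser-like headers, transparently un-gzipping the body."""
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return decode_body(resp.read(), resp.headers.get('Content-Encoding'))
```

Passing the headers dict straight to urllib.request.Request is also a slightly more direct alternative to build_opener() when you don't need other handlers.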