Solution to the 403 error in a Python crawler

This article shows how to get past the 403 Forbidden error in a Python crawler. Readers who run into the same problem can use it as a reference.

Solving the 403 Forbidden error in a Python crawler

When writing a crawler in Python, a request may be rejected with 403 Forbidden (for example, html.getcode() reports 403). This means the website is blocking automated crawlers. One way around it is to set request headers with the urllib2 module (Python 2; in Python 3 the same functionality lives in urllib.request).

urllib2 is a more full-featured module for fetching URLs and provides many methods. For example, requesting url=http://blog.csdn.NET/qysh123 may be answered with 403 Forbidden.
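A likely reason for the block, though the article does not state it, is that Python's URL libraries announce themselves with their own User-Agent unless told otherwise. The following is a minimal sketch in Python 3 (where urllib2 has become urllib.request) that only inspects that default header and sends no request:

```python
import urllib.request

# build_opener() returns an OpenerDirector. Its addheaders list holds the
# headers that are added to a request when the caller has not set them.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)

# The default value identifies the client as Python's urllib,
# which a site can easily tell apart from a browser.
default_user_agent = default_headers["User-agent"]
print(default_user_agent)
```

A site that filters on this header can reject such requests without looking at anything else, which is why the fix below replaces it with a browser string.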

To solve this problem, the following steps are required:

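    import urllib2

    url = "http://blog.csdn.NET/qysh123"
    req = urllib2.Request(url)
    # Present the crawler as a regular browser
    req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36")
    # Note: "GET" is not a standard header name; urllib2 already sends a GET
    # request when no data is attached. The line is kept from the original.
    req.add_header("GET", url)
    req.add_header("Host", "blog.csdn.net")
    req.add_header("Referer", "http://blog.csdn.net/")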

User-Agent is a browser-specific header. You can find your own browser's value in its developer tools by inspecting the headers of any request.

Then send the request and print the response body:

    html = urllib2.urlopen(req)
    print html.read()

With these headers set, the full page source can be downloaded without running into the 403 Forbidden error.
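For readers on Python 3, the same idea can be sketched with urllib.request. This is a port made under assumptions, not the article's own code: the function name build_request is hypothetical, and the sketch only builds the request object so it can be inspected without touching the network.

```python
import urllib.request

# User-Agent string taken from the article's example.
BROWSER_UA = ("Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36")


def build_request(url):
    """Return a Request carrying browser-like headers; nothing is sent yet."""
    headers = {
        "User-Agent": BROWSER_UA,
        "Referer": "http://blog.csdn.net/",
    }
    return urllib.request.Request(url, headers=headers)


req = build_request("http://blog.csdn.net/qysh123")
# The method is GET whenever no data is attached to the request.
print(req.get_method())
# Request stores header names capitalized, hence "User-agent" here.
print(req.get_header("User-agent"))
```

To actually fetch the page, pass the object to urllib.request.urlopen(req) and call read() on the result, which returns bytes in Python 3. urllib.request fills in the Host header from the URL when the request is sent, so it is left out of this sketch.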

These steps can be wrapped in a function for easy reuse later. The complete code:

    # -*- coding: utf-8 -*-
    import random
    import urllib2

    url = "http://blog.csdn.net/qysh123/article/details/44564943"

    # Browser User-Agent strings to choose from at random
    my_headers = [
        "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14",
        "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Win64; x64; Trident/6.0)",
    ]


    def get_content(url, headers):
        """Fetch a page that would otherwise return 403 Forbidden."""
        random_header = random.choice(headers)
        req = urllib2.Request(url)
        req.add_header("User-Agent", random_header)
        req.add_header("Host", "blog.csdn.net")
        req.add_header("Referer", "http://blog.csdn.net/")
        # "GET" is not a standard header name; kept from the original.
        req.add_header("GET", url)
        content = urllib2.urlopen(req).read()
        return content


    print get_content(url, my_headers)

random.choice picks one of the prepared browser User-Agent strings at random. The Host, Referer, GET and similar values in the custom function have to be filled in by you for the site you are crawling. Once these are set, the page can be fetched normally and the 403 Forbidden response no longer appears.
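Instead of typing Host and Referer by hand for every site, they can be derived from the URL itself. The sketch below is an illustration in Python 3 and not part of the original code: browser_headers is a hypothetical helper, and using the site root as the Referer simply mirrors the value the article hard-codes.

```python
import random
from urllib.parse import urlsplit


def browser_headers(url, user_agents):
    """Build request headers; Host and Referer are derived from the URL."""
    parts = urlsplit(url)
    return {
        "User-Agent": random.choice(user_agents),
        "Host": parts.netloc,
        # Site root, e.g. "http://blog.csdn.net/"
        "Referer": "{}://{}/".format(parts.scheme, parts.netloc),
    }


agents = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0",
    "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Win64; x64; Trident/6.0)",
]
headers = browser_headers(
    "http://blog.csdn.net/qysh123/article/details/44564943", agents
)
print(headers["Host"])     # blog.csdn.net
print(headers["Referer"])  # http://blog.csdn.net/
```

The resulting dictionary can be passed as the headers argument of urllib.request.Request, so the same function works for any target site.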

Of course, if you send requests too frequently, some websites will still block you. Getting around that requires proxy IPs, which is left for you to work out.
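As a starting point for that, here is a rough sketch in Python 3 that combines a proxy with a pause between requests. Everything specific in it is an assumption: the proxy address is a placeholder, the two-second delay is arbitrary, and make_proxy_opener and fetch_all are hypothetical names.

```python
import time
import urllib.request

# Hypothetical proxy address, for illustration only; replace with a real one.
PROXY = "http://127.0.0.1:8080"


def make_proxy_opener(proxy_url):
    """Return an opener that routes http and https requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch_all(opener, urls, delay_seconds=2.0):
    """Fetch each URL in turn, sleeping between requests to lower the rate."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)
        with opener.open(url) as response:
            pages.append(response.read())
    return pages


opener = make_proxy_opener(PROXY)
```

Calling fetch_all(opener, list_of_urls) then sends every request through the proxy with a pause in between. Rotating between several proxies would follow the same pattern as the random User-Agent choice above.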
