How to prevent your IP from being blocked when running a Python crawler

little bottle

Released: 2019-04-10 17:07:35

When you write a crawler, especially one that collects a large amount of data, your IP can easily get blocked because many websites deploy anti-crawler measures, and the crawl then cannot continue. This article summarizes several countermeasures for this problem. They can be used individually or combined for better results.

Fake User-Agent

Set the User-Agent request header to the value sent by a real browser so that the request looks like ordinary browser traffic. For example:

import requests

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'}
resp = requests.get(url, headers=headers)  # url: address of the page to crawl
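
Sending the same User-Agent on every request is itself a recognizable pattern, so one possible extension is to pick the header from a small pool. The following is a minimal sketch under that assumption; the USER_AGENTS list and the random_headers helper are hypothetical names introduced here for illustration, and the strings in the pool are sample values rather than a vetted list.

```python
import random

# Illustrative pool of browser User-Agent strings (assumed values; replace
# them with strings copied from the browsers you actually want to imitate).
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36",
]


def random_headers(rng=random):
    """Return request headers with a User-Agent picked at random from the pool."""
    return {"User-Agent": rng.choice(USER_AGENTS)}


# Usage with requests (not executed here):
#   resp = requests.get(url, headers=random_headers())
```

The rng parameter only exists so that a seeded random.Random instance can be passed in when reproducible behavior is wanted.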

Set a random time interval between requests

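import random
import time

# For example: pause for a whole number of seconds chosen from [0, 3]
time.sleep(random.randint(0, 3))
# Or: pause for a fractional number of seconds chosen from [0, 1)
time.sleep(random.random())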
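
Both calls above can return 0, which means no pause at all. A small helper with a nonzero lower bound avoids that. This is a sketch only: polite_sleep and its default bounds of 1 to 3 seconds are assumptions made for illustration, not values recommended by any particular site.

```python
import random
import time


def polite_sleep(min_seconds=1.0, max_seconds=3.0, sleep=time.sleep):
    """Pause for a random duration between min_seconds and max_seconds.

    Returns the chosen delay so the caller can log it.
    """
    if min_seconds < 0 or max_seconds < min_seconds:
        raise ValueError("expected 0 <= min_seconds <= max_seconds")
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


# Usage inside a crawl loop (not executed here):
#   for url in urls:
#       resp = requests.get(url)
#       polite_sleep()
```

The sleep parameter defaults to time.sleep and can be swapped for a stub when testing the crawl loop without actually waiting.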

Fake cookies

If you can access a page normally from the browser, you can copy the cookies from the browser and send them with your requests. For example:

cookies = dict(uuid='b18f0e70-8705-470d-bc4b-09a8da617e15', UM_distinctid='15d188be71d50-013c49b12ec14a-3f73035d-100200-15d188be71ffd')
resp = requests.get(url, cookies=cookies)

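# Convert the browser's cookie string into a dictionary
def cookies2dict(cookies):
    d = {}
    for item in cookies.split(';'):
        if '=' not in item:
            continue  # skip empty or malformed fragments
        k, v = item.split('=', 1)
        d[k.strip()] = v.strip()  # drop the space that follows each ';'
    return d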
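
As an alternative to splitting the string by hand, the standard library's http.cookies module can do the parsing. The sketch below assumes the cookie string was copied from the browser in the usual "name=value; name2=value2" form; cookie_header_to_dict is a hypothetical helper name. SimpleCookie is stricter about which characters it accepts, so it is worth checking that the result contains every cookie you expect.

```python
from http.cookies import SimpleCookie


def cookie_header_to_dict(cookie_string):
    """Parse a 'name=value; name2=value2' string into a plain dict."""
    jar = SimpleCookie()
    jar.load(cookie_string)
    return {name: morsel.value for name, morsel in jar.items()}


# Usage with requests (not executed here):
#   cookies = cookie_header_to_dict("uuid=b18f0e70-8705-470d-bc4b-09a8da617e15; UM_distinctid=15d188be71d50")
#   resp = requests.get(url, cookies=cookies)
```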

Note: Even when you send requests with browser cookies, the IP will still be blocked if the requests are too frequent. When that happens, complete the manual verification in the browser (for example, clicking on a verification image), after which you can continue to send requests with the cookies as normal.

Use a proxy

You can send requests through multiple proxy IPs so that no single IP issues too many requests and gets blocked. For example:

proxies = {'http': 'http://10.10.10.10:8765', 'https': 'https://10.10.10.10:8765'}
resp = requests.get(url, proxies=proxies)
# Note: free proxy IPs can be obtained from this site: http://www.xicidaili.com/nn/
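
The snippet above uses a single proxy. To spread requests over several proxies, one possible approach is to pick an address at random from a pool and move on to another one when a request fails. This is a rough sketch under stated assumptions: build_proxies and fetch_with_rotation are hypothetical helpers, the pool entries are "host:port" strings that you supply, and every proxy is assumed to be a plain HTTP proxy (hence the http:// scheme for both keys).

```python
import random


def build_proxies(proxy_address):
    """Build the proxies mapping that requests expects for one 'host:port' proxy."""
    proxy_url = "http://" + proxy_address
    return {"http": proxy_url, "https": proxy_url}


def fetch_with_rotation(url, proxy_pool, fetch, max_attempts=3):
    """Try up to max_attempts distinct proxies chosen at random from proxy_pool.

    fetch is any callable taking (url, proxies=...), for example requests.get.
    """
    candidates = random.sample(proxy_pool, min(max_attempts, len(proxy_pool)))
    last_error = None
    for address in candidates:
        try:
            return fetch(url, proxies=build_proxies(address))
        except Exception as error:  # proxy failed: remember why, try the next one
            last_error = error
    raise RuntimeError("all proxies failed") from last_error


# Usage with requests (not executed here):
#   pool = ["10.10.10.10:8765", "10.10.10.11:8765"]
#   resp = fetch_with_rotation(url, pool, requests.get)
```

Catching Exception keeps the sketch short; in a real crawler it would be more precise to catch requests.exceptions.RequestException and to pass a timeout to the request so that a dead proxy does not stall the loop.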
