Example of a Python Crawler Grabbing Proxy IPs and Checking Availability (Python Tutorial, php.cn)

不言

Released: 2018-05-07 12:00:34


This article presents an example of a Python crawler that grabs proxy IPs and checks whether they are usable. It may serve as a useful reference, so I am sharing it here for anyone who needs it.

If you write crawlers often, sooner or later your IP will be blocked by a target website, and a single IP is never enough. As a frugal programmer, I would rather not pay for proxies when I can find them myself. So this time I wrote a script that scrapes the IPs listed on the Xici proxy site. Unfortunately, that site blocks crawlers too!

As for how to deal with that, I suggest adding a delay between requests. My IP was probably blocked because I crawled too frequently.
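The delay idea could look something like the following. This is only a minimal sketch, written for Python 3; `crawl_with_delay`, its parameters, and the default range of 1 to 3 seconds are assumptions for illustration, not values from the original script. The `fetch` argument stands in for whatever function downloads one page.

```python
import random
import time


def crawl_with_delay(pages, fetch, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Fetch each page in turn, pausing a random interval between requests.

    All names and default values here are hypothetical; tune the delay
    range to whatever the target site tolerates.
    """
    results = []
    for index, page in enumerate(pages):
        if index > 0:
            # Randomized pause so the requests are not evenly spaced
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(page))
    return results


if __name__ == "__main__":
    # Stand-in fetch function; a real crawler would download the page here
    pages = crawl_with_delay(range(1, 4), lambda n: "page %d" % n,
                             min_delay=0.1, max_delay=0.2)
    print(pages)
```

Passing `sleep` as a parameter keeps the pause logic separate from the download logic, so the delay can be lengthened or replaced without touching the rest of the crawler.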

That said, you can also try IP Bus as another source. All roads lead to Rome, so there is no reason to depend on a single site.

Without further ado, here is the code.

#!/usr/bin/env python
# -*- coding:utf8 -*-
# Python 2 script: scrape proxies from the list pages, test each one,
# and save the working ones to proxy.txt.
import urllib2
import time
from bs4 import BeautifulSoup
import sys

# Python 2 only: force utf-8 as the default string encoding
reload(sys)
sys.setdefaultencoding("utf-8")

# Browser-like headers sent when requesting the proxy list pages
req_header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #'Accept-Language': 'en-US,en;q=0.8,zh-Hans-CN;q=0.5,zh-Hans;q=0.3',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'en-us',  # a language tag, not a content coding; kept as in the original
    'Connection': 'keep-alive',
    'Referer': 'http://www.baidu.com/'
}
req_timeout = 5                    # seconds
testUrl = "http://www.baidu.com/"  # page fetched through each proxy
testStr = "wahaha"                 # string that must appear in the response
file1 = open('proxy.txt', 'w')     # output: one working proxy per line
cookies = urllib2.HTTPCookieProcessor()
checked_num = 0  # proxies that passed the check
grasp_num = 0    # proxies scraped so far

for page in range(1, 160):
    req = urllib2.Request('http://www.xici.net.co/nn/' + str(page), None, req_header)
    html_doc = urllib2.urlopen(req, None, req_timeout).read()
    soup = BeautifulSoup(html_doc, 'html.parser')
    trs = soup.find('table', id='ip_list').find_all('tr')
    for tr in trs[1:]:  # skip the header row
        tds = tr.find_all('td')
        ip = tds[1].text.strip()
        port = tds[2].text.strip()
        protocol = tds[5].text.strip()
        if protocol == 'HTTP' or protocol == 'HTTPS':
            print '%s=%s:%s' % (protocol, ip, port)
            grasp_num += 1
            # Route the test request through the candidate proxy
            proxyHandler = urllib2.ProxyHandler({"http": r'http://%s:%s' % (ip, port)})
            opener = urllib2.build_opener(cookies, proxyHandler)
            opener.addheaders = [('User-Agent',
                                  'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36')]
            t1 = time.time()
            try:
                req = opener.open(testUrl, timeout=req_timeout)
                result = req.read()
                timeused = time.time() - t1
                # find() returns -1 when the marker is missing
                if result.find(testStr) > -1:
                    file1.write(protocol + "\t" + ip + "\t" + port + "\n")
                    checked_num += 1
                    print checked_num, grasp_num
            except Exception:
                # Dead or slow proxy: skip it
                continue
file1.close()
print checked_num, grasp_num
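
For readers on Python 3, where `urllib2` was split into `urllib.request` and `urllib.error`, the availability check from the listing above might be sketched roughly as follows. The function name `check_proxy` and its parameters are hypothetical; the test URL, marker string, and 5-second timeout are carried over from the script.

```python
import time
import urllib.request


def check_proxy(ip, port, test_url="http://www.baidu.com/",
                marker="wahaha", timeout=5):
    """Return (ok, seconds): whether test_url fetched through the proxy
    contains marker, and how long the attempt took.

    check_proxy is a hypothetical helper, not part of the original script.
    """
    handler = urllib.request.ProxyHandler({"http": "http://%s:%s" % (ip, port)})
    opener = urllib.request.build_opener(handler)
    start = time.time()
    try:
        with opener.open(test_url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
        ok = marker in body
    except Exception:
        # Refused connections, timeouts, and bad responses all mean "unusable"
        ok = False
    return ok, time.time() - start


if __name__ == "__main__":
    # 127.0.0.1:1 is a placeholder address where no proxy is expected to listen
    print(check_proxy("127.0.0.1", 1, timeout=2))
```

Returning the elapsed time alongside the result makes it possible to keep only the fast proxies, which the original script measures in `timeused` but does not use.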


Personally, I don't think the code is very complicated, so I have not explained it line by line; I believe most readers can follow it. If you find any problems, please point them out so we can all improve together!
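
The script writes each working proxy to `proxy.txt` as a tab-separated line of protocol, IP, and port. A later crawler could read that file back along these lines. This is a sketch for Python 3, and `parse_proxy_lines` is a made-up helper name rather than anything from the original code.

```python
def parse_proxy_lines(lines):
    """Turn 'PROTOCOL<TAB>IP<TAB>PORT' lines into (protocol, ip, port) tuples.

    Hypothetical helper; blank or malformed lines are skipped.
    """
    proxies = []
    for line in lines:
        parts = line.strip().split("\t")
        if len(parts) != 3 or not parts[2].isdigit():
            continue
        protocol, ip, port = parts
        proxies.append((protocol, ip, int(port)))
    return proxies


if __name__ == "__main__":
    # Sample lines in the format the script writes; the addresses are
    # documentation-only placeholders, not real proxies
    sample = ["HTTP\t192.0.2.10\t8080\n", "HTTPS\t192.0.2.11\t3128\n", "\n"]
    print(parse_proxy_lines(sample))
```

With a real file, `parse_proxy_lines(open('proxy.txt'))` would work the same way, since a file object iterates over its lines.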
