Preface
Take a free proxy IP website I recently discovered as an example: http://www.xicidaili.com/nn/. When using it, I found that many of the listed IPs do not work.
So I wrote a Python script that detects which of the proxy IPs are actually usable.
The script is as follows:
#encoding=utf8
# Python 2 script (it uses urllib2 and the print statement)
import urllib2
from bs4 import BeautifulSoup
import urllib
import socket

User_Agent = 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0'
header = {}
header['User-Agent'] = User_Agent

'''
Get all proxy IP addresses
'''
def getProxyIp():
    proxy = []
    # range(1,2) fetches page 1 only
    for i in range(1,2):
        try:
            url = 'http://www.xicidaili.com/nn/'+str(i)
            req = urllib2.Request(url,headers=header)
            res = urllib2.urlopen(req).read()
            soup = BeautifulSoup(res)
            ips = soup.findAll('tr')
            # Start at 1 to skip the table header row
            for x in range(1,len(ips)):
                ip = ips[x]
                tds = ip.findAll("td")
                # Second cell holds the IP, third cell holds the port
                ip_temp = tds[1].contents[0]+"\t"+tds[2].contents[0]
                proxy.append(ip_temp)
        except:
            continue
    return proxy

'''
Verify whether the obtained proxy IP addresses are usable
'''
def validateIp(proxy):
    url = "http://ip.chinaz.com/getip.aspx"
    # Raw string so the backslash in the Windows path is taken literally
    f = open(r"E:\ip.txt","w")
    # Give up on any proxy that does not answer within 3 seconds
    socket.setdefaulttimeout(3)
    for i in range(0,len(proxy)):
        try:
            ip = proxy[i].strip().split("\t")
            proxy_host = "http://"+ip[0]+":"+ip[1]
            proxy_temp = {"http":proxy_host}
            res = urllib.urlopen(url,proxies=proxy_temp).read()
            # Only proxies that returned a response are written out
            f.write(proxy[i]+'\n')
            print proxy[i]
        except Exception,e:
            continue
    f.close()

if __name__ == '__main__':
    proxy = getProxyIp()
    validateIp(proxy)
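The script above targets Python 2, where urllib.urlopen accepts a proxies argument. As a rough sketch of how the validation step might look on Python 3, where a proxy is configured through urllib.request.ProxyHandler instead, consider the following. The names proxy_url and is_proxy_alive are hypothetical helpers introduced for illustration, and the check URL is simply the one used in the script, which may no longer be reachable.

```python
import http.client
import urllib.request

# Same check URL as in the script above; it may no longer be live.
TEST_URL = "http://ip.chinaz.com/getip.aspx"


def proxy_url(line):
    """Turn an 'ip<TAB>port' line into 'http://ip:port' (hypothetical helper)."""
    host, port = line.strip().split("\t")
    return "http://" + host + ":" + port


def is_proxy_alive(line, url=TEST_URL, timeout=3):
    """Return True if `url` can be fetched through the proxy within `timeout` seconds."""
    handler = urllib.request.ProxyHandler({"http": proxy_url(line)})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(url, timeout=timeout) as response:
            response.read()
        return True
    except (OSError, http.client.HTTPException):
        # URLError and socket timeouts are OSError subclasses;
        # HTTPException covers malformed replies from a broken proxy.
        return False
```

Calling is_proxy_alive("1.2.3.4\t8080") would then attempt one request through that proxy and report whether it succeeded. Passing the timeout per request avoids changing the process-wide socket default, which is one possible design choice rather than a requirement.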
After the script runs successfully, open the file ip.txt on the E: drive. It lists the available proxy IP addresses and ports, one per line, with the IP and port separated by a tab.
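Since validateIp writes each working proxy as an IP and a port separated by a tab, the file can be read back for later use. The following is a minimal Python 3 sketch under that assumption; load_proxies is a hypothetical helper name, not part of the original script.

```python
def load_proxies(path):
    """Read 'ip<TAB>port' lines and return a list of (ip, port) tuples."""
    proxies = []
    with open(path) as f:
        for line in f:
            parts = line.strip().split("\t")
            if len(parts) == 2:  # skip blank or malformed lines
                proxies.append((parts[0], int(parts[1])))
    return proxies
```

For example, load_proxies(r"E:\ip.txt") would return the saved proxies as tuples, ready to be turned into proxy settings for later requests.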
Summary
The script above crawls only the first page of IP addresses. To crawl more pages, widen the range in getProxyIp (for example, range(1,4) covers pages 1 to 3). Because the website is updated from time to time, it is best to crawl only the first few pages. That is all for this article; I hope it is helpful to everyone learning to use Python.