
Example: Obtaining Proxy IPs with Python

不言 (original) 2018-05-07 14:08:29

This article shares an example of obtaining proxy IPs with Python. It may be a useful reference for anyone who needs it.

When we crawl data, some websites block repeated visits from the same IP. In that case we should disguise ourselves with a proxy IP before each visit so that the "enemy" cannot detect us.

OK, let's get started!

This is the file that obtains the proxy IPs. I modularized it into three functions.

Note: some comments in the code are in English because that was convenient while writing; a word or two of English is manageable, after all.

#!/usr/bin/python
#-*- coding:utf-8 -*-
"""
author:dasuda
"""
import urllib2
import re
import socket
import threading
findIP = [] # raw IP data as scraped
IP_data = [] # IP data after the port is appended
IP_data_checked = [] # IP data after the availability check
findPORT = [] # port for each IP
available_table = [] # indexes of the usable IPs
def getIP(url_target):
 patternIP = re.compile(r'(?<=<td>)[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}')
 patternPORT = re.compile(r'(?<=<td>)[\d]{2,5}(?=</td>)')
 print "now,start to refresh proxy IP..."
 for page in range(1,4):
  url = 'http://www.xicidaili.com/nn/'+str(page)
  headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64)"}
  request = urllib2.Request(url=url, headers=headers)
  response = urllib2.urlopen(request)
  content = response.read()
  findIP = re.findall(patternIP,str(content))
  findPORT = re.findall(patternPORT,str(content))
  #assemble the ip and port
  for i in range(len(findIP)):
   findIP[i] = findIP[i] + ":" + findPORT[i]
  #extend once per page, after every entry has its port
  IP_data.extend(findIP)
  print "get page", page
 print "refresh done!!!"
 #use multithreading
 mul_thread_check(url_target)
 return IP_data_checked
#one lock shared by all threads; a lock created inside check_one would be private to each call
lock = threading.Lock()
def check_one(url_check,i):
 #setting timeout
 socket.setdefaulttimeout(8)
 try:
  ppp = {"http":IP_data[i]}
  proxy_support = urllib2.ProxyHandler(ppp)
  #use a private opener; install_opener() is global and would race across threads
  openercheck = urllib2.build_opener(proxy_support)
  request = urllib2.Request(url_check)
  request.add_header('User-Agent',"Mozilla/5.0 (Windows NT 10.0; WOW64)")
  html = openercheck.open(request).read()
  lock.acquire()
  print IP_data[i],'is OK'
  #get available ip index
  available_table.append(i)
  lock.release()
 except Exception as e:
  lock.acquire()
  print 'error'
  lock.release()
def mul_thread_check(url_mul_check):
 threads = []
 for i in range(len(IP_data)):
  #create thread...
  thread = threading.Thread(target=check_one, args=[url_mul_check,i])
  threads.append(thread)
  thread.start()
  print "new thread start",i
 for thread in threads:
  thread.join()
 #get the IP_data_checked[]
 for idx in available_table:
  assemble_ip = {'http': IP_data[idx]}
  IP_data_checked.append(assemble_ip)
 print "available proxy ip:",len(available_table)

1. getIP(url_target): the main function. Its parameter is the URL used to verify that a proxy IP is usable; ipchina is recommended.

The proxy IPs are obtained from the http://www.xicidaili.com/nn/ website, which provides free proxy IPs. Not all of them can be used, however: depending on your geographical location, network conditions, the target server, and so on, probably fewer than 20% are usable, at least in my case.

The http://www.xicidaili.com/nn/ page is fetched in the normal way, and the required IPs and their ports are extracted from the returned page content with regular expressions. The code is as follows:

patternIP = re.compile(r'(?<=<td>)[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}')
patternPORT = re.compile(r'(?<=<td>)[\d]{2,5}(?=</td>)')
...
findIP = re.findall(patternIP,str(content))
findPORT = re.findall(patternPORT,str(content))

For how to construct regular expressions, refer to other articles.

The obtained IPs are stored in findIP and the corresponding ports in findPORT; the two lists correspond by index. A page normally yields 100 IPs.
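
To see what the two patterns match, here is a minimal sketch that runs them on a made-up HTML fragment. The sample rows and the names pattern_ip, pattern_port and sample_html are assumptions for illustration; the real page layout may differ.

```python
import re

# Same patterns as in getIP(); only the variable names differ.
pattern_ip = re.compile(r'(?<=<td>)[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}')
pattern_port = re.compile(r'(?<=<td>)[\d]{2,5}(?=</td>)')

# Hypothetical table rows, not copied from the real site.
sample_html = """
<tr><td>10.0.0.1</td><td>8080</td><td>HTTP</td></tr>
<tr><td>10.0.0.2</td><td>3128</td><td>HTTP</td></tr>
"""

ips = pattern_ip.findall(sample_html)
ports = pattern_port.findall(sample_html)
print(ips)    # ['10.0.0.1', '10.0.0.2']
print(ports)  # ['8080', '3128']
```

The port pattern requires a closing </td> right after the digits, so it does not match the leading digits of an IP cell.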

Next, each IP is joined with its port.
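
The splicing step could be sketched as below. The helper join_ip_port is a hypothetical name, not part of the original file, and the length check is an added safeguard: because the two lists are matched purely by index, a stray numeric cell on the page would silently shift every port.

```python
def join_ip_port(ips, ports):
    """Pair each IP with the port at the same index and return 'ip:port' strings."""
    if len(ips) != len(ports):
        # Matching by index only makes sense when both lists have the same length.
        raise ValueError("IP and port lists differ in length")
    return ["%s:%s" % (ip, port) for ip, port in zip(ips, ports)]

print(join_ip_port(["10.0.0.1", "10.0.0.2"], ["8080", "3128"]))
# ['10.0.0.1:8080', '10.0.0.2:3128']
```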

Finally, the availability check is run.

2. check_one(url_check,i): the thread function

url_check is again visited in the normal way, this time through the proxy. If the web page is returned, the proxy IP is usable, and its index is recorded so that all usable IPs can be extracted later.
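
For readers on Python 3, where urllib2 has become urllib.request, a single check might look roughly like this. It is a sketch under assumptions, not the original author's code: check_proxy is a hypothetical name, it takes the 'ip:port' string directly instead of an index, and it returns a boolean instead of appending to a global list.

```python
import http.client
import urllib.request

def check_proxy(url_check, proxy, timeout=8):
    """Return True if url_check can be fetched through the HTTP proxy 'ip:port'."""
    # A private opener keeps the proxy setting local to this call.
    handler = urllib.request.ProxyHandler({"http": proxy})
    opener = urllib.request.build_opener(handler)
    request = urllib.request.Request(
        url_check,
        headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64)"},
    )
    try:
        with opener.open(request, timeout=timeout) as response:
            response.read()
        return True
    except (OSError, http.client.HTTPException):
        # Refused connections, timeouts, HTTP errors and malformed replies
        # all count as "proxy not usable".
        return False
```

Only the "http" scheme is mapped to the proxy here, as in the original code, so url_check should be an http:// URL; an https:// URL would not go through the proxy at all.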

3. mul_thread_check(url_mul_check): multi-thread generation

This function uses multiple threads to check proxy IP availability; one thread is started for each IP.
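
With a few hundred scraped IPs, one thread per IP means a few hundred threads at once. A possible alternative, which differs from the approach above, is a bounded thread pool from Python 3's concurrent.futures. The function filter_working and its parameters are hypothetical, and the checker passed in the demo is a stand-in rather than a real network check.

```python
from concurrent.futures import ThreadPoolExecutor

def filter_working(proxies, check, max_workers=20):
    """Run check(proxy) concurrently and keep the proxies for which it returns True."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, so they line up with proxies.
        results = list(pool.map(check, proxies))
    return [proxy for proxy, ok in zip(proxies, results) if ok]

# Stand-in checker for demonstration: pretend only port 8080 proxies work.
print(filter_working(["10.0.0.1:8080", "10.0.0.2:3128"],
                     lambda p: p.endswith(":8080")))
# ['10.0.0.1:8080']
```

Because the results are collected from return values, no lock or shared index list is needed in this variant.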

To use this module, call getIP() directly, passing in the URL used to check availability. It returns a list of the IPs that passed the availability check, in the format

[{'http': 'ip1:port1'}, {'http': 'ip2:port2'}, ...]
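
In the code above, each entry appended to IP_data_checked is a dict of the form {'http': 'ip:port'}, which is the shape a proxy handler expects. One way to use the result, sketched for Python 3 with a hypothetical helper opener_for and made-up addresses, is to pick a random entry before each visit:

```python
import random
import urllib.request

def opener_for(checked_proxies):
    """Pick one entry from the checked list and build an opener that uses it."""
    proxy = random.choice(checked_proxies)  # e.g. {'http': '10.0.0.1:8080'}
    handler = urllib.request.ProxyHandler(proxy)
    return proxy, urllib.request.build_opener(handler)

# Made-up list standing in for the value returned by getIP().
checked = [{"http": "10.0.0.1:8080"}, {"http": "10.0.0.2:3128"}]
proxy, opener = opener_for(checked)
print("using", proxy["http"])
# opener.open(some_http_url) would then send the request through that proxy.
```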
