Python crawler agent IP pool implementation method-Python Tutorial-php.cn

Python crawler agent IP pool implementation method

高洛峰

Release： 2017-02-11 13:09:46

Original

2494 people have browsed it

Working as a distributed deep web crawler in the company, we have built a stable proxy pool service to provide effective proxies for thousands of crawlers, ensuring that each crawler gets a valid proxy IP for the corresponding website, thereby ensuring that the crawler It runs quickly and stably, so I want to use some free resources to build a simple proxy pool service.

Working as a distributed deep web crawler in the company, we have built a stable proxy pool service to provide effective proxies for thousands of crawlers, ensuring that each crawler gets a valid proxy IP for the corresponding website. , thereby ensuring the fast and stable operation of the crawler. Of course, things done in the company cannot be open sourced. However, I feel itchy in my free time, so I want to use some free resources to build a simple proxy pool service.

1. Question

Where does the proxy IP come from?
When I first learned crawling by myself, I didn’t have a proxy IP, so I went to websites with free proxies such as Xiqi Proxy and Express Proxy to crawl. There were still some proxies that could be used. Of course, if you have a better proxy interface, you can also connect it yourself. The collection of free agents is also very simple. It is nothing more than: visit the page page —> Regular/xpath extraction —> Save

How to ensure the quality of the agent?
It is certain that most of the free proxy IPs cannot be used, otherwise why would others provide paid IPs (but the fact is that the paid IPs of many agents are not stable, and many of them cannot be used). Therefore, the collected proxy IP cannot be used directly. You can write a detection program to continuously use these proxies to access a stable website to see if it can be used normally. This process can be multi-threaded or asynchronous, since detecting proxies is a slow process.

How to store the collected agents?
Here I have to recommend a high-performance NoSQL database SSDB that supports multiple data structures for proxying Redis. Supports queue, hash, set, k-v pairs, and T-level data. It is a very good intermediate storage tool for distributed crawlers.

How to make it easier for crawlers to use these proxies?
The answer is definitely to make it a service. Python has so many web frameworks. Just pick one and write an API for the crawler to call. This has many benefits. For example, when the crawler finds that the agent cannot be used, it can actively delete the agent IP through the API. When the crawler finds that the agent pool IP is not enough, it can actively refresh the agent pool. This is more reliable than the detection program.

2. Proxy pool design

The proxy pool consists of four parts:

ProxyGetter:
Proxy acquisition interface, currently there are 5 free ones Proxy source, every time it is called, the latest proxies of these 5 websites will be captured and put into the DB. You can add additional proxy acquisition interfaces by yourself;

DB:
is used to store the proxy IP. Currently, it only Support SSDB. As for why you chose SSDB, you can refer to this article. I personally think SSDB is a good Redis alternative. If you have not used SSDB, it is very simple to install. You can refer to here;

Schedule:
Scheduled task users regularly check the agent availability in the DB and delete unavailable agents. At the same time, it will also take the initiative to get the latest proxy through ProxyGetter and put it into the DB;

ProxyApi:
The external interface of the proxy pool. Since the proxy pool function is relatively simple now, I spent two hours looking at Flask. I was happy. The decision was made with Flask. The function is to provide get/delete/refresh and other interfaces for crawlers to facilitate direct use by crawlers.

Python crawler agent IP pool implementation method

[HTML_REMOVED] Design

3. Code module

High-level data structure in Python, dynamic Types and dynamic binding make it very suitable for rapid application development, and also suitable as a glue language to connect existing software components. It is also very simple to use Python to create this proxy IP pool. The code is divided into 6 modules:

Api: api interface related code. The api is currently implemented by Flask, and the code is also very simple. The client request is passed to Flask, and Flask calls the implementation in ProxyManager, including get/delete/refresh/get_all;

DB: database related code. The current database uses SSDB. The code is implemented in factory mode to facilitate expansion of other types of databases in the future;

Manager: get/delete/refresh/get_all and other interface specific implementation classes. Currently, the proxy pool is only responsible for managing the proxy. In the future, There may be more functions, such as the binding of agents and crawlers, the binding of agents and accounts, etc.;

ProxyGetter: Relevant codes obtained by agents. Currently, fast agents, agents 66, and agents are captured. , Xisha Proxy, and guobanjia are free proxies for five websites. After testing, these five websites only have sixty or seventy available proxies that are updated every day. Of course, they also support their own expansion of the proxy interface;

Schedule: Scheduled task related The code is now just implemented to refresh the code regularly and verify the available agents, using a multi-process approach;

Util: Stores some public module methods or functions, including GetConfig: the class that reads the configuration file config.ini, ConfigParse: integrated rewriting ConfigParser's class makes it case-sensitive, Singleton: implements a singleton, LazyProperty: implements lazy calculation of class properties. Etc.;

Other files: Configuration file: Config.ini, database configuration and proxy acquisition interface configuration. You can add a new proxy acquisition method in GetFreeProxy and register it in Config.ini to use;

4. Installation

Download code:

git clone git@github.com:jhao104/proxy_pool.git

或者直接到https://github.com/jhao104/proxy_pool 下载zip文件

Copy after login

Installation dependencies:

pip install -r requirements.txt

Copy after login

Startup:

需要分别启动定时任务和api
到Config.ini中配置你的SSDB

到Schedule目录下:
>>>python ProxyRefreshSchedule.py

到Api目录下:
>>>python ProxyApi.py

Copy after login

5. Use

After the scheduled task is started, all agents will be fetched into the database through the agent acquisition method and verified. Thereafter, it will be repeated every 20 minutes by default. About a minute or two after the scheduled task is started, you can see the available proxies refreshed in SSDB:

Python crawler agent IP pool implementation method

You can use it in the browser after starting ProxyApi.py The interface gets the proxy, here is the screenshot in the browser:
index page:

Python crawler agent IP pool implementation method

get page:

Python crawler agent IP pool implementation method

get_all page:

Python crawler agent IP pool implementation method

Used in crawlers. If you want to use it in crawler code, you can encapsulate this api into a function and use it directly. , For example:

import requests

def get_proxy():
  return requests.get("http://127.0.0.1:5000/get/").content

def delete_proxy(proxy):
  requests.get("http://127.0.0.1:5000/delete/?proxy={}".format(proxy))

# your spider code

def spider():
  # ....
  requests.get('https://www.example.com', proxies={"http": "http://{}".format(get_proxy)})
  # ....

Copy after login

6. Finally

I am in a hurry and the functions and codes are relatively simple. I will improve it when I have time in the future. If you like it, give it a star on github. grateful!

For more articles related to Python crawler agent IP pool implementation methods, please pay attention to the PHP Chinese website!