1. Why anti-crawler?
Before designing an anti-crawler system, let's first look at what problems crawlers actually cause for a website.
Generally speaking, the pages on the Internet that people can browse, view, and use, and the data on those pages, are open and accessible, so there is no question of "unauthorized access" in the usual sense.
There is no essential difference between a crawler program accessing a web page and a human accessing a web page. In both cases, the client initiates an HTTP request to the website server. After receiving the request, the website server returns a content response to the client.
As long as a request is made, the website server must respond, and responding consumes the server's resources.
There is a mutually beneficial relationship between a website and its visitors: the website provides the information and services visitors need, and visitors bring traffic, users, and activity in return. Website owners are therefore happy to spend server bandwidth, disk, and memory on serving them.
And what about the crawler program? It is pure freeloading. It multiplies the load on the website's servers and eats up bandwidth while bringing the website no benefit at all; in the end it only harms the site itself.
Crawlers may be considered the African hyenas of the Internet, and it’s no wonder they are hated by website owners.
2. Identify crawlers
Since crawlers are unwelcome, they have to be kept out of the website. To deny them access, you must of course first be able to pick the crawler programs out from among ordinary visitors. How do you identify them?
1. HTTP request header
Ordinary visitors access a website through a browser, and the browser sends request headers describing itself along with every request, so this check only catches the most basic crawlers. HTTP request headers are easily defeated because anyone can modify or forge them.
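As a minimal sketch of this check, the snippet below rejects requests whose User-Agent is empty or matches a known crawler library. Flask is an assumption here (the original names no framework), and the keyword list is purely illustrative.

```python
# Screen the User-Agent header on every request (assumes a Flask app).
from flask import Flask, request, abort

app = Flask(__name__)

# Substrings commonly seen in obvious bot User-Agents (illustrative list).
SUSPICIOUS_UA_KEYWORDS = ("python-requests", "scrapy", "curl", "httpclient")

@app.before_request
def screen_user_agent():
    ua = (request.headers.get("User-Agent") or "").lower()
    # An empty UA, or one matching a known crawler library, gets a 403.
    if not ua or any(keyword in ua for keyword in SUSPICIOUS_UA_KEYWORDS):
        abort(403)

@app.route("/")
def index():
    return "Hello, human visitor!"
```

As the text notes, this is trivial to bypass by copying a real browser's headers, so it only filters out the laziest crawlers.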
2. Cookie value
Cookies are usually used to identify website visitors, rather like a temporary ID card in the visitor's hand, and the website server verifies identity against them. Unfortunately, cookies are stored on the client side and can likewise be modified and forged.
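One common mitigation is to make cookies tamper-evident by signing them server-side, so a forged or edited value fails verification. The sketch below uses only the standard library; the secret and cookie format are illustrative assumptions.

```python
# Tamper-evident cookies: the server signs the value with a secret key,
# so a visitor can read it but cannot modify it without detection.
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-real-secret"  # assumption: kept server-side only

def sign_cookie(value: str) -> str:
    sig = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
    return f"{value}|{sig}"

def verify_cookie(cookie: str) -> bool:
    try:
        value, sig = cookie.rsplit("|", 1)
    except ValueError:
        return False
    expected = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)

# Usage: issue sign_cookie("visitor_id=42") when the visitor arrives,
# call verify_cookie() on later requests; an edited cookie fails the check.
```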
3. Access frequency
When a visitor requests the same page every second, or requests it hundreds of times within a few seconds, that visitor is either a crawler or a ghost. What human being can click a mouse that fast and that often on one page? Does he have Parkinson's, or is he the reincarnation of an octopus?
Identifying crawlers by access frequency is feasible, but a crawler can use a large pool of proxy IPs so that each IP address only accesses the site once, and it can also randomize its request intervals to slip past this check.
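A minimal sketch of the frequency check, assuming an in-memory sliding window keyed by client IP (the window size and threshold are illustrative; production systems usually keep these counters in Redis or similar):

```python
# Sliding-window access-frequency check per client IP.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 10   # look at the last 10 seconds
MAX_REQUESTS = 20     # more hits than this in the window looks automated

_request_log = defaultdict(deque)  # client_ip -> timestamps of recent hits

def looks_like_crawler(client_ip: str) -> bool:
    now = time.time()
    hits = _request_log[client_ip]
    hits.append(now)
    # Drop timestamps that have slid out of the window.
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()
    return len(hits) > MAX_REQUESTS
```

A crawler rotating through proxies or sleeping random intervals will stay under this threshold, which is exactly the weakness described above.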
4. Mouse behavior trajectory
Ordinary human visitors do not move and click the mouse in rigid, machine-like steps while browsing. Mouse movements and clicks can be captured with JS scripts, so a visitor's mouse trajectory can be used to judge whether it is a crawler program.
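On the server side, one rough heuristic over the points collected by front-end JS is to flag trajectories that are perfectly straight, since human hand movement is irregular. The point format, threshold, and function name below are assumptions for illustration only; real systems use far richer behavioral models.

```python
# Rough heuristic: a path whose segments are all collinear looks scripted.
def trajectory_looks_scripted(points):
    """points: list of (x, y, timestamp_ms) tuples reported by front-end JS."""
    if len(points) < 3:
        return True  # virtually no mouse movement is itself suspicious

    # Direction of the first segment; scripted moves are often one straight line.
    (x0, y0, _), (x1, y1, _) = points[0], points[1]
    base_dx, base_dy = x1 - x0, y1 - y0
    for (ax, ay, _), (bx, by, _) in zip(points[1:], points[2:]):
        dx, dy = bx - ax, by - ay
        # A cross product near zero means this segment is collinear with the first.
        if abs(base_dx * dy - base_dy * dx) > 1.0:
            return False  # the path bends the way a human hand would
    return True  # every segment lies on the same straight line
```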
5. Token value
Many websites now separate the front end from the back end: the back-end interface returns data, and the front end renders the page after receiving it. Many crawler programs therefore go straight for the data interfaces instead of foolishly requesting pages. Tokens are used to protect these back-end data interfaces; a token is usually generated by combining a timestamp with a page key and encrypting or signing the result.
There are more ways to identify crawlers, which will not be covered one by one here. Unfortunately, whichever of the methods above is used, there is always a risk that crawlers will deceive it or slip past it.
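A sketch of that time-plus-key idea: the front end obtains a short-lived token and sends it with each API call, and the back end refuses data requests whose token is missing, forged, or expired. The secret, token format, and 60-second lifetime are illustrative assumptions, not a specific site's scheme.

```python
# Short-lived HMAC token over a timestamp, verified by the data interface.
import hashlib
import hmac
import time

API_SECRET = b"server-side-secret"  # assumption: never shipped to the client
TOKEN_TTL = 60                      # seconds a token stays valid

def issue_token() -> str:
    ts = str(int(time.time()))
    sig = hmac.new(API_SECRET, ts.encode(), hashlib.sha256).hexdigest()
    return f"{ts}.{sig}"

def token_is_valid(token: str) -> bool:
    try:
        ts, sig = token.split(".", 1)
    except ValueError:
        return False
    expected = hmac.new(API_SECRET, ts.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False
    return time.time() - int(ts) <= TOKEN_TTL  # reject expired tokens
```

In practice the token generation logic lives in front-end JS, so a determined crawler can reverse-engineer it; this raises the cost rather than closing the door.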
3. Reject crawlers
Just as there is no once-and-for-all security protection for a website: ten years ago, closing port 3389 was enough to keep a server from being turned into a zombie machine, while today, even with firewalls and every kind of security measure piled on, a single 0-day vulnerability can still get you held to ransom.
The struggle between crawlers and anti-crawler measures is a constant arms race. The difference between general network attack-and-defense and anti-crawling is that the former is a no-holds-barred brawl, while the latter is more like Olympic boxing, fought with gloves and headgear.
To operate a website, its content must be open to the outside world, and open content is like the scent of carrion and blood drifting across the African savannah: it draws the hyenas straight in.
Striking a balance between keeping content open and not becoming a free data mine for crawlers is a difficult task.
1. Limit how much content is open
Open content is the basis for acquiring users and traffic, so content must be open. But openness is not unlimited openness: unregistered users can view one or two pieces of content, but they should not be able to browse everything without restriction. The restriction can take the form of a login requirement, QR-code scan verification, or a CAPTCHA such as Google reCAPTCHA.
More and more websites now adopt this limited-openness mechanism, Weibo, Zhihu, and Taobao among them. You can see a page or two of content, but if you want to keep going: sorry, please log in.
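A minimal sketch of "read a page or two, then please log in", assuming a Flask app with a signed session; the two-article free quota and route names are illustrative choices, not any particular site's policy.

```python
# Gate article pages behind a small free quota for anonymous visitors.
from flask import Flask, redirect, session, url_for

app = Flask(__name__)
app.secret_key = "replace-with-a-real-secret"  # assumption: real secret in production

FREE_VIEWS = 2  # how many articles an anonymous visitor may open

@app.route("/article/<int:article_id>")
def article(article_id):
    if not session.get("user_id"):              # visitor is not logged in
        viewed = session.get("free_views", 0)
        if viewed >= FREE_VIEWS:
            return redirect(url_for("login"))   # free quota used up
        session["free_views"] = viewed + 1
    return f"Full text of article {article_id}"

@app.route("/login")
def login():
    return "Please log in to keep reading."
```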
2. Record user behavior
Requiring visitors to log in does not solve the problem by itself, because simulated login has always been a popular branch of crawler development: image CAPTCHAs, jigsaw puzzles, sliders, and click-the-character challenges have all been broken, and a crawler paired with a companion app can even relay SMS verification codes back to the website.
So recording user behavior is essential. Every user operation and access must be recorded; this is the raw material for analyzing and dealing with crawlers.
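A sketch of that recording step, again assuming Flask; writing one JSON line per request to a local file is an illustrative choice (real deployments typically ship these records to a log pipeline or database instead).

```python
# Record who accessed what, when, and with which identity, on every request.
import json
import time
from flask import Flask, request

app = Flask(__name__)

@app.before_request
def record_access():
    record = {
        "ts": time.time(),
        "ip": request.remote_addr,
        "path": request.path,
        "method": request.method,
        "user_agent": request.headers.get("User-Agent", ""),
        "user_id": request.cookies.get("user_id"),  # whatever identifies the visitor
    }
    # Append-only JSON-lines file; later analysis jobs read this back.
    with open("access_records.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
```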
3. Crack down hard on high-frequency behavior
In reality, many crawler programs are not run to strip-mine a website's data and content, but simply to make someone's manual collection and collation work easier. This kind of crawling is usually more frequent than manual browsing but far below the hyena-style high-frequency crawl, and it can generally be ignored. Leave others a little leeway today, and you can still meet on good terms tomorrow.
High-frequency crawling that affects the operation of the website's servers, however, must be dealt with: combine the recorded user and IP information and take action against the offending users or IPs.
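One way to act on the recorded behavior, sketched below: once an IP or account crosses a high-frequency threshold, add it to a block list that is consulted on every request. The threshold, window, and in-memory storage are illustrative assumptions.

```python
# Block users or IPs whose hourly request count crosses a hostile threshold.
import time
from collections import defaultdict

HIGH_FREQ_THRESHOLD = 1000   # requests per hour treated as hostile (assumption)
WINDOW = 3600                # one-hour buckets

_hourly_counts = defaultdict(int)   # (key, hour_bucket) -> request count
_blocked = set()                    # blocked IPs and user IDs

def register_request(ip, user_id=None):
    """Record one request; return False if the caller should be rejected."""
    bucket = int(time.time() // WINDOW)
    for key in filter(None, (ip, user_id)):
        if key in _blocked:
            return False
        _hourly_counts[(key, bucket)] += 1
        if _hourly_counts[(key, bucket)] > HIGH_FREQ_THRESHOLD:
            _blocked.add(key)   # both the offending IP and the account get blocked
            return False
    return True
```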
4. Declare your rights in the agreement
The website owner should state in the site's terms or user agreement that normal browsing, access, and data acquisition are allowed, but that the site reserves the right to take further action against high-frequency behavior that threatens the stability of its servers.