
A brief discussion on crawlers and bypassing website anti-crawling mechanisms

coldplay.xixi (Original) · 2020-08-25 16:50:31 · 3096 views



What is a crawler? To put it simply (and one-sidedly), a crawler is a tool that lets a computer automatically interact with a server to obtain data. At its most basic, a crawler fetches the source code of a web page; going a step deeper, it POSTs data to the page and collects what the server returns in response. In a word, crawlers automatically obtain source data; everything beyond that, such as data processing, is follow-up work. This article focuses on the data-fetching part. When crawling, please pay attention to the website's robots.txt file, and do not let your crawler break the law or cause harm to the website.
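As a minimal sketch of those two basic operations (the requests library is just one common choice; the URLs and form fields below are placeholders):

  import requests

  # the most basic crawl: fetch a page's source code with a GET request
  html = requests.get("https://example.com").text

  # one step deeper: POST data to the site and read the server's reply
  reply = requests.post("https://example.com/login",
                        data={"user": "a", "pwd": "b"}).text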

  A not-so-rigorous example of the anti-crawling and anti-anti-crawling concepts

For many reasons (server load, data protection, and so on), many websites limit what crawlers can do.

Think about it: if a human played the role of a crawler, how would we obtain the source code of a web page? The most common way, of course, is to right-click and view the page source.

What if the website blocks right-clicking?


Then bring out the most useful tool in our crawling toolbox: F12, the browser's developer tools (discussion welcome). Just press F12 to open them (funny).


The source code is out!!

In this human-as-crawler analogy, blocking right-click is the website's anti-crawling strategy, and F12 is our anti-anti-crawling method.

  Now let's talk about real anti-crawling strategies

When writing a crawler, you will inevitably hit cases where no data comes back. Often the server is checking the UA header (User-Agent); this is a very basic form of anti-crawling, and the fix is simply to add a UA header when sending the request... isn't that easy? Adding all of the Request Headers your browser sends is the simple, brute-force version of the same trick...
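A minimal sketch with the requests library (the URL is a placeholder, and the UA string is just one example of a browser-like value):

  import requests

  url = "https://example.com/page"  # placeholder target

  # browser-like headers; copy the full set from the browser's Network tab
  # if the server checks more than the User-Agent
  headers = {
      "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                    "AppleWebKit/537.36 (KHTML, like Gecko) "
                    "Chrome/85.0.4183.83 Safari/537.36",
  }

  resp = requests.get(url, headers=headers)
  print(resp.status_code, resp.text[:200])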

Have you ever noticed that CAPTCHAs (verification codes) are also an anti-crawling strategy? In making sure that a site's users are real people, CAPTCHAs have truly made a great contribution. And alongside CAPTCHAs came CAPTCHA recognition.

Speaking of which, I wonder which came first: CAPTCHA recognition or image recognition?

Recognizing simple CAPTCHAs is easy nowadays; there are plenty of tutorials online, including slightly more advanced concepts such as denoising, binarization, segmentation, and reassembly. But human-machine verification on websites has become more and more terrifying.

Let's briefly talk about the concepts of denoising and binarization.

Binarization turns a CAPTCHA image into a two-tone version of itself: every pixel becomes either pure black or pure white. The simple case is easy; it can be achieved with

 Image.convert("1")

from the Python PIL library.
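As a runnable sketch of that same call (the filename is a placeholder; mode "1" thresholds each pixel to pure black or pure white):

  from PIL import Image

  # mode "1" maps every pixel to pure black or pure white at a fixed threshold
  img = Image.open("captcha.png").convert("1")
  img.save("captcha_bw.png")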

However, if the image is more complex, you still need to think harder: apply the simple method directly and the result is a mess.

So how do you recognize a CAPTCHA like that? This is where denoising comes in handy. Based on the characteristics of the CAPTCHA itself, you can work out the background color and the RGB values of everything other than the font, collapse all of those values into a single color, and leave the font alone. Sample code follows; just change the colors to match your CAPTCHA.

  import numpy as np
  from PIL import Image

  image = Image.open("captcha.png").convert("RGB")  # hypothetical filename
  arr = np.array(image)             # RGB matrix with shape (height, width, 3)
  background = [190, 190, 190]      # the CAPTCHA's background RGB, measured beforehand

  for y in range(0, image.size[1]):        # rows (image height)
      for x in range(0, image.size[0]):    # columns (image width)
          pixel = arr[y][x].tolist()
          if pixel == background:          # background color -> black
              arr[y][x] = 0
          elif pixel[0] in range(200, 256) and pixel[1] in range(200, 256) and pixel[2] in range(200, 256):
              arr[y][x] = 0                # near-white noise -> black
          elif pixel == [0, 0, 0]:         # already black -> leave black
              arr[y][x] = 0
          else:
              arr[y][x] = 255              # everything else (the font) -> white

Here arr is the numpy matrix built from the image's RGB values. Readers can try to improve the code and experiment for themselves.

After careful processing, the picture can be cleaned up to the point where the recognition rate is quite high.

As CAPTCHAs have evolved, ready-made "wheels" (libraries) have appeared online for fairly clear digits and letters and for simple arithmetic CAPTCHAs. For harder digits, letters, and Chinese characters you can build your own wheel (like the above), but anything beyond that is enough work to fill an artificial-intelligence project... (there are even jobs that consist of nothing but recognizing CAPTCHAs...)
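One such ready-made wheel is the Tesseract OCR engine, reachable from Python through the pytesseract package. A minimal sketch, assuming Tesseract is installed on the system and the input is the cleaned image from the step above:

  from PIL import Image
  import pytesseract  # requires the Tesseract OCR engine on the system

  img = Image.open("captcha_bw.png")  # the denoised, binarized CAPTCHA

  # --psm 7 treats the image as a single text line; the whitelist cuts misreads
  text = pytesseract.image_to_string(
      img,
      config="--psm 7 -c tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ",
  )
  print(text.strip())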

A little tip: some websites have a CAPTCHA on the PC version but not on the mobile version...

 Next topic!

One of the most common anti-crawling strategies is IP blocking: make too many requests in a short time and your IP gets banned. This one is easy to handle: limit your access frequency, or add an IP proxy pool. You can also go distributed... not used that often, but it works.
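A minimal sketch combining both fixes (the proxy addresses are placeholders; real ones come from a paid service or a self-maintained, verified pool):

  import random
  import time
  import requests

  # hypothetical proxy pool; replace with live, verified proxies
  PROXY_POOL = [
      "http://111.0.0.1:8080",
      "http://122.0.0.2:3128",
  ]

  def fetch(url):
      proxy = random.choice(PROXY_POOL)  # rotate IPs across requests
      resp = requests.get(url, proxies={"http": proxy, "https": proxy},
                          timeout=10)
      time.sleep(random.uniform(1, 3))   # throttle the request rate
      return resp.text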

Another anti-crawling strategy involves asynchronously loaded data. As your crawling goes deeper (or rather, as websites get updated!), asynchronous loading is a problem you will definitely run into, and the solution is still F12. Take the NetEase Cloud Music site as an example: right-click to view the page source, then try searching for the comments.

Where is the data?! That is the hallmark of asynchronous loading since the rise of JS and Ajax. But open F12, switch to the Network tab, refresh the page, and search carefully: there are no secrets.
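A hedged sketch of replaying such a request directly (the endpoint and parameter names below are placeholders; a real site like NetEase Cloud Music additionally encrypts its POST parameters, which takes extra reverse-engineering):

  import requests

  # placeholder endpoint copied from the browser's Network tab
  api_url = "https://example.com/api/comments"
  headers = {"User-Agent": "Mozilla/5.0", "Referer": "https://example.com/"}
  params = {"id": "12345", "offset": 0, "limit": 20}  # hypothetical fields

  resp = requests.get(api_url, headers=headers, params=params)
  print(resp.json())  # async endpoints usually return JSON, not HTML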

Oh, and by the way: if a song is playing, you can click into its request here and download the file...

This is only to explain how the website is structured. Please consciously resist piracy, protect copyright, and protect the interests of original creators.

What if the website still manages to shut you out? There is one last plan, an invincible combination: selenium + PhantomJS.

This combination is very powerful and can simulate browser behavior perfectly; search online for specific usage. This method is not recommended, though, because it is very cumbersome; it's mentioned here only for completeness. (Note that PhantomJS itself is no longer maintained; a headless browser such as headless Chrome now fills the same role.)
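A minimal sketch of the idea, assuming a local chromedriver and using headless Chrome in place of the abandoned PhantomJS (the URL is a placeholder):

  from selenium import webdriver
  from selenium.webdriver.chrome.options import Options

  options = Options()
  options.add_argument("--headless")   # run the browser without a window

  driver = webdriver.Chrome(options=options)
  driver.get("https://example.com")    # placeholder URL
  print(driver.page_source[:200])      # the fully rendered source, JS included
  driver.quit()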

