
Introduction to the robots crawler protocol

巴扎黑
Release: 2017-07-19 15:47:50

Foreword

The Robots protocol (also known as the crawler protocol, robot protocol, etc.), whose full name is "Robots Exclusion Protocol", is how websites tell search engines which pages may be crawled and which may not. This article introduces the crawler protocol robots in detail.

The full name of the Robots protocol is "Robots Exclusion Protocol". Through the robots file, a website tells search engines which pages may be crawled and which may not, along with the applicable crawling standards. The file is placed in the root directory of the website as plain text and can be modified and edited with any common text editor. For webmasters, writing the robots.txt file sensibly puts search engines to better use, blocks some low-quality pages, and improves both the quality of the website and its friendliness to search engines.

The specific writing method is as follows:

(* is a wildcard character)

User-agent: * matches all search engine types.

Disallow: /admin/ prohibits crawling of the directories under the admin directory.

Disallow: /require/ prohibits crawling of the directories under the require directory.

Disallow: /ABC/ prohibits crawling of the directories under the ABC directory.

Disallow: /cgi-bin/*.htm prohibits access to all URLs with the ".htm" suffix under the /cgi-bin/ directory (including subdirectories).

Disallow: /*?* prohibits access to all URLs on the site that contain a question mark (?).

Disallow: /*.jpg$ prohibits crawling of all .jpg images on the site.

Disallow: /ab/adc.html prohibits crawling of the adc.html file under the ab directory.

Allow: /cgi-bin/ allows crawling of the directories under the cgi-bin directory.

Allow: /tmp allows crawling of the entire tmp directory.

Allow: /*.htm$ allows access to URLs with the ".htm" suffix.

Allow: /*.gif$ allows crawling of .gif images on web pages.

Sitemap: <sitemap URL> tells the crawler where the site's sitemap is located.
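Putting these directives together, a complete robots.txt might look like the following sketch (a hypothetical example; the paths and the sitemap URL are placeholders, not taken from any real site):

User-agent: *
Disallow: /admin/
Disallow: /*?*
Disallow: /*.jpg$
Allow: /tmp
Sitemap: http://www.example.com/sitemap.xml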

Overview

robots.txt is a plain-text file and is the first file that a search engine looks at when it visits a website. The robots.txt file tells the spider which files on the server may be viewed.

When a search spider visits a site, it first checks whether robots.txt exists in the site's root directory. If it does, the spider determines its crawling scope from the file's contents; if the file does not exist, every search spider can access all pages on the website that are not password protected.
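As an added illustration (not from the original article), Python's standard-library urllib.robotparser module reproduces exactly this check; the site URL below is a placeholder:

import urllib.robotparser

# Point the parser at the site's robots.txt (placeholder URL)
rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()  # fetch and parse the file; if it does not exist, everything is allowed

# Ask whether a given user agent may fetch a given URL
print(rp.can_fetch("*", "http://www.example.com/admin/index.html"))
print(rp.can_fetch("Baiduspider", "http://www.example.com/index.html"))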

[Principle]

The Robots protocol is a common code of ethics in the international Internet community. It is established based on the following principles:

1. Search technology should serve human beings, while respecting the wishes of information providers and maintaining their privacy rights;

2. Websites have the obligation to protect their users' personal information and privacy from infringement

[Note] robots.txt must be placed in the root directory of a site (e.g. http://www.example.com/robots.txt), and the file name must be all lowercase

Writing

[User-agent]

In the following code, * is a wildcard character representing all search engine types, that is, every search robot:

User-agent: *
The following code represents Baidu’s search robot

User-agent: Baiduspider
[Disallow]

The following code prohibits crawling of the directories under the admin directory

Disallow: /admin/
The following code prohibits crawling of all .jpg images on the site

Disallow: /*.jpg$
The following code prohibits crawling of the adc.html file under the ab directory

Disallow: /ab/adc.html
The following code means that access to all URLs containing question marks (?) in the website is prohibited

Disallow: /*?*
The following code indicates that access to all pages in the website is prohibited

Disallow: /
[Allow]

The following code indicates that access to URLs with the suffix ".html" is allowed

Allow: /*.html$
The following code indicates that the entire directory of tmp is allowed to be crawled

Allow: /tmp
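Allow is normally used together with Disallow. One nuance worth knowing (this note is an addition, based on Google's documented longest-match rule; other engines may resolve conflicts differently): when an Allow rule and a Disallow rule both match a URL, the more specific (longer) matching path wins. For example:

User-agent: *
Disallow: /admin/
Allow: /admin/public/

Here /admin/public/ remains crawlable even though the /admin/ directory as a whole is disallowed.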

Usage

The following code allows all robots to access all pages of the website

User-agent: *
Allow: /
The following code indicates that all search engines are prohibited from accessing any part of the website

User-agent: *
Disallow: /
The following code prohibits Baidu's search robot from accessing any directory on the website

User-agent: Baiduspider
Disallow: /
The following code prohibits all search engines from accessing the files in the three directories of the website: cgi-bin, tmp, and ~joe

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
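To verify rules like these without deploying a file, Python's urllib.robotparser can parse a rule set directly. The snippet below (an illustration added here, not from the original article) feeds it the rules above and tests a few paths:

import urllib.robotparser

# The rule set from the example above, as a string (no network access needed)
rules = """User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "/cgi-bin/script.cgi"))  # False: /cgi-bin/ is disallowed
print(rp.can_fetch("*", "/index.html"))          # True: no rule matches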

Myths

[Myth 1]: All files on the website need to be crawled by spiders, so there is no need to add a robots.txt file; after all, if the file does not exist, all search spiders can by default access every page on the website that is not password protected.

Whenever a user attempts to access a URL that does not exist, the server records a 404 error (file not found) in its log. Likewise, whenever a search spider looks for a robots.txt file that does not exist, the server also records a 404 error in its log, so a robots.txt file should be added to the website.

[Myth 2]: If robots.txt lets search spiders crawl every file on the site, the site's indexing rate will increase. In fact, even if a site's program scripts, style sheets, and other such files were indexed by spiders, they would not improve the site's inclusion rate; they would only waste server resources. Therefore, robots.txt should explicitly stop search spiders from indexing these files.
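As an illustration of that advice (the directory names here are placeholders; adjust them to the site's actual layout), such a rule set might look like:

User-agent: *
Disallow: /js/
Disallow: /css/
Disallow: /cgi-bin/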

The above is the detailed content of Introduction to the robots crawler protocol. For more information, please follow other related articles on the PHP Chinese website!
