Introduction to crawler protocol robots
The full name of the Robots protocol (also known as the crawler protocol or robot protocol) is the "Robots Exclusion Protocol". Through this protocol, a website tells search engines which pages may be crawled and which may not. This article introduces the robots crawler protocol in detail.
The full name of the Robots protocol is the "Robots Exclusion Protocol". Through a robots file, a website tells search engines which pages can be crawled, which cannot, the crawling rules, and so on. The file is placed in the root directory of the website as plain text and can be edited with any common text editor. For a webmaster, a well-written robots.txt makes better use of search engines: it blocks low-quality pages and improves both the quality of the website and its friendliness to search engines.
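For reference, here is a minimal sketch of a complete robots.txt file; the domain www.example.com and the paths below are hypothetical placeholders. The file must be reachable at the site root, e.g. https://www.example.com/robots.txt:

User-agent: *
Disallow: /admin/
Disallow: /tmp/
Sitemap: https://www.example.com/sitemap.xml

The Sitemap line is an optional extension honored by major search engines; it is not required by the protocol itself.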
The specific writing method is as follows (* is a wildcard character):

【User-agent】
The following code matches all search robots:
User-agent: *
The following code matches Baidu's search robot:
User-agent: Baiduspider

【Disallow】
The following code prohibits crawling of the admin directory and everything under it:
Disallow: /admin/
The following code prohibits crawling of all .jpg images on the site:
Disallow: /*.jpg$
The following code prohibits crawling of the adc.html file under the ab directory:
Disallow: /ab/adc.html
The following code prohibits access to all URLs on the site that contain a question mark (?):
Disallow: /*?*
The following code prohibits access to all pages on the site:
Disallow: /

【Allow】
The following code allows access to URLs with the suffix ".html":
Allow: /*.html$
The following code allows the entire tmp directory to be crawled:
Allow: /tmp

【Usage】
The following code allows all robots to access all pages of the site:
User-agent: *
Allow: /
The following code prohibits all search engines from accessing any part of the site:
User-agent: *
Disallow: /
The following code prohibits Baidu's robot from accessing any directory under the site:
User-agent: Baiduspider
Disallow: /
The following code prohibits all search engines from accessing the three directories cgi-bin, tmp, and ~joe:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
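To check how a crawler would interpret such rules, Python's standard urllib.robotparser module can be used. A minimal sketch; the domain www.example.com and the URLs tested are hypothetical:

from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt (example.com is a placeholder domain)
rp = RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# Ask whether a given user agent may fetch a given URL
print(rp.can_fetch("Baiduspider", "https://www.example.com/admin/index.html"))
print(rp.can_fetch("*", "https://www.example.com/index.html"))

can_fetch() returns False for any URL matched by a Disallow rule that applies to the given user agent. Note that the standard-library parser handles plain prefix rules like /admin/; it does not implement the * and $ wildcard extensions.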
[Myth 1]: All files on the website need to be crawled by spiders, so there is no need to add a robots.txt file; if the file does not exist, all search spiders will by default be able to access every page on the website that is not password protected.
Whenever a user attempts to access a URL that does not exist, the server records a 404 error (file not found) in its log. Likewise, whenever a search spider looks for a robots.txt file that does not exist, the server records a 404 error, so a robots.txt file should be added to every website.
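Even a site that wants everything crawled can therefore include a trivial robots.txt. A minimal sketch that allows all robots (an empty Disallow value disallows nothing):

User-agent: *
Disallow: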
[Myth 2]: Setting robots.txt so that search spiders can crawl every file on the website will increase the site's indexing rate. In fact, even if program scripts, style sheets, and similar files are crawled by spiders, they will not increase the site's indexing rate; they only waste server resources. Therefore, robots.txt should be set to prevent search spiders from indexing such files.
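A sketch of such rules, assuming scripts live under /js/ and style sheets under /css/ (the directory names are hypothetical and depend on the site's layout):

User-agent: *
Disallow: /js/
Disallow: /css/
Disallow: /cgi-bin/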