More and more websites are adopting a single-page structure (the "single-page application").
The entire site consists of one web page, and Ajax is used to load different content in response to user input.
The advantages of this approach are a good user experience and bandwidth savings; the disadvantage is that search engines cannot crawl the AJAX content. Suppose, for example, that you have a website:
<code> http://example.com </code>
Users see different content through URLs with a pound sign structure:
<code> http://example.com#1<br> http://example.com#2<br> http://example.com#3 </code>
However, search engines crawl only example.com and ignore everything after the pound sign, so the content never gets indexed.
To solve this problem, Google proposed the "pound sign + exclamation mark" structure (the so-called hashbang).
<code> http://example.com#!1 </code>
When Google finds a URL like the one above, it automatically crawls another URL instead:
<code> http://example.com/?_escaped_fragment_=1 </code>
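To make the mapping concrete, here is a small helper that reproduces the rewrite Google applies (the function itself is hypothetical, purely for illustration):
<code>// Hypothetical helper, for illustration only: the '#!' fragment<br>// becomes the value of the _escaped_fragment_ query parameter.<br>function escapedFragmentURL(link) {<br>  return link.replace(/\/?#!(.*)$/, '/?_escaped_fragment_=$1');<br>}<br>// escapedFragmentURL('http://example.com#!1')<br>// returns 'http://example.com/?_escaped_fragment_=1'</code>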
As long as you serve the AJAX content at that escaped URL, Google will index it. The problem is that "pound sign + exclamation mark" URLs are ugly and cumbersome. Twitter once used this structure; it turned
<code> http://twitter.com/ruanyf </code>
into
<code> http://twitter.com/#!/ruanyf </code>
As a result, users complained endlessly, and Twitter abolished it after only half a year.
So, is there a way to keep URLs intuitive while still letting search engines crawl the AJAX content?
I always thought there was no way, until two days ago I saw the solution of Robin Ward, one of the founders of Discourse, and couldn't help being amazed.
Discourse is a forum program that relies heavily on Ajax, yet it must let Google index its content. Its solution is to abandon the pound sign structure entirely and adopt the History API.
The so-called History API means changing the URL shown in the browser's address bar without refreshing the page (to be precise, it changes the current state of the web page). Here is an example: you click a button to start playing music, then click a link, and see what happens.
The URL in the address bar has changed, but the music playback is not interrupted!
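A minimal sketch of what such a demo might look like (the element IDs, the audio markup, and the target URL are all assumptions, not from the original demo):
<code>// Assumed markup: <audio id="music" src="song.mp3"></audio>,<br>// a #play button and an #other-link anchor.<br>document.getElementById('play').addEventListener('click', function() {<br>  document.getElementById('music').play(); // music starts...<br>});<br>document.getElementById('other-link').addEventListener('click', function(e) {<br>  e.preventDefault();<br>  // ...and keeps playing: only the address bar changes, no page reload<br>  window.history.pushState(null, null, '/new-url');<br>});</code>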
A detailed introduction to the History API is beyond the scope of this article. Simply put, its function is to add a record to the browser's history object.
<code> window.history.pushState(stateObject, title, url); </code>
The line above makes a new URL appear in the address bar. The pushState method of the history object accepts three parameters; the new URL is the third, and the first two may be null.
<code> window.history.pushState(null, null, newURL); </code>
Currently, all major browsers support this method: Chrome (26.0+), Firefox (20.0+), IE (10.0+), Safari (5.1+), Opera (12.1+).
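For anything older, it is worth feature-testing before relying on it. A common pattern (my addition, not part of Robin Ward's method):
<code>if (window.history && window.history.pushState) {<br>  // safe to use the History API<br>} else {<br>  // fall back to full page loads (or to the hash-based scheme)<br>}</code>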
Here’s Robin Ward’s method.
First, use the History API to replace the pound sign structure, turning each pound sign URL into a normal path URL, so that search engines will crawl every page:
<code> example.com/1<br> example.com/2<br> example.com/3 </code>
Then, define a JavaScript function to handle the Ajax part, fetching content based on the URL (jQuery is assumed):
<code>function anchorClick(link) {<br>  // take the last path segment, e.g. '1' from 'example.com/1'<br>  var linkSplit = link.split('/').pop();<br>  // fetch the matching content and inject it into the page<br>  $.get('api/' + linkSplit, function(data) {<br>    $('#content').html(data);<br>  });<br>}</code>
Next, redefine the click event for links:
<code>$('#container').on('click', 'a', function(e) {<br>  // put the new URL in the address bar without reloading<br>  window.history.pushState(null, null, $(this).attr('href'));<br>  // load the corresponding content via Ajax<br>  anchorClick($(this).attr('href'));<br>  // keep the browser from following the link normally<br>  e.preventDefault();<br>});</code>
Also take into account users clicking the browser's "forward/back" buttons, which triggers the history object's popstate event:
<code>window.addEventListener('popstate', function(e) {<br>  // re-run the Ajax load for whatever URL the browser moved to<br>  anchorClick(location.pathname);<br>});</code>
With the three pieces of code above in place, you get normal path URLs and AJAX content without ever refreshing the page.
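One more case worth covering (my addition; the outline above does not spell it out): a user may land directly on a deep URL such as example.com/2, so the content should also be fetched once on page load:
<code>$(function() {<br>  // if the user arrived on a deep URL, load its content immediately<br>  if (location.pathname !== '/') {<br>    anchorClick(location.pathname);<br>  }<br>});</code>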
Finally, set up the server side.
Because the pound sign structure is no longer used, every URL is a distinct request. The server must therefore return a page with the following structure for all of these requests, to avoid 404 errors:
<code> <html><br> <body><br> <section id='container'></section><br> <noscript><br> ... ...<br> </noscript><br> </body><br> </html></code>
Look carefully at the code above and you will notice a noscript tag; that is the secret.
We put all the content we want search engines to index inside the noscript tag. Users can still perform AJAX operations without refreshing the page, while search engines index the main content of every page!
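As a sketch of what the server side might look like (assuming a Node.js/Express 4 server; loadContent is a hypothetical helper that returns the HTML fragment for a page, the same fragment served under api/):
<code>var express = require('express');<br>var app = express();<br><br>// Ajax endpoint used by anchorClick()<br>app.get('/api/:id', function(req, res) {<br>  res.send(loadContent(req.params.id)); // loadContent is hypothetical<br>});<br><br>// every path returns the same shell, with the real content in <noscript><br>app.get('/:id?', function(req, res) {<br>  res.send('<html><body>' +<br>    "<section id='container'></section>" +<br>    '<noscript>' + loadContent(req.params.id || '1') + '</noscript>' +<br>    '</body></html>');<br>});<br>app.listen(3000);</code>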