Sometimes you might not want search engines to crawl or index certain pages of your website – for example, if they contain confidential documents, are meant only for registered users, or are out of date.
Normally, you can achieve this with techniques such as meta robots tags, robots.txt, or 301 redirects. Unfortunately, different search engines follow different rules – and this is especially true of Baidu. If you want your site crawled and indexed correctly by Baidu, you have to consider what Baidu actually supports. In this blog post I will summarise what Baidu supports compared to Google.
Canonical tag

You should use a canonical tag when you have two very similar or duplicate pages and want to keep both of them, but do not want to be penalised for duplicate content. Adding a canonical tag to the source code tells search engines which version of the page should be indexed.
Both Baidu and Google support canonical tags, but neither complies with them 100% of the time, so it is best to use canonical tags only on duplicate or very similar pages. If Baidu finds that your site uses canonical tags incorrectly, it will simply ignore all of the canonical tags on your site.
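As a minimal sketch (the URL here is just a placeholder), the tag is a single line placed in the <head> of the duplicate page, pointing at the version you want indexed:

```html
<!-- In the <head> of the duplicate page: point search engines at the preferred version -->
<link rel="canonical" href="http://www.yourdomain.com/preferred-page/" />
```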
Meta robots

By adding a meta robots tag to the source code, you can tell robots how to handle certain pages on your site, such as disallowing search engines from following the links on a page, indexing it, or showing a snippet of it in the search engine results pages. At the moment, Baidu only supports “nofollow” and “noarchive”; as far as I can tell, it does not yet support the rest. This means that even if you implement “noindex” on a page, Baidu will still index it. Google, on the other hand, supports all of the above.
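A minimal sketch of the tag, again placed in the <head>; the comments reflect the support described above:

```html
<!-- Honoured by both Google and Baidu -->
<meta name="robots" content="nofollow, noarchive">

<!-- Honoured by Google, but ignored by Baidu at the time of writing -->
<meta name="robots" content="noindex">
```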
301 redirect

301 redirects are used for permanent URL redirection. When you retire a page, a 301 redirect points visitors and search engines to its replacement. Both Baidu and Google support 301 redirects.
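How you implement a 301 depends on your server. As a sketch on Apache (the paths are placeholders; nginx and other servers have their own equivalent directives):

```apache
# .htaccess: permanently redirect the retired page to its replacement
Redirect 301 /old-page.html http://www.yourdomain.com/new-page.html
```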
Robots.txt

Robots.txt is a text file placed in the root of the website. You need it when you want to exclude some of your site's content from the search engines. Both Baidu and Google support Robots.txt.
Robots.txt is also how you target Baidu's individual spiders. Baidu operates a number of them – Baiduspider for web search, plus dedicated crawlers such as Baiduspider-image, Baiduspider-video and Baiduspider-news – and you can disallow any of them by adding the corresponding directives to Robots.txt. For example, if you want Baidu to crawl only the video files in the folder www.yourdomain.com/video/, you can add the following to Robots.txt.
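A sketch of the directives, using Baidu's documented user-agent names (adjust the paths to your own site): block the general web spider everywhere, and let only the video spider into /video/.

```
# Block Baidu's general web search spider from the whole site
User-agent: Baiduspider
Disallow: /

# Allow only the video spider, and only inside the /video/ folder
User-agent: Baiduspider-video
Allow: /video/
Disallow: /
```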
HTTP and HTTPS
I also want to mention Hypertext Transfer Protocol Secure (HTTPS). Unlike Google, Baidu does not index HTTPS pages. To get such content indexed, Baidu advises webmasters to create a corresponding HTTP page with the same content as the HTTPS page and to redirect Baidu Spiders to the HTTP page, which Baidu can then index.
Alipay is a good example of this approach: Baidu has cached Alipay's HTTP page, while ordinary visitors to that page are redirected to the HTTPS version.
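As a sketch, this pattern can be implemented on an Apache server with mod_rewrite (hypothetical rules; the domain is a placeholder, and other servers have their own equivalents): redirect ordinary visitors to HTTPS, but leave Baidu's spiders on the HTTP pages so they remain crawlable.

```apache
RewriteEngine On
# Send ordinary visitors from HTTP to HTTPS...
RewriteCond %{HTTPS} off
# ...but not Baidu's spiders, so the HTTP pages stay crawlable
RewriteCond %{HTTP_USER_AGENT} !Baiduspider [NC]
RewriteRule ^(.*)$ https://www.yourdomain.com/$1 [R=301,L]
```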
Personally, I do not think this is a very good method. Since Google does index HTTPS pages, the duplicated HTTP/HTTPS content is likely to cause a duplicate content issue on Google. So if you have content that needs to be indexed by both Google and Baidu, I suggest serving it on an HTTP page only.