All too often, I come across forum messages, blog posts, and SEO novices claiming you can stop your pages from showing up in search engine results pages, or SERPs, by including the page you wish to omit in a “disallow” statement in the robots.txt file.
I am here to let the Internet know, once and for all, that this is wrong. But before you begin to panic, fear not. I’m also going to tell you about a solution that works.
Before we get there, some of you might be asking yourselves, “Why would someone want to prevent their page from showing up in the SERPs to begin with?” Some examples of pages you typically don’t want floating around the Web include:
- Thank you pages
- Your 404 page
- Search results pages
- Paid search landing pages
- Pages with confidential information
These items generally lead to a confusing or poor user experience if reached from search.
So why doesn’t a Robots Disallow Statement keep your pages out of search results?
A disallow statement in the robots.txt file stops search engines from crawling the page, so they never see its content or code. It does not stop them from listing the URL. Disallowed pages can still show up in search results, typically without a description, if external sources link to them and people visit them often. Google basically thinks, “I don’t know what’s on this page, but people seem to like it. Must be important. Let’s rank it!”
A “noindex” robots meta tag in the page’s head section (<meta name="robots" content="noindex">) tells search engines not to list or rank the page.
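To make the crawl-versus-index distinction concrete, here is a minimal sketch using Python's standard-library robots.txt parser. The robots.txt contents, the example.com URLs, and the may_crawl helper are all made up for illustration. Notice that the only question robots.txt can answer is "may this crawler fetch this URL?" It says nothing about whether the URL may appear in results.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site (not from any real site).
ROBOTS_TXT = """\
User-agent: *
Disallow: /thank-you/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_crawl(url, user_agent="*"):
    """Return True if robots.txt lets the given crawler fetch the URL."""
    return parser.can_fetch(user_agent, url)


# The disallowed page cannot be fetched, so its content is never read.
print(may_crawl("https://example.com/thank-you/"))  # False
# Everything else is fair game for the crawler.
print(may_crawl("https://example.com/pricing/"))    # True
```

A False here only means the crawler stays out of the page. If other sites link to that URL, the search engine can still learn it exists and list it.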
Doing both is not better. If a page is disallowed, search engines will not be able to read the “noindex” code that you’ve placed on the page.
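If you want to check your own pages for this conflict, something like the following rough sketch could help. It uses only Python's standard library; the audit function, its return labels, and the sample robots.txt and HTML are my own assumptions for illustration, and it only looks at the robots meta tag, not at any other signal a search engine might use.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            for directive in content.split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def audit(url, robots_txt, html):
    """Classify a page by whether crawlers can actually see its noindex tag."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    crawlable = rules.can_fetch("*", url)

    finder = RobotsMetaFinder()
    finder.feed(html)
    has_noindex = "noindex" in finder.directives

    if has_noindex and not crawlable:
        return "conflict"         # noindex exists, but crawlers cannot read it
    if has_noindex:
        return "noindex-visible"  # crawlers can read the tag and drop the page
    if not crawlable:
        return "blocked-only"     # not crawled, but the URL may still be listed
    return "indexable"


# Hypothetical sample data.
PAGE = '<html><head><meta name="robots" content="noindex, nofollow"></head><body>Thanks!</body></html>'
BLOCKING = "User-agent: *\nDisallow: /thank-you/\n"
NON_BLOCKING = "User-agent: *\nDisallow: /search/\n"

print(audit("https://example.com/thank-you/", BLOCKING, PAGE))      # conflict
print(audit("https://example.com/thank-you/", NON_BLOCKING, PAGE))  # noindex-visible
```

Any page that comes back as "conflict" is the doing-both case: the fix is to remove the disallow rule so the noindex tag can be read.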
Don’t feel bad if you’ve confused the two. Even Google gets it wrong.
Wrong… but let’s hear them out.
A little better, but still not correct. Password protection will keep your information safe, but users may still be able to find the site in SERPs and will simply hit a login screen instead of actual content. Also, as stated before, Web crawlers and search engines will not be able to read a page’s noindex meta tag if that page is disallowed in robots.txt.
The official documentation from Google Webmaster Support.
This Robots.txt tutorial from SEOBook gives specific examples of format, wildcard matching, and information about “user-agents” or types of robots.
Removing Pages from Index
So, what can you do if pages you never wanted indexed are already showing up in search? Bing and Google have “block URLs” and “remove URLs” capability in Webmaster Tools that you can use to block specific pages or an entire directory. In the case of Bing, this will take the page or entire site out of search results within 24 hours, and will keep it out for 90 days. Use this time to properly noindex the pages using meta tags, and you can request an extension on block URLs if necessary.
The noindex meta tag is the best way to keep your pages out of search, NOT a robots.txt disallow. Do not use both as an extra precaution, because the disallow keeps search engines from ever reading the noindex tag.
Hope this cleared up some of the confusion. Feel free to leave comments, questions, or call out some examples of other people confusing the two!