Choosing Robots.txt or the Noindex tag
Got into a brief discussion today about whether or not certain pages should only be linked to with a rel=nofollow tag, excluded in the robots.txt file, and/or should have a meta noindex tag on the page.
Is going with all 3 overkill? Which ones are necessary? And what’s the easiest best practice to implement for a site with over 800 pages?
First of all, we should clarify what exactly these three choices are.
rel=nofollow tells a search engine not to follow a particular link. It not only prevents PageRank from flowing through that link, but it also keeps the linked page from being indexed IF the search spider doesn’t find any other links to it.
Robots.txt exclusions tell a search spider not to crawl or access that particular page.
META NoIndex tells a search engine not to list that page in its search results.
These may all sound similar, but there are some very subtle differences here. To help understand these differences, it’s best to understand what type of pages we’d likely apply them to.
Examples of pages you don’t want indexed or crawled include:
- Search results pages
- Thank you pages
- Error pages
- Steps in a signup process
- Any other page you wouldn’t want a user to start on or see out of context
Basically, if (by some odd twist of fate) a user searches for something and lands on my “thank you, you have been unsubscribed from my newsletter” page, that user is going to be lost to me. Additionally, they’re going to be confused as hell about the content of the page. Did it really unsubscribe them from something?
The old school way of preventing this was simply to list the page in robots.txt so that the spiders couldn’t crawl it, but that alone isn’t enough. Looking back at our definitions above, robots.txt only says not to crawl a page. It doesn’t say anything about listing it in the search results, and that’s exactly what happens. If somebody else links to a page that’s forbidden in your robots.txt file, search engines may still show that page’s URL in their results pages. They won’t have any information about it, but it will still be possible for users to click the link.
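To make the "crawl, not index" distinction concrete, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt content and the example.com URLs are made up for illustration. All the parser can answer is "may I fetch this URL?"; nothing in the file says whether the URL may appear in search results.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths are invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /unsubscribed.html
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# can_fetch() only answers the crawl question for a given user agent and URL.
may_crawl_unsubscribe = parser.can_fetch("*", "https://example.com/unsubscribed.html")
may_crawl_about = parser.can_fetch("*", "https://example.com/about.html")

print("unsubscribed.html crawlable:", may_crawl_unsubscribe)
print("about.html crawlable:", may_crawl_about)
```

Note that the Disallow lines themselves are publicly readable, so the file doubles as a list of the pages you were trying to hide.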
The other problem is that suddenly all of your form action & result pages are listed in robots.txt. This can provide valuable information to attackers and other people interested in compromising your website. For that reason, I prefer not to use robots.txt for this task.
rel=nofollow avoids the list of pages that robots.txt creates, but it’s also not very effective at keeping pages out of the search results. The problem with rel=nofollow is that it just doesn’t scale. I can tell the search engines not to follow my link, but what about somebody else who links to that page? I can’t count on that not happening, and I certainly can’t count on them to put nofollow on their link either.
That’s where the Meta NoIndex tag comes in. No matter how the spider ends up on the page or who linked to it, the NoIndex tag will always be there to tell the search engines not to index this page. The one catch is that the spider has to be able to crawl the page to see the tag, so the page must not also be blocked in robots.txt. Additionally, search spiders will still crawl the page and follow any links on it. This can be useful to people trying to manually shape their PageRank flow.
For those of you curious, the tag looks like this:
<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
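On a site with hundreds of pages it helps to check that the tag actually made it onto each page. The following is a rough sketch using Python's standard html.parser; the class and function names (RobotsMetaParser, robots_directives) are my own, not part of any library, and it assumes you already have the page's HTML as a string.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values.
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            for directive in content.split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def robots_directives(html):
    """Return the set of robots meta directives found in the HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW"></head></html>'
print("noindex" in robots_directives(page))
```

Matching is case-insensitive here, so the uppercase form above and a lowercase version of the tag are treated the same.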
So what do I do?
I use a two-fold method. First, I put a meta noindex tag on any page I don’t want indexed. Second, I put rel=nofollow on any links to that page from my website. This way, I keep my PageRank flow how I want it and prevent my confirmation pages from being listed in search engines.
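The second half of that method is easy to miss on a large site, so here is a hedged sketch of an audit for it, again with Python's standard html.parser. The names LinkAuditor and links_missing_nofollow are hypothetical, the paths are invented, and it assumes internal links use the same href form as the list of noindexed pages you pass in.

```python
from html.parser import HTMLParser


class LinkAuditor(HTMLParser):
    """Records links to noindexed pages that lack rel=nofollow."""

    def __init__(self, noindexed_paths):
        super().__init__()
        self.noindexed_paths = set(noindexed_paths)
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href") or ""
        # rel can hold several space-separated tokens, e.g. "nofollow noopener".
        rel_tokens = (attr.get("rel") or "").lower().split()
        if href in self.noindexed_paths and "nofollow" not in rel_tokens:
            self.missing.append(href)


def links_missing_nofollow(html, noindexed_paths):
    """Return hrefs that point at noindexed pages without rel=nofollow."""
    auditor = LinkAuditor(noindexed_paths)
    auditor.feed(html)
    return auditor.missing


# Hypothetical page and paths, for illustration only.
sample = (
    '<a href="/about.html">About</a>'
    '<a href="/thank-you.html" rel="nofollow">Thanks</a>'
    '<a href="/unsubscribed.html">Unsubscribe</a>'
)
print(links_missing_nofollow(sample, ["/thank-you.html", "/unsubscribed.html"]))
```

Running something like this over each template, together with a check for the noindex tag on the target pages, covers both steps.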
April 6th, 2009