Robots.txt
Publish date: 23/04/2008
robots.txt is a file normally only used by crawlers, bots or spiders. These are computer programs used by search engines like Google, Yahoo! and MSN Search to scan the internet for interesting pages, so they can index those pages and show them as search results.
This is good, but sometimes you don't want things to be indexed.
For example the directory where your pics are, or perhaps your admin area?
With robots.txt you can tell this to the search engines.

Location
robots.txt must be placed in the root directory of your site.
Any other location is not valid.

http://www.example.com/robots.txt       OK
http://www.example.com/site/robots.txt  NOT OK

If you don't have access to your root directory you can use the robots meta-tag.
Content
In its basic form you can add the following 2 lines.
This tells all robots to exclude the entire site from indexing.
User-agent: *
Disallow: /
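If you have Python handy, you can sanity-check rules like these with the standard library's urllib.robotparser before publishing them. A quick sketch (www.example.com is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# Feed the two-line robots.txt directly instead of downloading it.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /",
])

# Every URL on the site is now off-limits, for every robot.
print(rp.can_fetch("Googlebot", "http://www.example.com/index.html"))  # False
print(rp.can_fetch("Slurp", "http://www.example.com/"))                # False
```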
Of course no webmaster really wants this.

With User-agent: you can identify which robot can do what. The biggest names are Googlebot, MSNBot and Slurp (Yahoo!), but there are many more. A full list can be found at http://www.robotstxt.org/db.html
But mostly you will use User-agent: *, which means all robots.

Folders
With Disallow: you can block out entire directories (folders).
A realistic example:

User-agent: *
Disallow: /tmp
Disallow: /cgi-bin
Disallow: /folder/images

Note: The URLs are case-sensitive! So Disallow: /tmp is not the same as Disallow: /TMP.
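You can see both the prefix matching and the case-sensitivity with urllib.robotparser (a sketch with a made-up site):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /tmp",
])

# Disallow is a prefix match on the path, and case matters:
print(rp.can_fetch("*", "http://www.example.com/tmp/cache.html"))  # False: under /tmp
print(rp.can_fetch("*", "http://www.example.com/TMP/cache.html"))  # True: /TMP is not /tmp
```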
Files
You can also specify files:
Again it is case-sensitive.
User-agent: *
Disallow: /secret/secretfile.html

Some crawlers also let you use wildcards:

User-agent: *
Disallow: /*_print*.html
Disallow: /*?sessionid

Note: A trailing * is not needed. It is always assumed.
- The first rule disallows html files that have "_print" in the file name or in a directory name. Example: /card_print.html or /store_print/index.html. (But it does allow /card_print.php.)
- The second rule disallows files with "?sessionid" in them. Example: /cart.php?sessionid=344ba33

If you don't want that prefix behaviour, some crawlers also let you use $, which anchors the end of a URL (Slurp, Yahoo!'s crawler, allows this).
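The * and $ semantics described above can be sketched as a translation to regular expressions. This is only an illustration of the matching rules, not any crawler's actual code:

```python
import re

def pattern_to_regex(pattern):
    # Translate a robots.txt path pattern into a regular expression:
    # '*' matches any run of characters, '$' anchors the end of the URL,
    # and everything else is a prefix match (an implied trailing '*').
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.compile(regex)

print(bool(pattern_to_regex("/*_print*.html").match("/card_print.html")))  # True
print(bool(pattern_to_regex("/*_print*.html").match("/card_print.php")))   # False
print(bool(pattern_to_regex("/*.gif$").match("/images/logo.gif")))         # True
print(bool(pattern_to_regex("/*.gif$").match("/images/logo.gif?x=1")))     # False
```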
The following disallows all gif files from being indexed:
User-agent: Slurp
Disallow: /*.gif$

Allow
Until now we only used Disallow:, but there is also Allow:. It is not used much, because when a crawler doesn't find a match in robots.txt it assumes that the URL is allowed.
This allows for a more secure setup. Yes, the bad guys know this too, and what can be more interesting than the files that are not allowed?
Example:

User-agent: *
Disallow: /org/plans.html
Allow: /org/
Allow: /public
Disallow: /

Important to know is that crawlers stop when they find a match. So if you would start with Disallow: /, nothing would get indexed no matter what you put below it, because everything starts with /.
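The first-match behaviour of the example above can be sketched with a tiny matcher. This is just an illustration of the rule order, assuming simple prefix matching (real crawlers differ in details):

```python
# The rules from the example, in order.
RULES = [
    ("Disallow", "/org/plans.html"),
    ("Allow", "/org/"),
    ("Allow", "/public"),
    ("Disallow", "/"),
]

def allowed(path):
    # Stop at the first rule whose path is a prefix of the URL path.
    for verb, rule_path in RULES:
        if path.startswith(rule_path):
            return verb == "Allow"
    return True  # no match means allowed

print(allowed("/org/plans.html"))  # False: first rule wins
print(allowed("/org/about.html"))  # True:  matches Allow: /org/
print(allowed("/public_html/x"))   # True:  matches Allow: /public
print(allowed("/index.html"))      # False: falls through to Disallow: /
```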
For this example only the files in /org/ and files starting with /public will get indexed, except /org/plans.html.

Sitemaps
If you have sitemaps for your site you can add the link with the following line.
Make sure you put the full URL of the sitemap (e.g. http://www.example.com/sitemaps.xml):

Sitemap: http://www.example.com/sitemaps.xml
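If you want to check which sitemap a site announces, the Sitemap line can be pulled out of a robots.txt with a few lines of Python (a sketch with made-up file contents):

```python
# Pull the Sitemap line(s) out of a robots.txt file (illustrative sketch).
robots_txt = """\
User-agent: *
Disallow: /tmp
Sitemap: http://www.example.com/sitemaps.xml
"""

sitemaps = [
    line.split(":", 1)[1].strip()
    for line in robots_txt.splitlines()
    if line.lower().startswith("sitemap:")
]
print(sitemaps)  # ['http://www.example.com/sitemaps.xml']
```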
You can add multiple Sitemap lines if you have multiple sitemap files.

RFC
For reference the official RFC can be found here.