Robots.txt
Publish date: 23/04/2008
robots.txt is a file normally only used by crawlers, bots or spiders. These are computer programs used by search engines like Google, Yahoo!, MSN Search, ... to search the internet for interesting pages, so they can index these pages and show them as search results.
This is good, but sometimes you don't want things to be indexed.
For example the directory where your pics are, or perhaps your admin area?
With robots.txt you can tell this to the search engines.
robots.txt must be placed in the root directory of your site. Any other location is not valid.
http://www.example.com/robots.txt OK
http://www.example.com/site/robots.txt NOT OK
If you don't have access to your root-directory you can use the robots meta-tag.
In its basic form you can add the following 2 lines. This tells all robots to exclude the entire site from indexing.

User-agent: *
Disallow: /
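To see what these two lines mean to a crawler, here is a minimal sketch using Python's standard-library robots.txt parser; "MyBot" is just a made-up user agent for the test.

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.modified()  # mark the rules as freshly loaded so can_fetch() gives real answers
rp.parse([
    "User-agent: *",
    "Disallow: /",
])

# Every URL starts with "/", so everything is off-limits:
print(rp.can_fetch("MyBot", "http://www.example.com/"))            # False
print(rp.can_fetch("MyBot", "http://www.example.com/index.html"))  # False
```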
Of course no webmaster really wants this.
With User-agent: you can identify which robot can do what.
The biggest names are Googlebot, MSNBot, Slurp (Yahoo!) but there are many more. A full list can be found at http://www.robotstxt.org/db.html
But mostly you will use User-agent: *, which means all robots.
With Disallow: you can block out entire directories (folders).
A realistic example:

User-agent: *
Disallow: /tmp
Disallow: /cgi-bin
Disallow: /folder/images

Note: The URLs are case-sensitive! So Disallow: /tmp is not the same as Disallow: /TMP.
You can also specify files. Again it is case-sensitive:

User-agent: *
Disallow: /secret/secretfile.html
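A quick sketch of how a standards-following crawler reads this rule: matching is a plain, case-sensitive prefix comparison on the path. "MyBot" is a made-up user agent for the test.

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.modified()  # mark the rules as loaded so can_fetch() gives real answers
rp.parse([
    "User-agent: *",
    "Disallow: /secret/secretfile.html",
])

print(rp.can_fetch("MyBot", "http://www.example.com/secret/secretfile.html"))  # False
# Different case means a different URL, so this one slips through:
print(rp.can_fetch("MyBot", "http://www.example.com/secret/SecretFile.html"))  # True
print(rp.can_fetch("MyBot", "http://www.example.com/secret/other.html"))       # True
```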
Some crawlers also let you use wildcards:

User-agent: *
Disallow: /*_print*.html
Disallow: /*?sessionid
Note: A trailing * is not needed. It is always assumed.
- The above rules disallow html files whose name or directory contains "_print".
Example: /card_print.html or /store_print/index.html. (But /card_print.php is still allowed.)
- They also disallow files with "?sessionid" in the URL.
If you don't want that implied trailing *, some crawlers also let you use $, which anchors the end of a URL (Yahoo!'s Slurp allows this). The following disallows all gif files from being indexed:

User-agent: Slurp
Disallow: /*.gif$
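Python's standard-library parser follows the original standard and does not understand the * and $ extensions, so here is a small sketch of how a crawler that does support them could translate such a pattern into a regular expression. The helper names are my own, not part of any library.

```python
import re

def robots_pattern_to_regex(pattern):
    # '$' at the end of the pattern anchors the match to the end of the URL;
    # otherwise a trailing '*' is implied, so no anchor is needed.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # '*' matches any run of characters; everything else is literal.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(regex + ("$" if anchored else ""))

def blocked_by(pattern, path):
    # Patterns are matched from the start of the path, like a prefix.
    return robots_pattern_to_regex(pattern).match(path) is not None

print(blocked_by("/*_print*.html", "/card_print.html"))         # True
print(blocked_by("/*_print*.html", "/store_print/index.html"))  # True
print(blocked_by("/*_print*.html", "/card_print.php"))          # False
print(blocked_by("/*.gif$", "/pics/logo.gif"))                  # True
print(blocked_by("/*.gif$", "/pics/logo.gif.html"))             # False, '$' anchors the end
```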
Until now we only used Disallow:, but there is also Allow:.
It is not used that much, because when a crawler doesn't find a match in robots.txt it assumes that it is allowed.
Relying on that default also makes for a more secure setup. Yes, the bad guys know robots.txt too, and what can be more interesting than the files that are not allowed?
Important to know is that crawlers stop when they find a match. So if you would start with Disallow: /, nothing would get indexed no matter what you put below it, because every URL starts with /.

Example:

User-agent: *
Disallow: /org/plans.html
Allow: /org/
Allow: /public
Disallow: /

For this example only the files in /org/ and files starting with /public will get indexed, except /org/plans.html.
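The first-match rule described above can be sketched in a few lines; the function name is my own, and this follows the article's first-match behaviour (some modern crawlers instead pick the most specific matching rule).

```python
def allowed(rules, path):
    # rules: ordered list of ("Allow" | "Disallow", prefix).
    # Walk the rules in order and stop at the first prefix that matches.
    for verb, prefix in rules:
        if path.startswith(prefix):
            return verb == "Allow"
    return True  # no rule matched: indexing is allowed by default

rules = [
    ("Disallow", "/org/plans.html"),
    ("Allow", "/org/"),
    ("Allow", "/public"),
    ("Disallow", "/"),
]

print(allowed(rules, "/org/plans.html"))    # False, first rule matches
print(allowed(rules, "/org/about.html"))    # True, "Allow: /org/" matches first
print(allowed(rules, "/public/info.html"))  # True
print(allowed(rules, "/index.html"))        # False, only "Disallow: /" matches
```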
If you have sitemaps for your site you can add the link with the following code. Make sure you put the full URL of the sitemap. (http://www.example.com/sitemaps.xml)

Sitemap: http://www.example.com/sitemaps.xml

You can add multiple Sitemap: lines if you have multiple sitemap-files.
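As a sketch: Python's standard-library parser (3.8 and newer) also picks up Sitemap: lines, wherever they appear in the file. The URLs below are just the placeholders from the example.

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /tmp",
    "Sitemap: http://www.example.com/sitemaps.xml",
    "Sitemap: http://www.example.com/sitemaps2.xml",  # a second, hypothetical sitemap file
])

# site_maps() returns the collected URLs in file order (or None if there were none):
print(rp.site_maps())
# ['http://www.example.com/sitemaps.xml', 'http://www.example.com/sitemaps2.xml']
```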
For reference, the official RFC draft can be found at http://www.robotstxt.org/norobots-rfc.txt.