Robots.txt
Publish date: 23/04/2008
robots.txt is a file normally only used by crawlers, bots or spiders. These are computer programs used by search engines like Google, Yahoo! and MSN Search to scan the internet for interesting pages, so they can index those pages and show them as search results.
This is good, but sometimes you don't want things to be indexed.
For example the directory where your pics are, or perhaps your admin area?
With robots.txt you can tell this to the search engines.

Location
robots.txt must be placed in the root directory of your site.
Any other location is not valid.

http://www.example.com/robots.txt       OK
http://www.example.com/site/robots.txt  NOT OK

If you don't have access to your root directory you can use the robots meta-tag.
Content
In its basic form you can add the following 2 lines.
This tells all robots to exclude the entire site from indexing.
User-agent: *
Disallow: /
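If you have Python handy, you can sanity-check rules like these with the standard library's urllib.robotparser before publishing them. A quick sketch (www.example.com is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# Feed the two-line robots.txt directly instead of downloading it.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /",
])

# Every URL on the site is now off-limits, for every robot.
print(rp.can_fetch("Googlebot", "http://www.example.com/index.html"))  # False
print(rp.can_fetch("Slurp", "http://www.example.com/"))                # False
```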
Of course no webmaster really wants this.

With User-agent: you can identify which robot can do what. The biggest names are Googlebot, MSNBot and Slurp (Yahoo!), but there are many more. A full list can be found at http://www.robotstxt.org/db.html
But mostly you will use User-agent: *, which means all robots.

Folders
With Disallow: you can block out entire directories (folders).
A realistic example:

User-agent: *
Disallow: /tmp
Disallow: /cgi-bin
Disallow: /folder/images

Note: The URLs are case-sensitive! So Disallow: /tmp is not the same as Disallow: /TMP.
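You can see both the prefix matching and the case-sensitivity with urllib.robotparser (a sketch with a made-up site):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /tmp",
])

# Disallow is a prefix match on the path, and case matters:
print(rp.can_fetch("*", "http://www.example.com/tmp/cache.html"))  # False: under /tmp
print(rp.can_fetch("*", "http://www.example.com/TMP/cache.html"))  # True: /TMP is not /tmp
```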
Files
You can also specify files:
Again it is case-sensitive.
User-agent: *
Disallow: /secret/secretfile.html

Some crawlers also let you use wildcards:

User-agent: *
Disallow: /*_print*.html
Disallow: /*?sessionid

Note: A trailing * is not needed. It is always assumed.
- The first rule disallows html files that have "_print" in the file name or in a directory name. Example: /card_print.html or /store_print/index.html. (But it does allow /card_print.php.)
- The second rule disallows files with "?sessionid" in them. Example: /cart.php?sessionid=344ba33

If you don't want that prefix behaviour, some crawlers also let you use $, which anchors the end of a URL (Slurp, Yahoo!'s crawler, allows this).
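The * and $ semantics described above can be sketched as a translation to regular expressions. This is only an illustration of the matching rules, not any crawler's actual code:

```python
import re

def pattern_to_regex(pattern):
    # Translate a robots.txt path pattern into a regular expression:
    # '*' matches any run of characters, '$' anchors the end of the URL,
    # and everything else is a prefix match (an implied trailing '*').
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.compile(regex)

print(bool(pattern_to_regex("/*_print*.html").match("/card_print.html")))  # True
print(bool(pattern_to_regex("/*_print*.html").match("/card_print.php")))   # False
print(bool(pattern_to_regex("/*.gif$").match("/images/logo.gif")))         # True
print(bool(pattern_to_regex("/*.gif$").match("/images/logo.gif?x=1")))     # False
```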
The following disallows all gif files from being indexed:
User-agent: Slurp
Disallow: /*.gif$

Allow
Until now we only used Disallow:, but there is also Allow:. It is not used much, because when a crawler doesn't find a match in robots.txt it assumes that the URL is allowed.
This allows for a more secure setup. Yes, the bad guys know this too, and what can be more interesting than the files that are not allowed?
Example:

User-agent: *
Disallow: /org/plans.html
Allow: /org/
Allow: /public
Disallow: /

Important to know is that crawlers stop when they find a match. So if you would start with Disallow: /, nothing would get indexed no matter what you put below it, because everything starts with /.
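The first-match behaviour of the example above can be sketched with a tiny matcher. This is just an illustration of the rule order, assuming simple prefix matching (real crawlers differ in details):

```python
# The rules from the example, in order.
RULES = [
    ("Disallow", "/org/plans.html"),
    ("Allow", "/org/"),
    ("Allow", "/public"),
    ("Disallow", "/"),
]

def allowed(path):
    # Stop at the first rule whose path is a prefix of the URL path.
    for verb, rule_path in RULES:
        if path.startswith(rule_path):
            return verb == "Allow"
    return True  # no match means allowed

print(allowed("/org/plans.html"))  # False: first rule wins
print(allowed("/org/about.html"))  # True:  matches Allow: /org/
print(allowed("/public_html/x"))   # True:  matches Allow: /public
print(allowed("/index.html"))      # False: falls through to Disallow: /
```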
For this example only the files in /org/ and files starting with /public will get indexed, except /org/plans.html.

Sitemaps
If you have sitemaps for your site you can add the link with the following line.
Make sure you put the full URL of the sitemap (e.g. http://www.example.com/sitemaps.xml):

Sitemap: http://www.example.com/sitemaps.xml
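If you want to check which sitemap a site announces, the Sitemap line can be pulled out of a robots.txt with a few lines of Python (a sketch with made-up file contents):

```python
# Pull the Sitemap line(s) out of a robots.txt file (illustrative sketch).
robots_txt = """\
User-agent: *
Disallow: /tmp
Sitemap: http://www.example.com/sitemaps.xml
"""

sitemaps = [
    line.split(":", 1)[1].strip()
    for line in robots_txt.splitlines()
    if line.lower().startswith("sitemap:")
]
print(sitemaps)  # ['http://www.example.com/sitemaps.xml']
```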
You can add multiple Sitemap lines if you have multiple sitemap files.

RFC
For reference the official RFC can be found here.