
Sitemaps

Publish date 24/04/2008

Sitemaps are XML files that inform search engines about the pages on your site.
A sitemap can carry extra information about each page: when it was last updated, how often it changes and how important it is relative to other pages on your site.
This allows search engines to crawl and index your site more effectively.

The file

Usually it is named sitemaps.xml and placed in the root directory. (But you can give it another name and point to it in robots.txt.)
It must be written in UTF-8 and cannot be bigger than 10MB (10 485 760 bytes) or contain more than 50 000 URLs. You can compress it using gzip to save bandwidth, but uncompressed it still can't be bigger than 10MB.
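
The limits above can be checked before you upload the file. The following Python sketch is only an illustration: the function name check_sitemap_bytes is made up for this example, the limits are the ones quoted above, and counting <loc> tags is a rough approximation of counting URLs.

```python
import gzip

MAX_BYTES = 10 * 1024 * 1024  # 10 485 760 bytes, the limit quoted above
MAX_URLS = 50_000             # URL limit quoted above


def check_sitemap_bytes(data: bytes) -> list[str]:
    """Return a list of problems found in raw sitemap bytes (empty if none)."""
    problems = []
    # A gzip stream starts with the magic bytes 1f 8b. The size limit
    # applies to the uncompressed file, so decompress first.
    if data[:2] == b"\x1f\x8b":
        data = gzip.decompress(data)
    if len(data) > MAX_BYTES:
        problems.append(f"too large: {len(data)} bytes")
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("not valid UTF-8")
        return problems
    # Counting <loc> tags is a rough stand-in for counting URLs.
    url_count = text.count("<loc>")
    if url_count > MAX_URLS:
        problems.append(f"too many URLs: {url_count}")
    return problems
```

An empty list means none of the three checks failed; it does not prove that the file is a valid sitemap.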

If you need to list more pages than that, you can split the sitemap over several files.
See the Sitemap Index section below.

 

The format

Below is a basic example of a sitemap:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2008-03-24</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://www.example.com/products.html</loc>
    <changefreq>monthly</changefreq>
  </url>
  <url>
    <loc>http://www.example.com/detail.php?category=2&amp;ln=54</loc>
    <lastmod>2008-03-10T12:00:00+00:00</lastmod>
    <priority>0.4</priority>
  </url>
</urlset>
As with all XML files, all data values must use entity escape codes for the following special characters:

Character      Symbol   Escaped code
Ampersand      &        &amp;
Single quote   '        &apos;
Double quote   "        &quot;
Greater than   >        &gt;
Less than      <        &lt;
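
If you generate the file from a script, you do not have to escape by hand. As a small sketch in Python (the helper name escape_for_sitemap is invented for this example), the standard library function xml.sax.saxutils.escape covers the five characters listed above:

```python
from xml.sax.saxutils import escape

# escape() handles &, < and > by default; the two quote characters
# are added through the optional entities mapping.
QUOTE_ENTITIES = {"'": "&apos;", '"': "&quot;"}


def escape_for_sitemap(value: str) -> str:
    """Entity-escape a value before placing it inside a sitemap tag."""
    return escape(value, QUOTE_ENTITIES)


print(escape_for_sitemap("http://www.example.com/detail.php?category=2&ln=54"))
# → http://www.example.com/detail.php?category=2&amp;ln=54
```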

The available tags are described below:
Tag   Description
<urlset> required Encapsulates the file and references the protocol standard in use.
<url> required Parent tag for each URL entry. The remaining tags below are children of this tag.
<loc> required URL of the page. Must begin with the protocol (such as http://) and end with a trailing slash if your web server requires one.
<lastmod> optional Date of the last modification of this page (or file). This should be in W3C Datetime format, for example 2008-03-24 or 2008-03-10T12:00:00+00:00.
<changefreq> optional How frequently this page is likely to change. Valid values are:
  • always
  • hourly
  • daily
  • weekly
  • monthly
  • yearly
  • never
The value always should be used for documents that change each time they are visited. The value never is for archived URLs.
<priority> optional The priority of this URL relative to other URLs on your site, between 0.0 and 1.0.
1.0 is for your most important pages.
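
To show how the tags fit together, here is a minimal Python sketch that writes a sitemap with the standard xml.etree.ElementTree module. The function name build_sitemap and the dict-based input are assumptions made for this example, not part of the sitemap protocol.

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages: list[dict]) -> bytes:
    """Build a sitemap from dicts with a 'loc' key and optional other tags."""
    urlset = ET.Element("urlset", xmlns=NS)
    for page in pages:
        if "loc" not in page:
            raise ValueError("every page needs a 'loc' value")
        url = ET.SubElement(urlset, "url")
        # <loc> is required; the other tags are written only when given.
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in page:
                ET.SubElement(url, tag).text = str(page[tag])
    # ElementTree escapes &, < and > in text when serialising.
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)


sitemap = build_sitemap([
    {"loc": "http://www.example.com/", "lastmod": "2008-03-24",
     "changefreq": "weekly", "priority": 0.8},
    {"loc": "http://www.example.com/products.html", "changefreq": "monthly"},
])
print(sitemap.decode("utf-8"))
```

The sketch does not validate the values themselves, so a misspelled changefreq would be written as given.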

 

Sitemap Index

There is a limit on how many pages you can list in one sitemap file: 50 000 URLs or 10MB (10 485 760 bytes).
You can use multiple files instead, but then you need to tell search engines where those files can be found. That is what the sitemap index file is for:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
     <loc>http://www.example.com/sitemap1.xml.gz</loc>
     <lastmod>2008-02-14T18:31:17+00:00</lastmod>
  </sitemap>
  <sitemap>
     <loc>http://www.example.com/example/sitemap2.xml</loc>
     <lastmod>2008-03-20</lastmod>
  </sitemap>
</sitemapindex>
It must also be UTF-8 encoded, and cannot list more than 1000 sitemaps or be larger than 10MB.
You can only specify sitemaps on the same site, not on other sites.

Tag   Description
<sitemapindex> required Encapsulates information about all the sitemaps in the file.
<sitemap> required Encapsulates information about one individual sitemap in the file.
<loc> required Location of the sitemap file.
<lastmod> optional Identifies when the sitemap file was last modified.
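
Splitting a long URL list and writing the matching index file could look roughly like the Python sketch below. The names chunk_urls and build_index and the sitemapN.xml file names are made up for illustration; the 50 000 limit is the one quoted above.

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # per-file URL limit quoted above


def chunk_urls(urls: list[str], size: int = MAX_URLS) -> list[list[str]]:
    """Split a URL list into pieces that each fit in one sitemap file."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_index(sitemap_urls: list[str]) -> bytes:
    """Build a sitemap index that points at the given sitemap files."""
    index = ET.Element("sitemapindex", xmlns=NS)
    for loc in sitemap_urls:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = loc
    return ET.tostring(index, encoding="UTF-8", xml_declaration=True)


# Hypothetical site with 120 001 pages: three sitemap files are needed.
pages = [f"http://www.example.com/page{i}.html" for i in range(120_001)]
chunks = chunk_urls(pages)
locations = [f"http://www.example.com/sitemap{n}.xml"
             for n in range(1, len(chunks) + 1)]
print(len(chunks), "sitemap files")
```

Each chunk would still have to be written to its own sitemap file, and the 10MB size limit checked separately.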

 

Let it be known

When you have uploaded the file to your server, you need to make sure that search engine crawlers can find it.

robots.txt

The easiest way is to add it to your robots.txt file.
Simply add the following line:

Sitemap: sitemaplocation
Replace sitemaplocation with the full URL of the sitemap (for example http://www.example.com/sitemaps.xml).
You can add multiple lines if you have multiple sitemap files.
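
To double-check what a crawler would see, you can read the Sitemap lines back out of your robots.txt. The Python sketch below is a simplified illustration (the function name sitemap_lines is invented here, and real crawlers may parse the file differently):

```python
def sitemap_lines(robots_txt: str) -> list[str]:
    """Return the URLs from all 'Sitemap:' lines in robots.txt content."""
    found = []
    for line in robots_txt.splitlines():
        # Drop comments, then split on the first colon only, so the
        # colon in "http://" stays part of the URL.
        line = line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap":
            found.append(value.strip())
    return found


example = """User-agent: *
Disallow: /private/
Sitemap: http://www.example.com/sitemaps.xml
"""
print(sitemap_lines(example))
# → ['http://www.example.com/sitemaps.xml']
```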

Submitting

You can also submit it directly to a search engine.

Submitting via an HTTP request

This is for the more tech-savvy. You can submit it directly to a search engine using an HTTP request. This can be done with wget, curl or any other program that can make HTTP requests. A successful request will return an HTTP 200 response code.
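
The protocol at www.sitemaps.org describes the request as a URL of the form <searchengine_URL>/ping?sitemap=sitemap_url, with the sitemap URL percent-encoded. The Python sketch below only builds such a URL. The host searchengine.example is a placeholder and the function name ping_url is invented for this example; check the documentation of each search engine to see whether it accepts this kind of request and at which address.

```python
from urllib.parse import urlencode


def ping_url(search_engine_base: str, sitemap_url: str) -> str:
    """Build a ping URL; the sitemap URL is percent-encoded as a query value."""
    query = urlencode({"sitemap": sitemap_url})
    return f"{search_engine_base.rstrip('/')}/ping?{query}"


# searchengine.example is a placeholder, not a real submission endpoint.
print(ping_url("http://searchengine.example",
               "http://www.example.com/sitemaps.xml"))
# → http://searchengine.example/ping?sitemap=http%3A%2F%2Fwww.example.com%2Fsitemaps.xml
```

The resulting URL can then be requested with wget, curl or any other HTTP client.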

 

A complete reference can be found at www.sitemaps.org

 
