For me, the most common use of robots.txt is to prevent search engine robots from scanning a site while it is still in development. To do that, create a simple text file called robots.txt with the following contents:
User-agent: *
Disallow: /
Once your website goes live, the robots.txt file should no longer be used to disallow indexing; use Meta Robots for that instead. I will, however, still use robots.txt to tell the spiders where to find the sitemap file:
Sitemap: http://www.yourDomain.com/sitemap.xml
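If you don't already have a sitemap file, a minimal one is just an XML document that lists the URLs of your pages. Here is a rough sketch in the standard sitemaps.org format, using the same placeholder domain as above (the about.html entry is only an example):
<?xml version="1.0" encoding="UTF-8"?>
<!-- Minimal sitemap: one <url> entry per page you want the spiders to find -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.yourDomain.com/</loc>
  </url>
  <url>
    <loc>http://www.yourDomain.com/about.html</loc>
  </url>
</urlset>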
Check out http://www.robotstxt.org/robotstxt.html if you want to get fancier with your use, such as asking robots to skip only a section of your site (asking, really, since you can't actually prevent anyone from viewing your public site), or asking only certain robots not to visit.
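As a sketch of what that can look like, the robots.txt below asks all robots to skip one directory and asks one particular robot to stay out entirely; the directory and robot names here are placeholders only:
# All robots: please skip the /private/ directory
User-agent: *
Disallow: /private/

# One specific robot: please stay out of the whole site
User-agent: BadBot
Disallow: /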
See also: information about site maps, and an online generator.
You can request specific actions from the search engine spiders by including a meta robots tag in the head section of each web page. Example:
<meta name="robots" content="noindex, nofollow, noarchive">
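For placement, the tag simply goes inside the head element of the page; a minimal sketch (the title is just a placeholder):
<head>
  <title>Example page</title>
  <!-- Ask robots not to index this page, not to follow its links, and not to keep a cached copy -->
  <meta name="robots" content="noindex, nofollow, noarchive">
</head>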
Valid content keywords are: