Robots Exclusion Standard
Also called: Robots Exclusion Protocol or robots.txt
The Robots Exclusion Protocol includes some capabilities to give search engines instructions on which pages within a Web site may and may not be indexed. These capabilities can be used when it is undesirable for certain pages to be included in search results.
The "robots" are usually search engine spiders, programs that constantly scour the Web for new information for search engines. There are also web robots for other purposes. Whether the instructions are followed according to the standard depends on the particular robot. Thus, the protocol offers no guarantees. The crawlers of most major search engines (such as Google and Bing) respect these standards.
A protocol that serves the opposite purpose, namely getting pages included in search engines, is the XML Sitemap.
Robots.txt
Robots.txt is a file stored in the root directory of a domain (domainname.co.uk/robots.txt) that tells search engines which locations within the Web site they may or may not crawl.
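Because the file always sits at the root of the domain, its address can be derived from any page URL on the same site. The following is a minimal sketch using Python's standard library; the helper name robots_txt_url and the page URL are assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page on the example domain used in the text.
location = robots_txt_url("https://domainname.co.uk/shop/item?id=7")
print(location)
```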
Example of a robots.txt file:
User-agent: *
Disallow: /cgi-bin/
Disallow: /admin/
In this example, all robots are instructed not to crawl locations within the /cgi-bin/ and /admin/ directories. The example also highlights a disadvantage of robots.txt: because the file is publicly readable, it can expose locations we would rather not draw attention to.
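How a well-behaved robot might interpret the example file can be sketched with the urllib.robotparser module from Python's standard library. The crawler name "ExampleBot" and the page URLs are hypothetical; a real crawler would download the file from the site instead of using an inline string.

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt from the text, inlined for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is a made-up crawler name; it falls under the "*" section.
admin_allowed = parser.can_fetch("ExampleBot", "https://domainname.co.uk/admin/login")
products_allowed = parser.can_fetch("ExampleBot", "https://domainname.co.uk/products/")

print(admin_allowed)     # False: the path starts with /admin/
print(products_allowed)  # True: no Disallow rule matches
```

The check is purely advisory: nothing prevents a robot from skipping it and requesting /admin/ anyway.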
Multiple sections and rules can be placed one below the other. For example, a section that applies only to Google's crawlers starts with the line "User-agent: googlebot". The prefix "Allow:" can also be used to create exceptions, that is, locations that may be accessed even though a broader "Disallow:" rule covers them.
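A sketch of a file with a Google-specific section and an "Allow:" exception, again checked with Python's urllib.robotparser. The /admin/help/ path and the crawler name "ExampleBot" are assumptions for illustration. Note that Python's parser applies the rules of a section in file order and stops at the first match, whereas Google picks the most specific matching rule; placing the Allow line before the broader Disallow line gives the same outcome under both interpretations.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: one section for Google's crawler, one for everyone else.
GROUPED_ROBOTS_TXT = """\
User-agent: googlebot
Allow: /admin/help/
Disallow: /admin/

User-agent: *
Disallow: /cgi-bin/
Disallow: /admin/
"""

grouped_parser = RobotFileParser()
grouped_parser.parse(GROUPED_ROBOTS_TXT.splitlines())

help_url = "https://domainname.co.uk/admin/help/faq.html"
users_url = "https://domainname.co.uk/admin/users"

# googlebot matches its own section, so the Allow exception applies.
googlebot_help = grouped_parser.can_fetch("googlebot", help_url)
googlebot_users = grouped_parser.can_fetch("googlebot", users_url)
# Any other robot falls back to the "*" section, which has no exception.
other_help = grouped_parser.can_fetch("ExampleBot", help_url)

print(googlebot_help, googlebot_users, other_help)  # True False False
```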
Robots.txt only indicates which locations should not be crawled by spiders. In theory, a search engine can still include such a location in its search results; it simply has no knowledge of the content of the page.
Another option for influencing spiders' behavior is the special robots meta tag. This HTML tag does not prevent spiders from retrieving the content of a page, but it gives more control over what then happens to the location and its content.
Example of a robots meta tag:
<meta name="robots" content="noindex,nofollow" />
This example prescribes that the location of the page may not be included in search results. Also, hyperlinks on the page may not be followed. Counterparts of "noindex" and "nofollow" are "index" (do include) and "follow" (do follow links).
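How a robot might read this tag can be sketched with Python's standard html.parser module. The class RobotsMetaParser, the function robots_directives and the sample page are assumptions made for illustration, not part of any standard.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # Directives are comma-separated and case-insensitive.
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def robots_directives(html: str) -> set:
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


# Hypothetical page containing the meta tag from the example.
page = '<html><head><meta name="robots" content="noindex,nofollow" /></head><body></body></html>'
directives = robots_directives(page)

may_index = "noindex" not in directives    # False for this page
may_follow = "nofollow" not in directives  # False for this page
print(sorted(directives), may_index, may_follow)
```

A page without the tag yields an empty set, so both checks come out True, which matches the "index" and "follow" counterparts.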