Monday, September 15, 2008

Use Robots.txt, Save the World

Robots.txt helps the search engines learn all about your website

There is growing interest in a little-known file that every website should have in its root directory: robots.txt.
It’s a very simple text file, and you can learn all about it at the robotstxt.org website.
Why should you use it? Here are some good reasons to consider.
Controlled access to your content

With a robots.txt file you can “ask” the search engines to “keep out” of certain areas of your website. A typical area you might want to exclude is your images folder: unless you are a photographer or a painter, your images are probably meant for your website only, and chances are you don’t want them indexed and showing up on image search engines for people to download or hotlink.
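A minimal sketch of such a file, assuming your images live in a folder named /images/ (substitute your own path):

    User-agent: *
    Disallow: /images/

The first line addresses all robots; the second asks them to stay out of the /images/ directory and everything beneath it.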
Unfortunately, grabbers and similar software (such as email harvesting applications) will not read your robots.txt file and will disregard any indication you provide there. But that’s life, isn’t it: there is always someone being disrespectful, to say the least.
You can keep search engines away from content you wish to keep out of sight, but remember that your robots.txt file is public and also attracts hackers looking for sensitive paths you might inadvertently list: you could be keeping out the robots while inviting the hackers. Keep this in mind.
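To see why, consider this purely hypothetical rule: it keeps well-behaved crawlers out, but it also tells anyone who opens the file exactly where the sensitive area is:

    User-agent: *
    Disallow: /private-admin/

Anything truly confidential belongs behind authentication, not merely hidden from the spiders.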

The growing importance of robots.txt

At SES New York, a robots.txt summit was held in which the major search engines (Ask, Google, Microsoft, Yahoo!) participated and shared interesting information about this file. Here are some numbers, according to Keith Hogan from Ask:

i) Less than 35% of websites have a robots.txt file
ii) The majority of robots.txt files are copied from others found online
iii) On many occasions robots.txt files are provided by your web hosting service

It looks like the majority of webmasters aren’t familiar with this file. That will matter more and more as the web continues to grow: spidering is a costly effort that search engines try to optimize, and websites that make crawling efficient will be rewarded.

During the summit, all the search engines announced that they will identify (or autodiscover) sitemaps via the robots.txt file. In essence, search engines are now able to discover your sitemap via a line in the following format:

Sitemap: <sitemap_location>, where <sitemap_location> is the complete URL of your Sitemap Index File (or of your sitemap file, if you don’t have an index file).
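For example, with example.com standing in for your own domain, the line would read:

    Sitemap: http://www.example.com/sitemap_index.xml

The directive can be placed anywhere in the file.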
Staying compliant with Google’s Terms of Service

Robots.txt can also help prevent you from being banned or penalized by Google. In a move to eliminate search results pages from its index, because “web search results don’t add value to users,” Google recently added the following sentence to its terms of service:

- Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don’t add much value for users coming from search engines.
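As a sketch, assuming your internal search results are served under a /search path (substitute whatever path your platform actually uses), the corresponding rule would be:

    User-agent: *
    Disallow: /search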

How to implement a robots.txt file

If your website doesn’t use a sitemap and you have no areas to exclude, place an empty robots.txt file in your root directory. By doing so you are explicitly allowing full spidering of your entire site.
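An empty file works, but you can also say the same thing explicitly: an empty Disallow rule means nothing is off limits.

    User-agent: *
    Disallow: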
Carefully review the robots exclusion protocol documented at robotstxt.org. If you must exclude numerous areas of your website, build your file step by step, as sketched below, and monitor spider behaviour with a log analyser tool.
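A file grown this way might end up looking like the following, with hypothetical paths standing in for your own:

    User-agent: *
    Disallow: /images/
    Disallow: /cgi-bin/
    Disallow: /tmp/

Add one rule at a time and check your logs to confirm the spiders honour it before adding the next.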
Test your robots.txt file with a few online tools, and keep in mind that every spider has its own behaviour and crawling criteria.

Avoid useless spidering traffic

When your website grows to a significant size and achieves good visibility, spidering increases to hundreds (if not thousands) of hits per day and will put your server and bandwidth to the test.
Recently I was called in to examine a blog burdened by very unusual and extremely heavy spidering activity: the log file I examined reported in excess of 8 Gbytes of invisible (spider) traffic over a one-month period. Given the small number of daily visitors (fewer than 200) and the small size of the blog (fewer than 100 posts), something was wrong in the architecture.
It took just a few minutes to identify the problem: there was no robots.txt file.
Each request for robots.txt was redirected to the home page of the blog, triggering a complete download of the page, approximately 250 K each time. These thousands of unnecessary hits on the home page caused a spidering frenzy that ceased as soon as an empty robots.txt file was created and uploaded to the server. Traffic is now down from 8 Gbytes to 500 Mbytes.

Keep the spiders informed, help save the world

The web is growing by leaps and bounds. Using a robots.txt file helps the search engines allocate their resources effectively, and it is a tangible sign of respect and courtesy. If you don’t have a robots.txt file on your website, set one up now. Use it to tell the crawlers how your site is organized and how often it changes. I think we should all do our part to avoid wasting resources, saving energy and helping to save the world.
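Putting the pieces of this article together, a complete minimal file might look like this (the domain and all paths are placeholders for your own):

    User-agent: *
    Disallow: /images/
    Disallow: /search

    Sitemap: http://www.example.com/sitemap_index.xml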

----

By Sante J. Achille
