The robots.txt file implements the Robots Exclusion Standard, a voluntary convention that tells web crawlers/robots which files and directories you want them to stay OUT of on your site. Not all crawlers/bots follow the exclusion standard; some will continue crawling your site anyway. I like to call them "Bad Bots" or trespassers. We block them by IP exclusion, which is another story entirely.
This is a very simple overview of robots.txt basics for webmasters. For a complete and thorough lesson, visit http://www.robotstxt.org/
A somewhat standard robots.txt file, in the proper format, appears directly below. The file must sit at the root of the domain because that is where the crawlers expect it to be, not in some secondary directory.
Below is the proper format for a robots.txt file ----->
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /group/

User-agent: msnbot
Crawl-delay: 10

User-agent: Teoma
Crawl-delay: 10

User-agent: Slurp
Crawl-delay: 10

User-agent: aipbot
Disallow: /

User-agent: BecomeBot
Disallow: /

User-agent: psbot
Disallow: /
--------> End of robots.txt file
This tiny text file is saved as a plain text document and ALWAYS with the name "robots.txt" in the root of your domain.
A quick review of the robots.txt file above follows. "User-agent: msnbot" names the crawler from MSN, Slurp is from Yahoo and Teoma is from AskJeeves. The others listed are "Bad" bots that crawl very fast and to nobody's benefit but their own, so we ask them to stay out entirely. The * asterisk is a wild card meaning that "All" crawlers/spiders/bots should stay out of the files or directories listed in that group.
Bots given the instruction "Disallow: /" should stay out entirely. Those given "Crawl-delay: 10" are the ones that crawled our site too quickly, bogging it down and overusing server resources. Google crawls more slowly than the others and doesn't require that instruction, so it is not specifically listed in the above robots.txt file. The Crawl-delay instruction is only needed on very large sites with hundreds or thousands of pages. The wildcard asterisk * applies to every crawler, bot and spider that is not named in a group of its own, including Googlebot. A crawler that does find a group naming it will generally follow only that group, so repeat any Disallow lines there if they should apply to that crawler as well.
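To check how the rules above are read, here is a minimal sketch using Python's standard-library urllib.robotparser. It shows only how that one parser interprets the file; real crawlers may behave differently. The paths /images/logo.gif and /index.html are hypothetical and used only for illustration.

```python
from urllib.robotparser import RobotFileParser

# The example file from this article, with a blank line between groups.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /group/

User-agent: msnbot
Crawl-delay: 10

User-agent: Teoma
Crawl-delay: 10

User-agent: Slurp
Crawl-delay: 10

User-agent: aipbot
Disallow: /

User-agent: BecomeBot
Disallow: /

User-agent: psbot
Disallow: /
"""


def load_rules(text):
    """Parse robots.txt text and return a ready-to-query parser."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# Hypothetical paths, chosen only to exercise each group of rules.
checks = [
    ("Googlebot", "/images/logo.gif"),  # falls under the * group
    ("Googlebot", "/index.html"),
    ("psbot", "/index.html"),           # shut out by "Disallow: /"
    ("msnbot", "/images/logo.gif"),     # has its own group
]
for agent, path in checks:
    print(agent, path, "allowed" if rules.can_fetch(agent, path) else "blocked")

print("msnbot crawl delay:", rules.crawl_delay("msnbot"))
```

With this parser, Googlebot is blocked from /images/ through the * group and psbot is blocked everywhere. msnbot is reported as allowed into /images/ because its own group carries only the Crawl-delay line, which is worth knowing if you expected the * rules to cover it too.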
The crawlers we gave the "Crawl-delay: 10" instruction were requesting as many as 7 pages every second, so we asked them to slow down. The number is in seconds, and you can change it to suit your server capacity, based on their crawling rate. Ten seconds between page requests is far more leisurely and stops them from asking for more pages than your server can dish up.
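To put those two rates side by side, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that a crawler keeps up its rate around the clock, which real crawlers rarely do.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def max_pages_per_day(crawl_delay_seconds):
    """Most pages a crawler can request in a day if it waits this long between requests."""
    return SECONDS_PER_DAY // crawl_delay_seconds


# Assumption: the crawler sustains its rate for the whole day.
unthrottled = 7 * SECONDS_PER_DAY      # 7 pages every second
throttled = max_pages_per_day(10)      # Crawl-delay: 10

print("Pages per day at 7 per second:   ", unthrottled)
print("Pages per day with Crawl-delay 10:", throttled)
print("Reduction factor:", unthrottled // throttled)
```

Under that assumption the delay cuts the ceiling from 604,800 requests a day to 8,640, a factor of 70.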
(You can discover how fast robots and spiders are crawling by looking at your raw server logs, which show pages requested by precise times to within a hundredth of a second. They are available from your web host, or ask your web or IT person. If you have server access, the logs can usually be found in the root directory, and you can usually download compressed log files by calendar day right off your server. You'll need a utility that can expand compressed files to open and read those plain text raw server log files.)
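As a rough illustration of reading crawl rates out of a log, the sketch below counts the busiest second for each user agent. It assumes the log uses the common Apache "combined" format with one-second timestamps; your host's format may differ. The sample lines, IP addresses and the bot names FastBot and SlowBot are made up for the example.

```python
import re
from collections import Counter

# Assumed layout (Apache "combined" format):
# ip - - [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'\[(?P<time>[^\]]+)\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def peak_requests_per_second(lines):
    """Return {user_agent: most requests seen within any single second}."""
    per_second = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match:  # silently skip lines that do not fit the assumed format
            per_second[(match["agent"], match["time"])] += 1
    peaks = {}
    for (agent, _second), hits in per_second.items():
        peaks[agent] = max(peaks.get(agent, 0), hits)
    return peaks


# Made-up sample lines, for illustration only.
SAMPLE_LOG = [
    '203.0.113.5 - - [17/Aug/2005:10:15:01 -0700] "GET /a.html HTTP/1.1" 200 512 "-" "FastBot/1.0"',
    '203.0.113.5 - - [17/Aug/2005:10:15:01 -0700] "GET /b.html HTTP/1.1" 200 734 "-" "FastBot/1.0"',
    '203.0.113.5 - - [17/Aug/2005:10:15:01 -0700] "GET /c.html HTTP/1.1" 200 291 "-" "FastBot/1.0"',
    '203.0.113.5 - - [17/Aug/2005:10:15:02 -0700] "GET /d.html HTTP/1.1" 200 120 "-" "FastBot/1.0"',
    '198.51.100.7 - - [17/Aug/2005:10:15:01 -0700] "GET /a.html HTTP/1.1" 200 512 "-" "SlowBot/2.0"',
    '198.51.100.7 - - [17/Aug/2005:10:15:11 -0700] "GET /b.html HTTP/1.1" 200 734 "-" "SlowBot/2.0"',
]

peaks = peak_requests_per_second(SAMPLE_LOG)
for agent, hits in sorted(peaks.items()):
    print(agent, "peaked at", hits, "requests in one second")
```

For a real log you would pass an open file instead of the sample list. If the file is gzip-compressed, opening it with gzip.open(path, "rt") from the standard library should let you read it line by line without expanding it first.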
To see the contents of any robots.txt file just type /robots.txt after any domain name. If the site has that file up, you will see it displayed as a text file in your web browser. Click on the link below to see that file for Amazon.com
http://www.Amazon.com/robots.txt
You can see the contents of any website's robots.txt file that way.
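The same lookup can be scripted. The sketch below builds the robots.txt address for whatever site a page belongs to and, optionally, downloads it. The page path /gp/help/page.html is hypothetical, and fetch_robots is defined but not called here because it needs a network connection.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen


def robots_url(page_url):
    """Return the robots.txt address at the root of the page's domain."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def fetch_robots(page_url, timeout=10):
    """Download robots.txt as text; urlopen raises an error if the file is missing."""
    with urlopen(robots_url(page_url), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Hypothetical page address; only the scheme and domain matter.
print(robots_url("http://www.Amazon.com/gp/help/page.html"))
```

Whatever page you start from, the path is thrown away and replaced with /robots.txt, which matches the rule that the file lives only at the root of the domain.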
The robots.txt shown above is what we currently use at Publish101 Web Content Distributor, just launched in May of 2005. We did an extensive case study and published a series of articles on crawler behavior and indexing delays known as the Google Sandbox. That Google Sandbox Case Study is highly instructive on many levels for webmasters everywhere about the importance of this often ignored little text file.
One thing we didn't expect to glean from our research into indexing delays (known as the Google Sandbox) was the importance of robots.txt files to quick and efficient crawling by the spiders from the major search engines. Nor did we expect the number of heavy crawls from bots that do no earthly good to the site owner, yet crawl most sites extensively, straining servers to the breaking point with requests coming as fast as 7 pages per second.
We discovered in our launch of the new site that Google and Yahoo will crawl the site whether or not you use a robots.txt file, but MSN seems to REQUIRE it before they will begin crawling at all. All of the search engine robots seem to request the file on a regular basis to verify that it hasn't changed.
Then when you DO change it, they will stop crawling for brief periods and repeatedly ask for that robots.txt file during that time without crawling any additional pages. (Perhaps they had a list of pages to visit that included the directory or files you have instructed them to stay out of and must now adjust their crawling schedule to eliminate those files from their list.)
Most webmasters instruct the bots to stay out of "images" directories and the "cgi-bin" directory, as well as any directories containing private or proprietary files intended only for users of an intranet or password protected sections of the site. Clearly, you should direct the bots to stay out of any private areas that you don't want indexed by the search engines. Bear in mind that robots.txt is itself publicly readable and is only a request, so truly private files still need password protection.
The importance of robots.txt is rarely discussed by average webmasters. I've even had webmasters at some of my client businesses ask me what it is and how to implement it when I tell them how important it is to both site security and efficient crawling by the search engines. This should be standard knowledge for webmasters at substantial companies, but it illustrates how little attention is paid to the use of robots.txt.
The search engine spiders really do want your guidance and this tiny text file is the best way to provide crawlers and bots a clear signpost to warn off trespassers and protect private property - and to warmly welcome invited guests, such as the big three search engines while asking them nicely to stay out of private areas.
Copyright © August 17, 2005 by Mike Banks Valentine
Google Sandbox Case Study http://publish101.com/Sandbox2 Mike Banks Valentine operates http://Publish101.com Free Web Content Distribution for Article Marketers and Provides content aggregation, press release optimization and custom web content for Search Engine Positioning http://www.seoptimism.com/SEO_Contact.htm