Introduction to Robots.txt

Posted October 12th, 2005 by Mike Cherim

There’s a little file called robots.txt that you should be aware of. As its name suggests, it’s a file specifically for robots and spiders… the honest ones anyway. It’s a good file to put on your server, if for no other reason than to save some bandwidth. You may be wondering why. Well, I was, too, not very long ago. I became aware pretty quickly, though, when I encountered a genuine need to place this file in the root directory of all my servers’ domains. Recently my hosting company, GBHXonline.com, invested in some new servers. After setting up the stable of hosted sites and the ones I actually manage, I realized I was dealing with something new.

Namely, a newer version of PHP. Moreover, this version was running in strict mode. I could have changed the mode (heck, with Apache running PHP there’s not a lot you can’t do), but I took another route and started debugging (mostly PHP notices, not so much in the way of errors). Happily, the vast majority of my scripting was fine. In fact, there was very little I needed to do.

I accomplished all this by checking the server error logs domain by domain, crossing my “t”s and dotting my “i”s as I went. One thing bothered me, though: a recurring notice that appeared as spiders crawled the new locations. Apparently the spiders and robots kept looking for a root-level file called “robots.txt.” The file wasn’t on the server, so the ’bots were coming up empty-handed, and because they asked for a file they couldn’t get, the server dutifully mentioned it in its report.

The fix was easy enough: take a blank text file, name it “robots.txt,” and place it in the root of each domain (robots always request it from a fixed address, e.g. http://www.example.com/robots.txt). That’s all it took to silence the reporting and enjoy a completely empty error log. All was fine, but not for long. I wasn’t satisfied with my lazy fix. If I was going to put the file on the server, I figured I may as well restrict some directories and stop making the ’bots work so hard, especially since the bandwidth they consumed was at my expense.
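For the record, an empty file is perfectly valid; the ’bots ask, receive a zero-byte answer, and go on to crawl everything. If you’d rather make that “anything goes” policy explicit, the conventional allow-all file is only two lines; per the exclusion standard, an empty Disallow value puts nothing off-limits:

User-agent: *
Disallow: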

The idea behind a robots.txt file is to close off specific files and directories by telling the spiders not to crawl them. This is done by way of extremely simple directives. First, though, you need to address the spiders. Are you directing the GoogleBot alone, or do you want to direct all spiders? It’s typical to address all of them, and that’s what I did. Like so:

User-agent: *

“User-agent” addresses spiders and robots in general. The asterisk, or star, addresses all of them. Effectively that statement says “Hey robots! Yeah, all of you. Listen up, I have something to tell you.” If you had a thing for the GoogleBot and you wanted to address it only, you’d write the statement like this:

User-agent: googlebot

Now, the next part of this is to specify to the ’bot(s) which directories and/or files you want them to skip (or access, if you choose). Again, this is very simple. Let’s say, for example, that you want to explicitly allow access to a folder called /content/ but you want to keep robots out of your /admin/ folder; you’d add the following lines directly below the User-agent line shown above. (One caution: Allow wasn’t part of the original exclusion standard, so not every ’bot honors it, though major crawlers such as Google’s do.)

Allow: /content/
Disallow: /admin/

The whole file, when complete, would look like this (addressing all ‘bots):

User-agent: *
Allow: /content/
Disallow: /admin/

I said earlier that the robots.txt file saves me bandwidth. This is true, but only if it’s employed smartly. You see, if the ’bots and spiders are kept from crawling a specific file or folder, that file or folder isn’t transferred to the user-agent [1]. No use equals no consumption. For example, if you want to save a lot of bandwidth, disallow access to your /images/ folder.
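Assuming your graphics really do live in a folder named /images/ (adjust the path to match your own site), the complete file for that one job would look like this:

User-agent: *
Disallow: /images/

Bear in mind the trade-off, though: compliant image-search crawlers won’t index those graphics either.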

You can do more and get more specific. Below you’ll see examples of directory exclusions, exclusions of files within directories, root-level file exclusions, specific user-agent addressing, an allowance, and a complete root-level disallow directive. Note the blank lines: each User-agent record stands on its own, and records should be separated this way.

User-agent: *
Disallow: /admin/
Disallow: /images/
# This pound sign marks a robots.txt file comment.
# Put a comment on a line of its own; don't append it to a directive.
Disallow: /cgi-bin/
Disallow: /stats/
Disallow: /cms/
Disallow: /forms/
Disallow: /php/script.php
Disallow: /secret.htm
Disallow: /private.php
Disallow: /clients/login.php

# Below I address the GoogleBot, telling it
# to stay out of the folder and file shown.
User-agent: googlebot
Disallow: /google_envy/ms_2late.html

# In the next lines I tell another 'bot to skip every file
# and folder on the domain with "/". I do, however, want
# this 'bot to access one folder and file:
User-agent: robozilla
Allow: /content/hellozilla.htm
Disallow: /
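One detail worth knowing, straight from the original exclusion standard (though individual parsers vary, so verify for any crawler you care about): a robot obeys only the single record that best matches its name, not the User-agent: * record as well. Given the file above, googlebot would follow its own one-line record and ignore the general exclusions. If you want a named ’bot to honor those too, repeat them under its record, like so:

User-agent: googlebot
Disallow: /admin/
Disallow: /images/
Disallow: /google_envy/ms_2late.html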

I recently made a robots.txt file for GreenBeastCMS users (see related article). There’s no need to have robots indexing those files, so this was provided as a courtesy. The directories and files listed in that copy need to be added to the domain owner’s robots.txt file, or they can use that one outright, even adding some lines of their own. For the CMS this was sort of overkill, as the ’bots can’t actually reach any of those directories anyway; everything leads to the login page unless you’re logged in. Want to see another, larger example? Check out the massive robots.txt file for the White House, or check out Google’s (yeah, they have one too).
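I won’t reproduce the CMS file here, but the general shape of such a file is easy to sketch; note that the paths below are purely illustrative, not the actual GreenBeastCMS listing:

User-agent: *
Disallow: /cms/
Disallow: /cms/admin/
Disallow: /cms/login.php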

So, that’s about it. There are other ways to deal with robots, such as using a meta element in your pages’ heads, <meta name="robots" content="noindex, nofollow">, but the ’bot has to actually fetch the page to see that directive, so robots.txt seems a lot more effective at saving bandwidth, not to mention a lot easier to implement. (Strictly speaking, the two do different jobs: robots.txt discourages crawling, while the meta directive discourages indexing and link-following.) Moreover, robots.txt allows greater specificity: one file can address many ’bots and many paths at once.


Hopefully you can put this info to good use. Robots and spiders aren’t bad. Quite the contrary; they’re generally good. They just need some direction. Please note that the directives given to these web crawlers are not obligatory: they guide the “good” ’bots, which comply voluntarily. Google, for instance, prefers not to waste its time crawling directories that won’t be indexed anyway. The bad ’bots, though, the ones looking for exploitable data such as email addresses and other potentially sensitive information, will not be deterred. In other words, using robots.txt will not provide any measure of additional security. The purpose it does serve, however, is a good one.

[1] A user-agent is any web-access device. “Robots” and “spiders” (terms I use interchangeably), like web browsers such as Internet Explorer, Firefox, Opera, and Netscape, are all user-agents.


7 Responses to: “Introduction to Robots.txt”

  1. Martin Neczypor responds:
    Posted: October 15th, 2005 at 12:42 pm

    Interesting, I should get this going on my site, because I’ve noticed quite a few robots going through my images folder, which is quite full, so it might save me some bandwidth, even though I’m nowhere near my limit :-P . By the way, the design is looking quite nice: Is there any reason why you went with the dark grey for inactive inputs? I would think a green would be more…what’s the word… reasonable? No, that isn’t it, but do you know what I mean?

    By the way, I really love how this textarea works; how it gets larger when you focus it. Does that only work in Safari/Firefox, or are you using JavaScript like in your contact form to do it?

  2. Martin Neczypor responds:
    Posted: October 15th, 2005 at 3:42 pm

    Yeah dude, that makes sense; nice job ;o).

  3. Because I Write » Blog Archive » Stopping Search Engines responds:
    Posted: October 19th, 2005 at 8:49 pm

    […] A good story about robots.txt is here at this Green-Beast. I still think there must be something out there using the login scheme of WordPress which could […]

  4. Victor Ma responds:
    Posted: November 11th, 2005 at 3:24 am

    The way you handle hover and focus for the textarea is very interesting. I wonder if you know how I can “always” force the cursor to stay inside a textarea (if it is the only input field on the web page)? Right now, when the cursor accidentally touches any place outside the textarea, I lose my cursor position in the textarea and have to move it back to the right place. This has troubled me for a long time.
