Introduction to Robots.txt

Posted October 12th, 2005 by Mike Cherim

There’s a little file called robots.txt that you should be aware of. As its name suggests, it’s a file specifically for robots and spiders… the honest ones anyway. It’s a good file to put on your server, if for no other reason than to save some bandwidth. You may be wondering why. Well, I was, too, not very long ago. I became aware pretty quickly, though, when I encountered a genuine need to place this file in the root directory of all my servers’ domains. Recently my hosting company, GBHXonline.com, invested in some new servers. After setting up the stable of hosted sites and the ones I actually manage, I realized I was dealing with something new.

Namely, a newer version of PHP. Moreover, this version was running in strict mode. I could have changed the mode (heck, with Apache running PHP there’s not a lot you can’t do), but I took another route and started debugging (mostly PHP notices, not so much in the way of errors). Happily, the vast majority of my scripting was fine. In fact, there was very little I needed to do.

I accomplished all this by checking the server error logs domain by domain, crossing my “t”s and dotting my “i”s as I went. One thing bothered me, though: a recurring notice that appeared as spiders crawled the new locations. Apparently the spiders and robots kept looking for a root-level file called “robots.txt.” The file wasn’t on the server, so the ’bots were coming up empty-handed, and because they asked for a file they couldn’t get, the server dutifully mentioned it in its report.

The fix was easy enough: take a blank text file, name it “robots.txt,” and place it in the root of each domain (robots always request it from a fixed address, e.g. http://www.example.com/robots.txt). That’s all it took to silence the reporting and enjoy a completely empty error log. All was fine, but not for long. I wasn’t satisfied with my lazy fix. If I was going to put the file on the server, I figured I may as well restrict some directories and stop making the ’bots work so hard, especially since the bandwidth they consumed was at my expense.
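For the record, an empty file is perfectly valid; the ’bots ask, receive a zero-byte answer, and go on to crawl everything. If you’d rather make that “anything goes” policy explicit, the conventional allow-all file is only two lines; per the exclusion standard, an empty Disallow value puts nothing off-limits:

User-agent: *
Disallow: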

The idea behind a robots.txt file is to close off specific files and directories by telling the spiders not to crawl them. This is done by way of extremely simple directives. First, though, you need to address the spiders. Are you directing the GoogleBot alone, or do you want to direct all spiders? It’s typical to address all of them, and that’s what I did. Like so:

User-agent: *

“User-agent” addresses spiders and robots in general. The asterisk, or star, addresses all of them. Effectively that statement says “Hey robots! Yeah, all of you. Listen up, I have something to tell you.” If you had a thing for the GoogleBot and you wanted to address it only, you’d write the statement like this:

User-agent: googlebot

Now, the next part of this is to specify to the ’bot(s) which directories and/or files you want them to skip (or access, if you choose). Again, this is very simple. Let’s say, for example, that you want to explicitly allow access to a folder called /content/ but you want to keep robots out of your /admin/ folder; you’d add the following lines directly below the User-agent line shown above. (One caution: Allow wasn’t part of the original exclusion standard, so not every ’bot honors it, though major crawlers such as Google’s do.)

Allow: /content/
Disallow: /admin/

The whole file, when complete, would look like this (addressing all ‘bots):

User-agent: *
Allow: /content/
Disallow: /admin/

I said earlier that the robots.txt file saves me bandwidth. This is true, but only if it’s employed smartly. You see, if the ’bots and spiders are kept from crawling a specific file or folder, that file or folder isn’t transferred to the user-agent [1]. No use equals no consumption. For example, if you want to save a lot of bandwidth, disallow access to your /images/ folder.
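Assuming your graphics really do live in a folder named /images/ (adjust the path to match your own site), the complete file for that one job would look like this:

User-agent: *
Disallow: /images/

Bear in mind the trade-off, though: compliant image-search crawlers won’t index those graphics either.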

You can do more and get more specific. Below you’ll see examples of directory exclusions, exclusions of files within directories, root-level file exclusions, specific user-agent addressing, an allowance, and a complete root-level disallow directive. Note the blank lines: each User-agent record stands on its own, and records should be separated this way.

User-agent: *
Disallow: /admin/
Disallow: /images/
# This pound sign marks a robots.txt file comment.
# Put a comment on a line of its own; don't append it to a directive.
Disallow: /cgi-bin/
Disallow: /stats/
Disallow: /cms/
Disallow: /forms/
Disallow: /php/script.php
Disallow: /secret.htm
Disallow: /private.php
Disallow: /clients/login.php

# Below I address the GoogleBot, telling it
# to stay out of the folder and file shown.
User-agent: googlebot
Disallow: /google_envy/ms_2late.html

# In the next lines I tell another 'bot to skip every file
# and folder on the domain with "/". I do, however, want
# this 'bot to access one folder and file:
User-agent: robozilla
Allow: /content/hellozilla.htm
Disallow: /
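One detail worth knowing, straight from the original exclusion standard (though individual parsers vary, so verify for any crawler you care about): a robot obeys only the single record that best matches its name, not the User-agent: * record as well. Given the file above, googlebot would follow its own one-line record and ignore the general exclusions. If you want a named ’bot to honor those too, repeat them under its record, like so:

User-agent: googlebot
Disallow: /admin/
Disallow: /images/
Disallow: /google_envy/ms_2late.html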

I recently made a robots.txt file for GreenBeastCMS users (see related article). There’s no need to have robots indexing those files, so this was provided as a courtesy. The directories and files listed in that copy need to be added to the domain owner’s robots.txt file, or they can use that one outright, even adding some lines of their own. For the CMS this was sort of overkill, as the ’bots can’t actually reach any of those directories anyway; everything leads to the login page unless you’re logged in. Want to see another, larger example? Check out the massive robots.txt file for the White House, or check out Google’s (yeah, they have one too).
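I won’t reproduce the CMS file here, but the general shape of such a file is easy to sketch; note that the paths below are purely illustrative, not the actual GreenBeastCMS listing:

User-agent: *
Disallow: /cms/
Disallow: /cms/admin/
Disallow: /cms/login.php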

So, that’s about it. There are other ways to deal with robots, such as using a meta element in your pages’ heads, <meta name="robots" content="noindex, nofollow">, but the ’bot has to actually fetch the page to see that directive, so robots.txt seems a lot more effective at saving bandwidth, not to mention a lot easier to implement. (Strictly speaking, the two do different jobs: robots.txt discourages crawling, while the meta directive discourages indexing and link-following.) Moreover, robots.txt allows greater specificity: one file can address many ’bots and many paths at once.


Hopefully you can put this info to good use. Robots and spiders aren’t bad. Quite the contrary; they’re generally good. They just need some direction. Please note that the directives given to these web crawlers are not obligatory: they guide the “good” ’bots, which comply voluntarily. Google, for instance, prefers not to waste its time crawling directories that won’t be indexed anyway. The bad ’bots, though, the ones looking for exploitable data such as email addresses and other potentially sensitive information, will not be deterred. In other words, using robots.txt will not provide any measure of additional security. The purpose it does serve, however, is a good one.

[1] A user-agent is any web-access device. “Robots” and “spiders” (terms I use interchangeably), like web browsers such as Internet Explorer, Firefox, Opera, and Netscape, are all user-agents.


7 Responses to: “Introduction to Robots.txt”

  1. Martin Neczypor responds:
    Posted: October 15th, 2005 at 12:42 pm

    Interesting, I should get this going on my site, because I’ve noticed quite a few robots going through my images folder, which is quite full, so it might save me some bandwidth, even though I’m nowhere near my limit :-P . By the way, the design is looking quite nice: Is there any reason why you went with the dark grey for inactive inputs? I would think a green would be more…what’s the word… reasonable? No, that isn’t it, but do you know what I mean?

    By the way, I really love how this textarea works; how it gets larger when you focus it. Does that only work in Safari/Firefox, or are you using JavaScript like in your contact form to do it?

  2. Martin Neczypor responds:
    Posted: October 15th, 2005 at 3:42 pm

    Yeah dude, that makes sense; nice job ;o).

  3. Because I Write » Blog Archive » Stopping Search Engines responds:
    Posted: October 19th, 2005 at 8:49 pm

    […] A good story about robots.txt is here at this Green-Beast. I still think there must be something out there using the login scheme of WordPress which could […]

  4. Victor Ma responds:
    Posted: November 11th, 2005 at 3:24 am

    The way you handle hover and focus for the textarea is very interesting. I wonder if you know how I can “always” force the cursor to stay inside a textarea (if it is the only input field on the web page)? Right now, when the cursor accidentally touches any place outside the textarea, I lose my cursor position in the textarea and have to move it back to the right place. This has troubled me for a long time.
