There’s a little file called robots.txt that you should be aware of. As its name suggests, it’s a file specifically for robots and spiders… the honest ones, anyway. It’s a good file to put on your server, if for nothing else than to save some bandwidth. You may be wondering why. Well, I was, too, not very long ago. I became aware pretty quickly, though, when I encountered a genuine need to place this file in the root directory of all my servers’ domains. Recently my hosting company, GBHXonline.com, invested in some new servers. After setting up the stable of hosted sites and the ones I actually manage, I realized I was dealing with something new.
Namely, a newer version of PHP. Moreover, this version was running in strict mode. I could have changed the mode (heck, with Apache running PHP there’s not a lot you can’t do), but I took another route and started debugging (mostly PHP notices, not much in the way of errors). Thankfully, the vast majority of my scripting was fine. In fact, there was very little I needed to do.
I accomplished all this by checking the server error logs domain by domain, crossing my “t”s and dotting my “i”s as I went. One thing bothered me, though: a recurring notice as spiders crawled the new locations. Apparently the spiders and robots kept looking for a root-level file called “robots.txt.” The file wasn’t on the server, so the ’bots were coming up empty-handed, and because they asked for a file they couldn’t get, the server dutifully noted it in its error log.
The fix was easy enough: take a blank text file, name it “robots.txt,” and place it on the server. That’s all it took to silence the reporting and enjoy a completely empty error log. All was fine, but not for long. I wasn’t satisfied with my lazy fix. If I was going to put the file on the server, I figured I might as well restrict some directories and stop making the ’bots work so hard, especially since they were consuming bandwidth at my expense.
The idea behind a robots.txt file is to close off specific files and directories by telling the spiders not to crawl them. This is done by way of extremely simple directives. First, though, you need to address the spiders. Are you directing only the Googlebot, or do you want to direct all spiders? It’s typical to address all of them, and that’s what I did. Like so.
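In standard robots.txt syntax, that all-encompassing address is a single line:

```
User-agent: *
```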
“User-agent” addresses spiders and robots in general. The asterisk, or star, addresses all of them. Effectively that statement says, “Hey robots! Yeah, all of you. Listen up, I have something to tell you.” If you had a thing for the Googlebot and wanted to address it alone, you’d write the statement like this…
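Google’s crawler identifies itself with the token “Googlebot,” so the line becomes:

```
User-agent: Googlebot
```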
Now, the next part is to tell the ’bot(s) which directories and/or files you want them to skip (or access, if you choose). Again, this is very simple. Let’s say, for example, that you want to allow the ’bots access to a folder called /content/ but keep them out of your /admin/ folder. You’d add the following lines directly beneath the User-agent statement shown above.
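Those two rules look like this (note that Disallow comes from the original robots exclusion standard, while Allow is a later extension honored by the major crawlers):

```
Allow: /content/
Disallow: /admin/
```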
The whole file, when complete, would look like this (addressing all ‘bots):
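Assuming the /content/ and /admin/ example above, the complete file is just three lines:

```
User-agent: *
Allow: /content/
Disallow: /admin/
```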
I had said using the robots.txt file saves me bandwidth. This is true, but only if employed smartly. You see, if the ’bots and spiders are kept from crawling a specific file or folder, that file or folder isn’t transferred to the user-agent.¹ No use equals no consumption. For example, if you want to save a lot of bandwidth, disallow access to your “/images/” folder.
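As a sketch, keeping every compliant ’bot out of an images folder takes only two lines:

```
User-agent: *
Disallow: /images/
```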
You can do more and get more specific. Below you’ll see examples of directory exclusions, directory file exclusions, root-level file exclusions, specific user-agent addressing, an allowance, and a complete root-level disallow directive.
```
User-agent: *
Disallow: /admin/
Disallow: /images/
# This pound sign marks a robots.txt file comment.
# Put it on a new line only. Don't append directives.
Disallow: /cgi-bin/
Disallow: /stats/
Disallow: /cms/
Disallow: /forms/
Disallow: /php/script.php
Disallow: /secret.htm
Disallow: /private.php
Disallow: /clients/login.php

# Below I address the Googlebot, telling it
# to stay out of the folder and file shown
User-agent: Googlebot
Disallow: /google_envy/ms_2late.html

# In the next lines I tell another 'bot to skip every file
# and folder on the domain with "/". I do, however, want
# this 'bot to access one folder and file:
User-agent: robozilla
Allow: /content/hellozilla.htm
Disallow: /
```
I recently made a robots.txt file for GreenBeastCMS users (see related article). There’s no need to have robots indexing those files, so this was provided as a courtesy. The directories and files listed in this copy need to be added to the domain owner’s robots.txt file, or they can use this one as-is, even adding some lines of their own. For the CMS this was sort of overkill, as the ’bots cannot access any of the directories anyway — everything leads to the login page unless the user is logged in. Want to see another, larger example? Check out the massive robots.txt file for the White House, or check out Google’s (yeah, they have one too).
So, that’s about it. There are other ways to deal with robots, such as using meta data in your page headers, e.g. <meta name="robots" content="noindex, nofollow">, but the ’bot has to actually fetch the page to see that directive, so robots.txt seems to be a lot more effective, not to mention a lot easier to implement. Moreover, robots.txt allows greater specificity.
Here are some helpful links you might be able to use if you want more help or information:
- Robotstxt.org Web Robots Pages
- Web Tool Central Robots.txt Generator
- SXW.org Robots.txt Syntax Checker
Hopefully you can put this info to good use. Robots and spiders aren’t bad. Quite the contrary: they are generally good. They just need some direction. Please note that the directives given to these web crawlers are not obligatory. They will guide the “good” ’bots, but compliance is voluntary — Google, for instance, simply prefers not to waste its time crawling directories that won’t be indexed anyway. The bad ’bots, though, the ones looking for exploitable data such as email addresses and other potentially sensitive information, will not be deterred. In other words, robots.txt will not provide any measure of additional security. The purpose it does serve, however, is a good one.
¹ A user-agent is any web-access device. “Robots” and “spiders” (terms I use interchangeably), like web browsers such as Internet Explorer, Firefox, Opera, and Netscape, are all user-agents.