Here’s an interesting article about robots.txt
files and perhaps an even more interesting discussion in the ensuing comments.
The article takes issue with what the robots.txt file does and where it is placed. It raises some interesting points that I enjoyed thinking about in that part of my brain that spins off and thinks about things while the rest of my brain tries to stay focused. The author says that, since the Robot Exclusion Protocol requires robots.txt
to reside in a hard-coded location with respect to your domain name, it basically requires all legitimate robots to fish for information from your site: before they request a single page, they must first request a robots.txt
file that may or may not exist. They don’t follow a link to the file, the way you get to all other files on the WWW. They simply reach out to see whether a particular file exists on your domain without any real reason to suspect that it does. That’s fishing. And it uses bandwidth even for sites that have no robots.txt, because those sites still have to return a 404 error page.
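To make the fishing concrete, here’s a minimal Python sketch of what a polite crawler does under the current protocol; the domain and user-agent name are just placeholders, not anything from the article.

```python
import urllib.robotparser

# A minimal sketch of the "fishing" step described above: before fetching any
# page, a well-behaved robot requests /robots.txt at the hard-coded location
# relative to the domain, with no link telling it the file is there.
# The domain and user-agent name are placeholders.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # fixed, protocol-mandated path
rp.read()  # a site with no robots.txt simply answers this request with a 404

# Only after that blind request does the robot decide whether it may crawl a page.
print(rp.can_fetch("ExampleBot", "https://example.com/some/page.html"))
```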
Now that in itself doesn’t strike me as a compelling reason to insist that the protocol specify a link-based system for robots to discover your robots.txt
file if it exists, but it does raise the question of how many files may eventually be placed in hard-coded locations and therefore require fishing to find them. Once we have 100 such files, will we be tired of such bandwidth-draining requests for files specified by protocols we don’t support on our site and begin wishing for a link-y method for a robot to discover whether we have that file on our site?
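If the protocol ever did go link-y, the discovery step might look something like the sketch below: the robot parses a page it was going to fetch anyway and only asks for the policy file if the page advertises one. The rel value "robots-policy" is invented purely for illustration; nothing like it exists in the actual protocol.

```python
import urllib.request
from html.parser import HTMLParser

# Hypothetical sketch of link-based discovery: the rel value "robots-policy"
# is invented for illustration and is not part of any real protocol.
class PolicyLinkFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.policy_href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and attrs.get("rel") == "robots-policy":
            self.policy_href = attrs.get("href")

page = urllib.request.urlopen("https://example.com/").read().decode("utf-8", "replace")
finder = PolicyLinkFinder()
finder.feed(page)
# None means the site advertises no policy file, and no 404 was wasted finding that out.
print(finder.policy_href)
```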