Robot's Exclusion
by Brian D. Chmielewski

The foundation of our business has been built upon helping people get their web sites indexed better within the spider-based search engines. Yet today we are going to show you how to prevent search engines from indexing certain pages within your web site. If you remember our earlier discussion of how the search engines work, you know that the spider-based search engines use "robots" (also called robot spiders or crawlers) to tirelessly and continually index the sites they find on the web.

But what if you have certain pages within your site that you do not want to be indexed? There are many potential reasons to keep some pages out of the search engines. For instance, you may want to prevent people from seeing pages that are under construction. You may have areas within your site that hold secure or proprietary information about your company. Real estate agents, for example, often have temporary pages for properties that they are selling; once a property is sold, its page no longer needs to be found in a search engine.

So, what can you do to exclude some segments of your site from public view, while allowing other segments of your site to be spidered for indexing in a search engine's database? The answer is to set up the instruments of the Robots Exclusion Standard, which allows robots to identify portions of a site that should not be visited.

Termed the Robots Exclusion Protocol, this is a method that allows web site administrators to indicate to visiting robots which parts of their site should not be visited. When a robot first visits a web site, it looks for its marching orders in the site's robots.txt file. If it can find this file, it will analyze the contents to see which parts of the site it may retrieve. You can customize the robots.txt file to apply only to specific robots, and to disallow access to specific directories or files. Following is a portion of uPromote's robots.txt file. The entire file can be viewed at http://www.uPromote.com/robots.txt.

# robots.txt for http://upromote.com/
User-agent: *
Allow: /cgi-bin/se/t
Allow: /upromo/temp/TEST
Disallow: /upbla
Disallow: /makes
Disallow: /upword

All robots.txt records start with one or more User-agent lines, specifying which robot(s) the record applies to. What's a User-agent line? It is simply a line that indicates which agent - the technical name for a spider, crawler, scout, etc. - should follow the record. To evaluate whether access to a URL is allowed, a robot must attempt to match the paths in the Allow and Disallow lines against the URL, in the order they occur in the record. The first match found is used. If no match is found, the default assumption is that the URL is allowed.
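The first-match rule described above can be sketched in a few lines of Python. This is only an illustration, not the code any particular search engine runs: the names RULES and is_allowed are hypothetical, the rules are copied from the sample record, and a "match" is assumed to mean that the URL path starts with the path given in the rule.

```python
# Rules from one record, kept in file order as (directive, path prefix) pairs.
# The paths are taken from the sample robots.txt record shown earlier.
RULES = [
    ("allow", "/cgi-bin/se/t"),
    ("allow", "/upromo/temp/TEST"),
    ("disallow", "/upbla"),
    ("disallow", "/makes"),
    ("disallow", "/upword"),
]


def is_allowed(url_path, rules):
    """Return True if url_path may be fetched, using the first rule that matches."""
    for directive, prefix in rules:
        if url_path.startswith(prefix):
            # The first matching line decides; later lines are ignored.
            return directive == "allow"
    # No line matched, so the URL is allowed by default.
    return True


print(is_allowed("/upbla/page.html", RULES))  # False
print(is_allowed("/index.html", RULES))       # True
```

Because the first match wins, the order of the lines matters: a Disallow for /a placed before an Allow for /a/b would block /a/b under this rule.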

The symbol "*" in the User-agent line indicates that the record applies to all robots. As mentioned above, you can have multiple User-agent lines and change the "*" to a robot's specific name token (ex. Gulliver for Northernlight, ArchitextSpider for Excite, or Slurp/2.0 for HotBot) to target a single robot. The part following the Allow or Disallow instruction indicates the directory that the instruction covers. In the above example, all spiders will be excluded from the directories titled /upbla, /makes, and /upword, and allowed to index the directories titled /cgi-bin/se/t and /upromo/temp/TEST along with all directories not mentioned in the robots.txt. If you place "/" by itself after Disallow, like NASA did in its robots.txt, your entire site will NOT be indexed.
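To see how records aimed at different robots behave, the sketch below uses Python's standard urllib.robotparser module. The robots.txt text, the site www.example.com, and the robot name OtherBot are made up for illustration; Gulliver is the name token mentioned above. One record shuts Gulliver out of the whole site with "Disallow: /", while the "*" record covers every other robot.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: one record for a single robot, one for all others.
ROBOTS_TXT = """\
User-agent: Gulliver
Disallow: /

User-agent: *
Disallow: /makes
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Gulliver is excluded from the entire site.
print(parser.can_fetch("Gulliver", "http://www.example.com/index.html"))    # False
# Any other robot falls under the "*" record.
print(parser.can_fetch("OtherBot", "http://www.example.com/index.html"))    # True
print(parser.can_fetch("OtherBot", "http://www.example.com/makes/a.html"))  # False
```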

Many servers have a robots.txt file in place. You can view the public allow and disallow orders of participating sites by entering a site's root URL in the location box on your browser and adding /robots.txt to the end of it.
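Since the robots.txt file sits at the top level of a site, its address can be worked out from the address of any page on that site. The helper below is a small sketch using Python's standard urllib.parse module; the function name robots_url and the example address are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Build the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.example.com/products/list.html?page=2"))
# http://www.example.com/robots.txt
```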

If you don't own your server and the administrator of your site is not receptive to the idea of robots.txt, you can use HTML robots META tags to request exclusion instead. At this time, not all search engines recognize the robots META tag as an exclusion indicator, but for the majority of us, it is the easiest method for installing and maintaining exclusion capabilities in the desired web pages.

Just as META "description" and "keywords" tags provide descriptive information about your site, the META "robots" tag tells robots which pages not to probe for that description. The basic idea is that if you include a robots META tag with the content "noindex,nofollow" in the HEAD element of your HTML document, that document will not be indexed, and no links on the page will be followed. To indicate to visiting robots that a document may not be indexed, place the tag in the HEAD of your HTML as follows.

<HTML>
<HEAD>
<TITLE>Your Site Title</TITLE>
<META name="robots" content="noindex,nofollow">
<META name="description" content="Your web site description">
<META name="keywords" content="Your unique keywords">
</HEAD>
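If you want to check that a page really carries the robots META tag, a few lines of Python using the standard html.parser module can read it back. This is a minimal sketch: the class RobotsMetaParser, the function robots_directives, and the sample PAGE are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in META name="robots" tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lowercase.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        if (attr_map.get("name") or "").lower() == "robots":
            content = attr_map.get("content") or ""
            for token in content.split(","):
                if token.strip():
                    self.directives.add(token.strip().lower())


def robots_directives(html_text):
    """Return the set of robots directives declared in html_text."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    parser.close()
    return parser.directives


PAGE = """<HTML><HEAD>
<TITLE>Your Site Title</TITLE>
<META name="robots" content="noindex,nofollow">
</HEAD></HTML>"""

print(sorted(robots_directives(PAGE)))  # ['nofollow', 'noindex']
```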

Adherence to the Protocol for Robots Exclusion is proof positive that we can play nicely on the Internet. Absolutely nobody is required to participate in this particular method for exclusion, and there are no guarantees that some robots will not visit restricted parts of your URL space. In the words of the official Protocol for Robots Exclusion web site, this method "is not an official standard backed by a standards body, or owned by any commercial organization. It is not enforced by anybody, and there no guarantee that all current and future robots will use it."

Therefore, you should take the proper measures to ensure that restricted information is really being protected from public display. Failure to use proper authentication or other restrictions may result in exposure of restricted information. The directory paths listed in your robots.txt file may themselves expose the existence of resources not otherwise linked to on the site, which may aid people guessing at URLs. So, if someone can publicly view which server directories are excluded, they may attempt to infiltrate your site.
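To illustrate how little effort it takes to read those paths, the sketch below pulls every Allow and Disallow path out of a robots.txt file. The function name listed_paths is hypothetical, and the sample text is the record shown earlier in this article; anyone who can download the file can do the same.

```python
SAMPLE = """\
# robots.txt for http://upromote.com/
User-agent: *
Allow: /cgi-bin/se/t
Allow: /upromo/temp/TEST
Disallow: /upbla
Disallow: /makes
Disallow: /upword
"""


def listed_paths(robots_txt):
    """Return every path named on an Allow or Disallow line, in file order."""
    paths = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        if field.strip().lower() in ("allow", "disallow") and value.strip():
            paths.append(value.strip())
    return paths


print(listed_paths(SAMPLE))
# ['/cgi-bin/se/t', '/upromo/temp/TEST', '/upbla', '/makes', '/upword']
```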

Alternatively, you can incorporate the Perl script for Robot's Exclusion into each web page that you do not want to be spidered. Visit http://www.vperson.com/mlm/rnw.html to get the script. The Protocol for Robots Exclusion web site is at http://info.webcrawler.com/mak/projects/robots/robots.html.