Robot's Exclusion
by Brian D. Chmielewski
The foundation of our business has been built upon helping people get
their web sites indexed better within the spider-based search engines.
Yet, today we are going to show you how to prevent search engines
from indexing certain pages within your web site.
If you remember from our earlier discussion of how the search engines
work, you know that the spider-based search engines use "robots"
(also called spiders or crawlers) to tirelessly and continually
index the sites they find on the web.
But what if you have certain pages within your site that you do not
want to appear in search results? There are many reasons to keep
some pages from being indexed. For instance, you may want to prevent
people from seeing pages that are under construction. You may have
areas within your site that hold sensitive or proprietary information
about your company. As another example, real estate agents often have
temporary pages for properties that they are selling, but once the
properties are sold the pages no longer need to be found in a search
engine.
So, what can you do to exclude some segments of your site from public
view, while allowing other segments of your site to be spidered for
indexing in a search engine's database? The answer is to set up the
instruments of the Robots Exclusion Standard, which allows robots
to identify portions of a site that should not be visited.
Also termed the Robots Exclusion Protocol, this is a method that allows
web site administrators to indicate to visiting robots which parts of
their site should not be visited. When a robot first visits a web
site, it looks for its marching orders in the robots.txt file at the
top level of the site. If it finds this file, it analyzes the contents
to see which pages it may retrieve. You can customize the robots.txt
file to apply only to specific robots, and to disallow access to
specific directories or files. Following is a portion of WebPromote's
robots.txt file. The entire file can be viewed at
http://www.uPromote.com/robots.txt.
# robots.txt for http://upromote.com/
User-agent: *
Allow: /cgi-bin/se/t
Allow: /upromo/temp/TEST
Disallow: /upbla
Disallow: /makes
Disallow: /upword
All robots.txt records start with one or more User-agent lines,
specifying which robot(s) the record applies to. What's a User-agent
line? It is simply a line that indicates which agent - the technical
name for a spider, crawler, scout, etc. - should follow the
record. To evaluate whether access to a URL is allowed, a robot must
attempt to match the paths in the Allow and Disallow lines against the
URL, in the order they occur in the record. The first match found is
used. If no match is found, the default assumption is that the URL is
allowed.
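The first-match rule described above can be sketched in a few lines of Python. This is only an illustration, not any search engine's actual code: the RULES list copies the sample record, the helper name is_allowed is made up for this example, and the sketch assumes that a rule "matches" when the URL path begins with the rule's path.

```python
# Rules from the sample record, in the order they appear in the file.
RULES = [
    ("allow", "/cgi-bin/se/t"),
    ("allow", "/upromo/temp/TEST"),
    ("disallow", "/upbla"),
    ("disallow", "/makes"),
    ("disallow", "/upword"),
]


def is_allowed(path, rules=RULES):
    """Return True if a robot may fetch `path` under the first-match rule.

    Hypothetical helper: assumes a rule matches when the URL path
    starts with the rule's path (simple prefix matching).
    """
    for kind, prefix in rules:
        if path.startswith(prefix):
            # The first matching line decides; later lines are ignored.
            return kind == "allow"
    # No line matched, so the URL is allowed by default.
    return True
```

Under these assumptions, is_allowed("/upbla/page.html") is refused by the first Disallow line that matches, while a path such as "/index.html" matches nothing and is therefore allowed.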
The "*" symbol in the User-agent line indicates that the record applies
to all robots. As mentioned above, you can have multiple
User-agent lines and change the "*" to a robot's specific name token
(e.g., Gulliver for Northernlight, ArchitextSpider for Excite, or
Slurp/2.0 for HotBot) to single out one robot. The part following
the Allow or Disallow instruction indicates the directory or file the
instruction applies to. In the above example, all spiders
are excluded from the directories titled /upbla, /makes, and
/upword, and are allowed to index /cgi-bin/se/t,
/upromo/temp/TEST, and every directory not mentioned in the
robots.txt. If you place "/" by itself after Disallow, like
NASA did in its robots.txt, your entire site will NOT be indexed.
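To illustrate how a robot might decide which record applies to it, here is a rough Python sketch. The RECORDS data and the select_record helper are invented for this example and are not taken from any real robots.txt. The sketch also assumes that a name token matches when it appears, ignoring case, anywhere in the robot's own name, and that the "*" record is used only when no named record matches.

```python
# Hypothetical records: (list of User-agent tokens, list of rules).
RECORDS = [
    (["Gulliver"], [("disallow", "/")]),         # shut one robot out entirely
    (["*"], [("disallow", "/upbla")]),           # every other robot
]


def select_record(records, robot_name):
    """Return the rules for `robot_name`, falling back to the "*" record.

    Assumption: a token matches if it occurs, case-insensitively,
    inside the robot's name (e.g. "gulliver" in "Gulliver/1.2").
    """
    name = robot_name.lower()
    fallback = None
    for agents, rules in records:
        for agent in agents:
            if agent == "*":
                fallback = rules          # remember the catch-all record
            elif agent.lower() in name:
                return rules              # a named record wins outright
    # No named record matched: use "*" if present, else no restrictions.
    return fallback if fallback is not None else []
```

With this data, a robot calling itself "Gulliver/1.2" receives the record that disallows "/", while any other robot receives the "*" record.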
Many servers have a robots.txt file in place. You can view the public
Allow and Disallow orders of participating sites by entering the site's
root URL in the location box of your browser and adding /robots.txt to
the end of it.
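If you would rather work out the address in a script, the small Python sketch below builds the robots.txt URL for the site that serves a given page. The function name robots_url and the example.com address are placeholders chosen for this illustration. The sketch relies on the fact that robots.txt lives at the top level of a site.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Build the robots.txt address for the site that serves `page_url`."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace the path and drop query/fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    # Placeholder address for illustration only.
    print(robots_url("http://www.example.com/listings/house.html?id=7"))
```

You can paste the resulting address into your browser to read the file.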
If you don't own your server and the administrator of your site is
not receptive to the idea of robots.txt, you can use HTML robots META
tags to get exclusion support. At this time, not all search engines
recognize the robots META tag as an exclusion indicator, but for the
majority of us, it is the easiest way to add and maintain
exclusion capabilities in the desired web pages.
Just as the META "description" and "keywords" tags provide descriptive
information about your site, the META "robots" tag tells robots which
pages not to index. The basic idea is that if you include
a <META name="robots" content="noindex,nofollow"> tag in the <HEAD>
element of your HTML document, that document will
not be indexed, and no links on the page will be followed. To
indicate to visiting robots that a document may not be indexed, place
the tag in the <HEAD> of your HTML as follows.
<HTML>
<HEAD>
<TITLE>Your Site Title</TITLE>
<META name="robots" content="noindex,nofollow">
<META name="description" content="Your web site description">
<META name="keywords" content="Your unique keywords">
</HEAD>
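To check that a page really carries the robots META tag, you could read it back with a short script. The Python sketch below uses the standard html.parser module. The class name RobotsMetaParser and the function robots_directives are made up for this example, and the sketch looks only at the tag's content attribute, which is just one plausible way a robot might read it.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in META name="robots" tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for item in content.split(","):
            item = item.strip().lower()
            if item:
                self.directives.add(item)


def robots_directives(html_text):
    """Return the set of robots directives declared in `html_text`."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return parser.directives
```

For a page containing the tag described above, robots_directives returns a set holding "noindex" and "nofollow"; for a page without the tag it returns an empty set.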
Adherence to the Protocol for Robots Exclusion is proof positive
that we can play nicely on the Internet. Absolutely nobody is
required to participate in this particular method for exclusion, and
there is no guarantee that every robot will stay out of the restricted
parts of your URL space. In the words of the official Protocol for
Robots Exclusion web site, this method "is not an official standard
backed by a standards body, or owned by any commercial organization.
It is not enforced by anybody, and there no guarantee that all current
and future robots will use it."
Therefore, you should take proper measures to ensure that
restricted information is really protected from public display.
Failure to use proper authentication or other restrictions may result
in exposure of restricted information. Note also that the
paths listed in your robots.txt file may
expose the existence of resources not otherwise linked to on the site,
which may help people guess URLs. So, if someone can publicly
view the directories that you have excluded, they may attempt to
access those parts of your site.
Alternatively, you can incorporate the Perl script for
Robot's Exclusion into each web page that you do not want to be
spidered. Visit http://www.vperson.com/mlm/rnw.html to get the script.
The Protocol for Robots Exclusion web site is at
http://info.webcrawler.com/mak/projects/robots/robots.html.