Bots....again, maybe a solution



mjc
05-29-2001, 02:51 AM
Ixl,

Robots
robots.txt can exclude directories from being scanned.
The robot will simply look for a "/robots.txt" URL on your site, where a site is defined as an HTTP server running on a particular host and port number. For example: http://w3.org/robots.txt
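
As a rough sketch of that rule (one robots.txt per scheme, host and port), the location can be worked out from any page URL with Python's standard urllib.parse. The helper name robots_url and the example.com address are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host:port; drop the path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://w3.org/some/page.html"))
print(robots_url("http://example.com:8080/a/b?x=1"))
```

A server on a different port counts as a different site, so it needs its own robots.txt.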

Also, remember that URLs are case sensitive, and "/robots.txt" must be all lower-case.

The "/robots.txt" file usually contains a record looking like this:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/

In this example, three directories are excluded.
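
If you want to check how a well-behaved robot would read that record, here is a minimal sketch using Python's standard urllib.robotparser. The agent name ExampleBot, the example.com host and the test paths are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# The record from the post above.
RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

def allowed(path, agent="ExampleBot"):
    """True if a compliant robot named `agent` may fetch `path`."""
    return parser.can_fetch(agent, "http://example.com" + path)

for path in ["/index.html", "/cgi-bin/search", "/tmp/cache.dat", "/~joe/notes.html"]:
    print(path, allowed(path))
```

Only paths that start with one of the three Disallow prefixes are refused; everything else is reported as fetchable.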

What you want to exclude depends on your server. Everything not explicitly disallowed is considered fair game to retrieve.

To exclude all robots from the entire server

User-agent: *
Disallow: /

To allow all robots complete access

User-agent: *
Disallow:

To exclude a single robot

User-agent: BadBot
Disallow: /

To allow a single robot

User-agent: WebCrawler
Disallow:

User-agent: *
Disallow: /
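
The four variants above can be compared side by side with the same standard-library parser. This is only a sketch: the helper can_fetch, the agent name OtherBot and the example.com URL are hypothetical, while BadBot and WebCrawler are the names used in the records themselves.

```python
from urllib.robotparser import RobotFileParser

def can_fetch(rules, agent, url="http://example.com/page.html"):
    """Parse `rules` as robots.txt text and ask whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)

EXCLUDE_ALL = "User-agent: *\nDisallow: /\n"
ALLOW_ALL = "User-agent: *\nDisallow:\n"
EXCLUDE_ONE = "User-agent: BadBot\nDisallow: /\n"
ALLOW_ONE = "User-agent: WebCrawler\nDisallow:\n\nUser-agent: *\nDisallow: /\n"

for name, rules in [("exclude all", EXCLUDE_ALL), ("allow all", ALLOW_ALL),
                    ("exclude one", EXCLUDE_ONE), ("allow one", ALLOW_ONE)]:
    results = {agent: can_fetch(rules, agent)
               for agent in ("BadBot", "WebCrawler", "OtherBot")}
    print(name, results)
```

An empty Disallow line means nothing is disallowed for that robot, which is why the WebCrawler record lets it in while the catch-all record keeps everyone else out.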

Picked it up here: Tips from Herb (http://webwi.de/data/web.htm)

------------------
mjc
Links list: Computer Links (http://www.fortunecity.com/skyscraper/highrise/11/index.htm)

Celts are the men that heaven made mad, For all their battles are merry and their songs are all sad.

Charles Kozierok
05-29-2001, 10:10 AM
Thanks for the suggestion, but I've had a robots.txt file for some time. The problem is that the poorly-behaved robots just ignore it; it's a voluntary standard.
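
To illustrate why a voluntary standard cannot stop a badly-behaved robot, here is a minimal sketch in Python. Everything in it is hypothetical (the SITE dictionary standing in for a server, the /private/ rule, the bot names); the point is only that the check lives in the crawler, not in the server.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical site: the "server" is just a dict and never consults robots.txt.
SITE = {"/index.html": "home page", "/private/data.html": "private data"}
RULES = "User-agent: *\nDisallow: /private/\n"

def serve(path):
    """Hand back whatever is asked for, as a plain web server would."""
    return SITE.get(path)

def polite_fetch(agent, path):
    """A compliant robot: reads the rules first and skips disallowed paths."""
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    if not parser.can_fetch(agent, "http://example.com" + path):
        return None  # the robot chooses to stay out
    return serve(path)

def rude_fetch(agent, path):
    """A non-compliant robot: never looks at the rules at all."""
    return serve(path)

print(polite_fetch("PoliteBot", "/private/data.html"))
print(rude_fetch("RudeBot", "/private/data.html"))
```

The rude robot gets the page because nothing on the server side enforces the file.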

------------------
Charles M. Kozierok
Webslave, The PC Guide (http://www.PCGuide.com)
Comprehensive PC Reference, Troubleshooting, Optimization and Buyer's Guides...
Note: Please reply to my forum postings here on the forums. Thanks.