$Date: 2006/09/27 07:13:16 $
Steeler is a web crawler (also known as a robot), software that surfs the web automatically. It is being developed and operated at Kitsuregawa Laboratory, The University of Tokyo. We are working on analyzing and understanding the nature of cyberspace based on the documents collected through this surfing.
While we intend to gather as many published documents as possible, we are keen to avoid disturbing webmasters. If access by Steeler causes you trouble, please indicate this using the Robots Exclusion Standard or contact us as described below. Thank you for your cooperation.
The Robots Exclusion Standard has existed for years and allows webmasters and authors to prevent their material from being crawled. It consists of the following two methods.
If you are a webmaster with appropriate permission, you can specify directives for crawlers in the /robots.txt file at the top of your site (i.e., http://www.your-site.com/robots.txt). For example, the following directives forbid Steeler from retrieving any content from your site.
User-agent: Steeler
Disallow: /
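If you would like to check the effect of these directives before publishing them, the following minimal sketch may help. It uses Python's standard urllib.robotparser module, not Steeler's own parser, so it only approximates how Steeler reads the file. The agent name OtherBot and the page URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The directives shown above, exactly as they would appear in /robots.txt.
rules = """\
User-agent: Steeler
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Hypothetical page on the example site used in this document.
url = "http://www.your-site.com/index.html"

steeler_allowed = parser.can_fetch("Steeler", url)
other_allowed = parser.can_fetch("OtherBot", url)  # hypothetical crawler name

print("Steeler allowed:", steeler_allowed)
print("OtherBot allowed:", other_allowed)
```

With only these two lines in /robots.txt, crawlers other than Steeler are not restricted.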
Note that Steeler may access /robots.txt itself multiple times, because the file is retrieved again whenever its validity expires. You can use the Expires: HTTP header field to specify the expiration date of /robots.txt. If the field is missing, /robots.txt expires after 1 day.
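The expiry rule can be sketched as follows. This is an illustration of the rule as stated here, not Steeler's actual code. The function name robots_expiry is hypothetical, and treating an unparsable Expires: value like a missing one is an assumption.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

# Lifetime applied when no Expires: field is present (1 day, as stated above).
DEFAULT_LIFETIME = timedelta(days=1)


def robots_expiry(fetched_at, expires_header=None):
    """Return the time at which a fetched /robots.txt stops being valid."""
    if expires_header:
        try:
            expires = parsedate_to_datetime(expires_header)
        except (TypeError, ValueError):
            # Assumption: an unparsable value is handled like a missing field.
            expires = None
        if expires is not None:
            if expires.tzinfo is None:
                # HTTP dates are in GMT; make the value timezone-aware.
                expires = expires.replace(tzinfo=timezone.utc)
            return expires
    return fetched_at + DEFAULT_LIFETIME


fetched = datetime(2006, 9, 27, 7, 13, 16, tzinfo=timezone.utc)
print(robots_expiry(fetched))
print(robots_expiry(fetched, "Sat, 30 Sep 2006 00:00:00 GMT"))
```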
For more details on directives, please refer to the revised specification of the Robots Exclusion Protocol (1996), which Steeler obeys. The original specification was established in 1994.
You can also protect your content on a file-by-file basis with robots meta tags. In a nutshell, if you put
<META NAME="robots" CONTENT="noindex,nofollow">
in the head of your HTML documents, Steeler will not follow the links within the documents.
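As an illustration of how a crawler might read this tag, here is a minimal sketch based on Python's standard html.parser module. It is not Steeler's implementation. The names RobotsMetaParser and robots_directives and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for token in attr.get("content", "").split(","):
                token = token.strip().lower()
                if token:
                    self.directives.add(token)


def robots_directives(html):
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page carrying the tag shown above.
page = (
    '<html><head><META NAME="robots" CONTENT="noindex,nofollow"></head>'
    '<body><a href="/next.html">next</a></body></html>'
)
directives = robots_directives(page)
may_follow_links = "nofollow" not in directives
print(sorted(directives), may_follow_links)
```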
Steeler also uses the Referer: HTTP header field, which servers usually log, to indicate the cause of a problem.
Note that a site reachable under different host names may be regarded as different sites. See the next item for more details.
While the intervals between accesses may be as short as several seconds for huge sites, they are adjusted to around one minute for typical sites.
Currently, we interleave the crawling processes with the host alias detection processes. As a result, a server may be accessed simultaneously under multiple host names.
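The per-site intervals can be pictured with the sketch below. It is only an illustration under stated assumptions, not a description of Steeler's scheduler. The class HostScheduler, the 5-second value chosen for "several seconds", and the host names are all hypothetical. Each host name is tracked separately, which is why a site with two names is treated as two sites.

```python
from urllib.parse import urlsplit

TYPICAL_INTERVAL = 60.0   # "around one minute" for typical sites
HUGE_SITE_INTERVAL = 5.0  # assumption: one possible value for "several seconds"


class HostScheduler:
    """Track, per host name, the earliest time the next access may happen."""

    def __init__(self, huge_hosts=()):
        self.huge_hosts = {host.lower() for host in huge_hosts}
        self.next_allowed = {}

    def interval_for(self, host):
        return HUGE_SITE_INTERVAL if host in self.huge_hosts else TYPICAL_INTERVAL

    def earliest_fetch_time(self, url, now):
        """Reserve and return the time (in seconds) at which url may be fetched."""
        host = (urlsplit(url).hostname or "").lower()
        start = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = start + self.interval_for(host)
        return start


scheduler = HostScheduler(huge_hosts=["big.example"])
print(scheduler.earliest_fetch_time("http://www.your-site.com/a.html", 0.0))
print(scheduler.earliest_fetch_time("http://www.your-site.com/b.html", 1.0))
# A different host name is scheduled independently of www.your-site.com.
print(scheduler.earliest_fetch_time("http://your-site.com/a.html", 1.0))
```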
If you have more questions or requests, feel free to e-mail us at crawler@tkl.iis.u-tokyo.ac.jp. Please state your site's host name(s) in your message.