Steeler is a Web crawler (aka robot), a piece of software that surfs the web autonomously, operated at Kitsuregawa Laboratory, The University of Tokyo. If you see a User-Agent string like
Mozilla/5.0 (compatible; Steeler/3.5; http://www.tkl.iis.u-tokyo.ac.jp/~crawler/)
or source IP addresses within our address range in your access logs, those accesses came from Steeler.
While we intend to gather as many published documents as possible to study various social phenomena, we are concerned about disturbing webmasters. If Steeler's accesses are bothering you, please indicate so according to the Robots Exclusion Standard, or contact us as described below. Thank you for your cooperation.
The Robots Exclusion Standard has existed for years to let webmasters and authors keep their material from being crawled. It consists of the following two methods.
If you are a webmaster with appropriate permission, you can give directives to crawlers in the /robots.txt file at the top of your site (i.e., http://www.your-site.com/robots.txt). For example, the following directive forbids Steeler from retrieving any content from your site.
User-agent: Steeler
Disallow: /
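As a minimal sketch of how a crawler evaluates such a rule, Python's standard library can parse the directive above and answer per-agent fetch queries (the URL here is the hypothetical example site from the text):

```python
# Sketch: checking the "User-agent: Steeler / Disallow: /" rule with the
# Python stdlib robots.txt parser. The robots.txt content is the example
# from the text, not a real site's file.
from urllib.robotparser import RobotFileParser

rules = RobotFileParser()
rules.parse([
    "User-agent: Steeler",
    "Disallow: /",
])

# The directive blocks every path for Steeler, but other agents are unaffected.
print(rules.can_fetch("Steeler", "http://www.your-site.com/page.html"))   # False
print(rules.can_fetch("OtherBot", "http://www.your-site.com/page.html"))  # True
```

Note that a rule applies only to the agents named in its User-agent line; agents with no matching record (and no `User-agent: *` record) are allowed by default.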
In addition to path prefixes, Disallow may contain the wildcard character "*" and the end-of-path designator "$". For example, the following forbids access to the contents below the /images directory as well as to files with the .gif suffix.
User-agent: Steeler
Disallow: /images/
Disallow: *.gif$
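The stdlib robots.txt parser predates wildcard support, so a crawler must expand "*" and "$" itself. One way, sketched here, is to translate each Disallow pattern into a regular expression (the pattern strings are the examples above; `disallow_to_regex` is an illustrative helper, not a standard API):

```python
# Sketch: translating robots.txt wildcard patterns into regexes.
import re

def disallow_to_regex(pattern: str) -> "re.Pattern":
    # Escape regex metacharacters, then restore "*" as ".*".
    # A trailing "$" in the robots pattern anchors the match at path end;
    # without it, the pattern matches as a prefix.
    anchored = pattern.endswith("$")
    body = re.escape(pattern.rstrip("$")).replace(r"\*", ".*")
    return re.compile("^" + body + ("$" if anchored else ""))

gif_rule = disallow_to_regex("*.gif$")
dir_rule = disallow_to_regex("/images/")

print(bool(gif_rule.match("/photos/cat.gif")))    # True
print(bool(gif_rule.match("/photos/cat.gif?x")))  # False ("$" anchors the end)
print(bool(dir_rule.match("/images/logo.png")))   # True (prefix match)
```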
If the frequency of access matters, specify the Crawl-delay parameter. For example, the following directs Steeler to access the site at most once every 30 seconds.
User-agent: Steeler
Crawl-delay: 30.0
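A sketch of how a polite fetch loop can read Crawl-delay with the stdlib parser and pause between requests. Note that the stdlib parser accepts only whole-second integer values and silently ignores fractional ones, so this sketch uses 30 rather than 30.0:

```python
# Sketch: reading Crawl-delay and sleeping between fetches.
import time
from urllib.robotparser import RobotFileParser

rules = RobotFileParser()
rules.parse([
    "User-agent: Steeler",
    "Crawl-delay: 30",
])

delay = rules.crawl_delay("Steeler")
print(delay)  # 30

# A polite fetch loop would then sleep between requests, e.g.:
# for url in frontier:
#     fetch(url)          # hypothetical fetch function
#     time.sleep(delay)
```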
If you can edit the HTML sources (or templates), you can also protect contents on a file-by-file basis with robots meta tags. In a nutshell, if you put
<META NAME="robots" CONTENT="noindex,nofollow">
in the header of your HTML documents, Steeler will neither index those documents nor follow the links found in them.
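As a sketch of how a crawler might detect this tag while parsing a fetched page, the stdlib HTMLParser is enough (`RobotsMetaParser` is an illustrative class name, not part of any library):

```python
# Sketch: detecting <META NAME="robots"> directives with the stdlib parser.
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.noindex = False
        self.nofollow = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            directives = {d.strip().lower()
                          for d in attrs.get("content", "").split(",")}
            self.noindex = "noindex" in directives
            self.nofollow = "nofollow" in directives

page = '<html><head><META NAME="robots" CONTENT="noindex,nofollow"></head></html>'
parser = RobotsMetaParser()
parser.feed(page)
print(parser.noindex, parser.nofollow)  # True True
```

A crawler that sees `nofollow` would then discard the page's outgoing links instead of adding them to its frontier.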
Note that Steeler obeys the newer Robots Exclusion Protocol that major search engines adopted around 2008, a revision of the original protocol proposed in the 1990s.
Crawling starts from a small set of URLs called seeds. URLs not included in the seeds have to be discovered in fetched pages as the crawl proceeds. Steeler uses the Referer HTTP header field, which is usually logged, to indicate the entry point into your site.
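Concretely, when a crawler follows a link found on another page, the request for the newly discovered URL can carry that page's address in the Referer header, which then shows up in your access log. A sketch with urllib (both URLs are hypothetical):

```python
# Sketch: attaching a Referer header when fetching a discovered URL.
import urllib.request

referring_page = "http://example.org/links.html"  # page where the link was found
target = "http://www.your-site.com/page.html"     # newly discovered URL

req = urllib.request.Request(target, headers={
    "User-Agent": "Mozilla/5.0 (compatible; Steeler/3.5; "
                  "http://www.tkl.iis.u-tokyo.ac.jp/~crawler/)",
    "Referer": referring_page,
})
print(req.get_header("Referer"))  # http://example.org/links.html
# urllib.request.urlopen(req) would then perform the actual fetch.
```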
Because we cannot determine the validity of discovered URLs without actually accessing them, pages with stale links may cause many 404 (Not Found) errors (unless you have blocked crawlers). Though this may look malicious, we can hardly avoid it.
Steeler accesses each site page by page, with an interval between requests. While the interval for huge sites may be as short as several seconds, it is adjusted to around several tens of seconds for typical sites. If the access interval is problematic, please specify the Crawl-delay directive described above in your /robots.txt.
Note that a site reachable under different names may be regarded as several distinct sites and therefore receive more accesses. See the next item for details.
Some servers have multiple names, accidentally or intentionally (e.g., http://www.foo.com/, http://bar.foo.com/, etc.). We cannot simply choose one of the names, because they may be intended as virtual hosts serving individual contents. Although we interleave the crawling processes with host-alias detection, an aliased server may occasionally be accessed simultaneously under different host names.
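One rough sketch of how alias detection can work is to resolve each host name and group names that share an IP address. A shared address only suggests aliasing; as noted above, virtual hosts on one address may still serve distinct content, so such groups cannot simply be collapsed (`group_by_ip` is an illustrative helper, and the host names in the usage note are hypothetical):

```python
# Sketch: grouping host names by resolved IP address as an alias heuristic.
import socket
from collections import defaultdict

def group_by_ip(hostnames):
    groups = defaultdict(set)
    for name in hostnames:
        try:
            ip = socket.gethostbyname(name)
        except socket.gaierror:
            continue  # unresolvable names are skipped
        groups[ip].add(name)
    return groups

# Usage (hypothetical names): names resolving to the same address end up
# in the same group, flagging them as possible aliases.
# group_by_ip(["www.foo.com", "bar.foo.com"])
```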
If you have further questions or requests, feel free to email us at crawler (at) tkl.iis.u-tokyo.ac.jp (replace "(at)" with @). Please include the host name(s) and IP address(es) of your site in your message.