Steeler is a Web crawler (aka robot), a piece of software that surfs the web autonomously, operated at Kitsuregawa Laboratory, The University of Tokyo. If you see a User-Agent string like
Mozilla/5.0 (compatible; Steeler/3.5; http://www.tkl.iis.u-tokyo.ac.jp/~crawler/)
or source IP addresses within our address range in your access logs, those accesses came from Steeler.
While we intend to gather as many published documents as possible to study various social phenomena, we are concerned about disturbing webmasters. If Steeler's accesses are bothering you, please indicate so according to the Robots Exclusion Standard, or contact us as described below. Thank you for your cooperation.
The Robots Exclusion Standard has existed for years to let webmasters and authors keep their material from being crawled. It consists of the following two methods.
If you are a webmaster with appropriate permission, you can give directives to crawlers in the /robots.txt file at the top of your site (i.e., http://www.your-site.com/robots.txt). For example, the following directive forbids Steeler from retrieving any content from your site.
User-agent: Steeler
Disallow: /
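As a minimal sketch of how a crawler evaluates such a rule, Python's standard library can parse the directive above and answer per-agent fetch queries (the URL here is the hypothetical example site from the text):

```python
# Sketch: checking the "User-agent: Steeler / Disallow: /" rule with the
# Python stdlib robots.txt parser. The robots.txt content is the example
# from the text, not a real site's file.
from urllib.robotparser import RobotFileParser

rules = RobotFileParser()
rules.parse([
    "User-agent: Steeler",
    "Disallow: /",
])

# The directive blocks every path for Steeler, but other agents are unaffected.
print(rules.can_fetch("Steeler", "http://www.your-site.com/page.html"))   # False
print(rules.can_fetch("OtherBot", "http://www.your-site.com/page.html"))  # True
```

Note that a rule applies only to the agents named in its User-agent line; agents with no matching record (and no `User-agent: *` record) are allowed by default.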
In addition to path prefixes, Disallow may contain the wildcard character "*" and the end-of-path designator "$". For example, the following forbids access to the contents below the /images directory as well as to files with the .gif suffix.
User-agent: Steeler
Disallow: /images/
Disallow: *.gif$
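The stdlib robots.txt parser predates wildcard support, so a crawler must expand "*" and "$" itself. One way, sketched here, is to translate each Disallow pattern into a regular expression (the pattern strings are the examples above; `disallow_to_regex` is an illustrative helper, not a standard API):

```python
# Sketch: translating robots.txt wildcard patterns into regexes.
import re

def disallow_to_regex(pattern: str) -> "re.Pattern":
    # Escape regex metacharacters, then restore "*" as ".*".
    # A trailing "$" in the robots pattern anchors the match at path end;
    # without it, the pattern matches as a prefix.
    anchored = pattern.endswith("$")
    body = re.escape(pattern.rstrip("$")).replace(r"\*", ".*")
    return re.compile("^" + body + ("$" if anchored else ""))

gif_rule = disallow_to_regex("*.gif$")
dir_rule = disallow_to_regex("/images/")

print(bool(gif_rule.match("/photos/cat.gif")))    # True
print(bool(gif_rule.match("/photos/cat.gif?x")))  # False ("$" anchors the end)
print(bool(dir_rule.match("/images/logo.png")))   # True (prefix match)
```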
If the frequency of access matters, specify the Crawl-delay parameter. For example, the following directs Steeler to access the site at most once every 30 seconds.
User-agent: Steeler
Crawl-delay: 30.0
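A sketch of how a polite fetch loop can read Crawl-delay with the stdlib parser and pause between requests. Note that the stdlib parser accepts only whole-second integer values and silently ignores fractional ones, so this sketch uses 30 rather than 30.0:

```python
# Sketch: reading Crawl-delay and sleeping between fetches.
import time
from urllib.robotparser import RobotFileParser

rules = RobotFileParser()
rules.parse([
    "User-agent: Steeler",
    "Crawl-delay: 30",
])

delay = rules.crawl_delay("Steeler")
print(delay)  # 30

# A polite fetch loop would then sleep between requests, e.g.:
# for url in frontier:
#     fetch(url)          # hypothetical fetch function
#     time.sleep(delay)
```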
If you can edit the HTML sources (or templates), you can also protect contents on a file-by-file basis with robots meta tags. In a nutshell, if you put
<META NAME="robots" CONTENT="noindex,nofollow">
in the header of your HTML documents, Steeler will neither index those documents nor follow the links found in them.
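As a sketch of how a crawler might detect this tag while parsing a fetched page, the stdlib HTMLParser is enough (`RobotsMetaParser` is an illustrative class name, not part of any library):

```python
# Sketch: detecting <META NAME="robots"> directives with the stdlib parser.
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.noindex = False
        self.nofollow = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            directives = {d.strip().lower()
                          for d in attrs.get("content", "").split(",")}
            self.noindex = "noindex" in directives
            self.nofollow = "nofollow" in directives

page = '<html><head><META NAME="robots" CONTENT="noindex,nofollow"></head></html>'
parser = RobotsMetaParser()
parser.feed(page)
print(parser.noindex, parser.nofollow)  # True True
```

A crawler that sees `nofollow` would then discard the page's outgoing links instead of adding them to its frontier.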
Note that Steeler obeys the newer Robots Exclusion Protocol that major search engines adopted around 2008, a revision of the original protocol proposed in the 1990s.
Crawling starts from a small set of URLs called seeds. URLs not included in the seeds have to be discovered in fetched pages as the crawl proceeds. Steeler uses the Referer HTTP header field, which is usually logged, to indicate the entry point into your site.
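Concretely, when a crawler follows a link found on another page, the request for the newly discovered URL can carry that page's address in the Referer header, which then shows up in your access log. A sketch with urllib (both URLs are hypothetical):

```python
# Sketch: attaching a Referer header when fetching a discovered URL.
import urllib.request

referring_page = "http://example.org/links.html"  # page where the link was found
target = "http://www.your-site.com/page.html"     # newly discovered URL

req = urllib.request.Request(target, headers={
    "User-Agent": "Mozilla/5.0 (compatible; Steeler/3.5; "
                  "http://www.tkl.iis.u-tokyo.ac.jp/~crawler/)",
    "Referer": referring_page,
})
print(req.get_header("Referer"))  # http://example.org/links.html
# urllib.request.urlopen(req) would then perform the actual fetch.
```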
Because we cannot determine the validity of discovered URLs without actually accessing them, pages with stale links may cause many 404 (Not Found) errors (unless you have blocked crawlers). Though this may look malicious, we can hardly avoid it.
Steeler accesses each site page by page, with an interval between requests. While the interval for huge sites may be as short as several seconds, it is adjusted to around several tens of seconds for typical sites. If the access interval is problematic, please specify the Crawl-delay directive described above in your /robots.txt.
Note that a site reachable under different names may be regarded as several distinct sites and therefore receive more accesses. See the next item for details.
Some servers have multiple names, accidentally or intentionally (e.g., http://www.foo.com/, http://bar.foo.com/, etc.). We cannot simply choose one of the names, because they may be intended as virtual hosts serving individual contents. Although we interleave the crawling processes with host-alias detection, an aliased server may occasionally be accessed simultaneously under different host names.
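One rough sketch of how alias detection can work is to resolve each host name and group names that share an IP address. A shared address only suggests aliasing; as noted above, virtual hosts on one address may still serve distinct content, so such groups cannot simply be collapsed (`group_by_ip` is an illustrative helper, and the host names in the usage note are hypothetical):

```python
# Sketch: grouping host names by resolved IP address as an alias heuristic.
import socket
from collections import defaultdict

def group_by_ip(hostnames):
    groups = defaultdict(set)
    for name in hostnames:
        try:
            ip = socket.gethostbyname(name)
        except socket.gaierror:
            continue  # unresolvable names are skipped
        groups[ip].add(name)
    return groups

# Usage (hypothetical names): names resolving to the same address end up
# in the same group, flagging them as possible aliases.
# group_by_ip(["www.foo.com", "bar.foo.com"])
```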
If you have further questions or requests, feel free to email us at crawler (at) tkl.iis.u-tokyo.ac.jp (replace "(at)" with @). Please include the host name(s) and IP address(es) of your site in your message.