Emerging Entity Typing Dataset

Analyzing microblogs, where users post about what they experience, enables applications such as social-trend analysis and entity recommendation. To track emerging trends in a variety of areas, we want to categorize information on emerging entities (e.g., Avatar 2) in microblog posts according to their types (e.g., Film). We thus introduce a new entity typing task that assigns a fine-grained type to each emerging entity when a burst of posts containing that entity is first observed in a microblog. The challenge is to perform typing from noisy microblog posts without relying on prior knowledge of the target entity. To tackle this task, we build large-scale Twitter datasets for English and Japanese using time-sensitive distant supervision. We then propose a modular neural typing model that encodes not only the entity and its contexts but also meta information in multiple posts. To type 'homographic' emerging entities (e.g., 'Go' refers to both an emerging programming language and a classic board game), whose contexts are noisy, we devise a context selector that finds related contexts of the target entity. Experiments on the Twitter datasets confirm the effectiveness of our typing model and the context selector.

Dataset (download)

Since we provide only the IDs of the tweets used in our experiments, you will need to collect the corresponding tweets yourself using those IDs. If you have trouble collecting the tweets, please contact the author. Note that all Japanese tweets must be tokenized using MeCab ver. 0.996 with the ipadic dictionary. Tokenization is easy with MeCab's -O option. For installation, this document might be useful.
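The preparation steps above could be sketched roughly as follows. This is only an illustration under several assumptions, not the authors' actual pipeline: the ID file is assumed to hold one tweet ID per line, the batch size of 100 is an assumed per-request limit for whatever tweet lookup API you use, and "wakati" (space-separated output) is assumed as the format passed to -O, since the page does not say which format is intended. All function names are hypothetical.

```python
import subprocess
from typing import Iterable, Iterator, List


def read_tweet_ids(lines: Iterable[str]) -> List[str]:
    """Collect tweet IDs, assuming one ID per line (hypothetical layout)."""
    ids = []
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            ids.append(line)
    return ids


def batch_ids(ids: List[str], batch_size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive batches of IDs for a lookup API.

    The default of 100 is an assumed per-request limit; adjust it to
    whatever the API you use actually allows.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]


def mecab_command(output_format: str = "wakati") -> List[str]:
    """Build the MeCab command line; the 'wakati' format is an assumption."""
    return ["mecab", "-O", output_format]


def tokenize_japanese(text: str) -> List[str]:
    """Tokenize one tweet by calling the mecab binary (must be on PATH)."""
    result = subprocess.run(
        mecab_command(),
        input=text + "\n",
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    # With the wakati format, tokens are separated by whitespace.
    return result.stdout.split()
```

A typical use would be to open the downloaded ID file, pass the file object to read_tweet_ids, fetch each batch from batch_ids with your own API client, and then run tokenize_japanese on the text of each Japanese tweet. Calling the mecab binary directly, rather than a language binding, keeps the sketch independent of any particular wrapper package.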



