Emerging Entity Typing Dataset

Analyzing microblogs, where people post what they experience, enables applications such as social-trend analysis and entity recommendation. To track emerging trends in a variety of areas, we want to categorize information on emerging entities (e.g., Avatar 2) in microblog posts according to their types (e.g., Film). We thus introduce a new entity typing task that assigns a fine-grained type to each emerging entity when a burst of posts containing that entity is first observed in a microblog. The challenge is to perform typing from noisy microblog posts without relying on prior knowledge of the target entity. To tackle this task, we build large-scale Twitter datasets for English and Japanese using time-sensitive distant supervision. We then propose a modular neural typing model that encodes not only the entity and its contexts but also meta information in multiple posts. To type 'homographic' emerging entities (e.g., 'Go' can refer to both an emerging programming language and a classic board game), whose contexts are noisy, we devise a context selector that finds contexts related to the target entity. Experiments on the Twitter datasets confirm the effectiveness of our typing model and the context selector.

Dataset (download)

Since we provide only the IDs of the tweets used in our experiments, you need to collect the corresponding tweets yourself using those IDs. If you have trouble collecting tweets, please contact the author. Note that all Japanese tweets must be tokenized using MeCab ver. 0.996 with the ipadic dictionary. You can easily tokenize them by using the -O option. For installation, this document might be useful.
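The tokenization step can be sketched in Python as follows. This is a minimal sketch, not the authors' script: it assumes the mecab command is installed and on your PATH, that ipadic is its default dictionary, and that the intended output format is "wakati" (space-separated tokens), which the text above does not state explicitly. The function names are hypothetical.

```python
import subprocess


def build_mecab_command(output_format="wakati"):
    """Build the MeCab command line.

    Assumption: "wakati" is the output format meant by "the -O option";
    the README itself does not name the format.
    """
    return ["mecab", "-O", output_format]


def tokenize(text, command=None):
    """Tokenize one tweet by piping it through MeCab and splitting on whitespace."""
    if command is None:
        command = build_mecab_command()
    result = subprocess.run(
        command,
        input=text + "\n",
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,  # raise if MeCab exits with an error
    )
    return result.stdout.split()
```

Calling tokenize() on the text of a collected Japanese tweet returns a list of token strings. Check that the installed MeCab reports version 0.996 (mecab --version) before tokenizing.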

Data

Each folder in the dataset contains English and Japanese data.
{en,ja}_{train,test}_{nonamb,amb} contain the entity ID, entity name, fine-grained type, and coarse-grained type of each entity. Note that nonamb and amb refer to "non-homographic entities" and "homographic entities" in the paper, respectively.
{en,ja}_{train,test}_ent_{emerging,prevalent,amb,nonamb} contain files for each entity ID. Each file contains the tweet IDs of the contexts used for training and testing our proposed model. Note that emerging and prevalent refer to "emerging contexts" and "prevalent contexts" in the paper, respectively.
You can use those tweets to develop a typing model for emerging entities.
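The files described above might be read along the following lines. This is a sketch under stated assumptions: the entity files are assumed to be tab-separated with one entity per line and the four fields in the order listed, and the context files are assumed to hold one tweet ID per line. Neither layout is specified here, so adjust the delimiter and field order to the actual files. All names in the code are hypothetical.

```python
import csv
from typing import Iterable, List, NamedTuple


class Entity(NamedTuple):
    entity_id: str
    name: str
    fine_type: str
    coarse_type: str


def parse_entities(lines: Iterable[str]) -> List[Entity]:
    """Parse entity records from an iterable of lines.

    Assumption: fields are tab-separated in the order
    entity ID, entity name, fine-grained type, coarse-grained type.
    Lines that do not have exactly four fields are skipped.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [Entity(*row) for row in reader if len(row) == 4]


def parse_tweet_ids(lines: Iterable[str]) -> List[str]:
    """Parse tweet IDs, assuming one ID per line; blank lines are ignored."""
    return [line.strip() for line in lines if line.strip()]
```

Both functions accept any iterable of lines, so an open file handle can be passed directly, e.g. parse_entities(open(path, encoding="utf-8")). Tweet IDs are kept as strings so that they can be passed unchanged to whatever tool you use to collect the tweets.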

Citation

@inproceedings{akasaki2021ee,
  title     = {Fine-grained Typing of Emerging Entities in Microblogs},
  author    = {Satoshi Akasaki and Naoki Yoshinaga and Masashi Toyoda},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2021 (EMNLP2021 Findings)},
  pages     = {4667--4679},
  year      = {2021},
}