Emerging Entity Detection Dataset

Keeping up to date on the emerging entities that appear every day is indispensable for various applications, such as social-trend analysis and marketing research. Previous studies have treated unseen entities, those not registered in a particular knowledge base, as emerging entities. As a result, they also pick up non-emerging entities, because the absence of an entity from a knowledge base does not guarantee that it is emerging. We therefore introduce a novel task of discovering truly emerging entities when they have just been introduced to the public through microblogs, and propose an effective method based on time-sensitive distant supervision, which exploits the distinctive early-stage contexts of emerging entities.


Our IJCAI paper has errata in the descriptions of the dataset used for evaluation. Please download the corrected version from arXiv.

Dataset (download)

Since we provide only the IDs of the tweets used in our experiments, you need to collect the corresponding tweets using those IDs. If you have trouble collecting the tweets, please contact the author. Note that all tweets must be tokenized using MeCab ver. 0.996 with the ipadic dictionary. Tokenization is easy with MeCab's -O option. For installation, this document might be useful.
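
The preparation steps described above (read the tweet IDs, fetch the tweets in batches, tokenize with MeCab) could be sketched roughly as follows. This is only an illustration under several assumptions that are not stated on this page: that each ID file holds one tweet ID per line, that a batch size of 100 suits whatever lookup service you use, and that `wakati` (space-separated tokens) is the intended value for the -O option. The helper names `read_tweet_ids`, `batches`, `mecab_command`, and `tokenize` are hypothetical, and fetching the tweets themselves is left out because it depends on your API access.

```python
import subprocess
from typing import Iterable, Iterator, List


def read_tweet_ids(lines: Iterable[str]) -> List[str]:
    """Collect tweet IDs, skipping blank lines and duplicates.

    Assumption: one ID per line; adapt this if the files are laid out differently.
    """
    seen = set()
    ids = []
    for line in lines:
        tweet_id = line.strip()
        if tweet_id and tweet_id not in seen:
            seen.add(tweet_id)
            ids.append(tweet_id)
    return ids


def batches(ids: List[str], size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive chunks of IDs (size is a hypothetical default)."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def mecab_command(output_format: str = "wakati") -> List[str]:
    """Build the MeCab command line; 'wakati' is an assumed output format."""
    return ["mecab", "-O" + output_format]


def tokenize(texts: List[str], output_format: str = "wakati") -> List[str]:
    """Tokenize texts with the mecab CLI, one text per input line.

    Requires MeCab (with ipadic) on PATH. Line breaks inside a tweet are
    replaced by spaces, assuming MeCab emits one output line per input line.
    """
    cleaned = [t.replace("\r", " ").replace("\n", " ") for t in texts]
    result = subprocess.run(
        mecab_command(output_format),
        input="\n".join(cleaned) + "\n",
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    lines = result.stdout.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after the final newline
    return lines
```

With the tweets collected, `tokenize(list_of_tweet_texts)` would return one tokenized string per tweet. It is worth confirming with `mecab --version` that the installed version is 0.996, since other versions or dictionaries may segment the text differently.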

Training data (training)

Evaluation data (precision)

Evaluation data (relative_recall)


  title     = {Early Discovery of Emerging Entities in Microblogs},
  author    = {Satoshi Akasaki and Naoki Yoshinaga and Masashi Toyoda},
  booktitle = {Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI2019)},
  pages     = {4882-4889},
  year      = {2019},