Cleaned AE-pub Dataset

The AliExpress dataset is one of the datasets used to measure the performance of attribute extraction models on real-world data. It consists of 110,484 tuples of <product title, attribute, value>. We manually inspected the tuples in the dataset and found quality issues: some tuples contain HTML entities and extra white space in titles, attributes, and values, and the same attribute sometimes appears in different letter cases. Furthermore, the dataset is not split into train/dev/test sets, which is an obstacle to comparing models for researchers who use the dataset. To solve these problems, we release a script that canonicalizes tuples in the AliExpress dataset and splits the preprocessed tuples into train/dev/test sets with a ratio of 7:1:2. By using the script, you can reconstruct the Cleaned AE-pub dataset that we used for the experiments in our paper.
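The following is a minimal sketch of the kind of cleaning and splitting described above. It is not the released script. The function names `canonicalize` and `split_7_1_2`, the choice to lowercase attribute names, the fixed random seed, and splitting at the level of individual tuples are all assumptions made for illustration.

```python
import html
import random
import re


def canonicalize(title, attribute, value):
    """Unescape HTML entities and collapse extra white space in one tuple."""
    def clean(text):
        text = html.unescape(text)                 # e.g. "&amp;" -> "&"
        return re.sub(r"\s+", " ", text).strip()   # collapse runs of white space
    # Assumption: attribute names are lowercased to merge case variants.
    return clean(title), clean(attribute).lower(), clean(value)


def split_7_1_2(items, seed=0):
    """Shuffle items and split them into train/dev/test at a 7:1:2 ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)  # hypothetical seed, for reproducibility
    n_train = len(items) * 7 // 10
    n_dev = len(items) // 10
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]      # remainder goes to the test set
    return train, dev, test
```

In this sketch, each tuple would be passed through `canonicalize` first, and the cleaned tuples would then be passed to `split_7_1_2`. The released script may differ in what it normalizes and in how it groups tuples before splitting.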

How to Construct

Citation

      
@inproceedings{shinzato-etal-2022-simple,
    title = "Simple and Effective Knowledge-Driven Query Expansion for {QA}-Based Product Attribute Extraction",
    author = "Shinzato, Keiji and Yoshinaga, Naoki and Xia, Yandi and Chen, Wei-Te",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-short.25",
    pages = "227--234",
}