Cleaned AE-pub Dataset

The AliExpress dataset is one of the datasets used to measure the performance of attribute extraction models on real-world data. It consists of 110,484 tuples of <product title, attribute, value>. We manually inspected the tuples in the dataset and found quality issues: some titles, attributes, and values contain HTML entities and extra whitespace, and the same attribute sometimes appears in different letter cases. Furthermore, the dataset is not split into train/dev/test sets. For researchers who use the dataset, this is an obstacle to comparing models. To solve these problems, we release a script that canonicalizes the tuples in the AliExpress dataset and splits the preprocessed tuples into train/dev/test sets with a ratio of 7:1:2. By using the script, you can reconstruct the Cleaned AE-pub dataset that we used for the experiments in our paper.
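
To make the cleaning steps concrete, here is a minimal Python sketch of the kind of canonicalization and 7:1:2 split described above. It is not the released script. The function names (`canonicalize`, `canonicalize_tuple`, `split_7_1_2`), the choice to lowercase attribute names, the fixed random seed, and splitting at the level of individual tuples are all assumptions made for illustration. Use the released script to reproduce the actual dataset.

```python
import html
import random
import re


def canonicalize(text):
    """Decode HTML entities and collapse runs of whitespace into one space."""
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip()


def canonicalize_tuple(title, attribute, value):
    """Clean one <product title, attribute, value> tuple.

    Lowercasing the attribute is an assumption about how differing
    letter cases are unified; the released script may do it differently.
    """
    return (
        canonicalize(title),
        canonicalize(attribute).lower(),
        canonicalize(value),
    )


def split_7_1_2(items, seed=0):
    """Shuffle items and split them into train/dev/test at roughly 7:1:2.

    The seed and the tuple-level split are illustrative assumptions.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 7 // 10
    n_dev = len(items) // 10
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]
    return train, dev, test
```

With this sketch, a title such as `Men&#39;s   Shoes` is decoded and its extra spaces are collapsed, and attribute names that differ only in letter case map to the same lowercase string.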

How to Construct

Download the original AliExpress dataset.
Download our script.
Run the command below. train.tsv, dev.tsv and test.tsv will be created in the current directory.


	cat publish_data.txt | python3 script/create-dataset.py
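
After the command finishes, a quick sanity check is to compare the sizes of the three output files against the 7:1:2 ratio. The sketch below assumes each file holds one record per line, which is an assumption about the output format. The helper names `count_lines` and `split_shares` are hypothetical and not part of the released script.

```python
from pathlib import Path


def count_lines(path):
    """Count the lines in a text file (assumed: one record per line)."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def split_shares(directory="."):
    """Return the share of lines held by train.tsv, dev.tsv and test.tsv."""
    counts = {
        name: count_lines(Path(directory) / f"{name}.tsv")
        for name in ("train", "dev", "test")
    }
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: n / total for name, n in counts.items()}


if __name__ == "__main__":
    for name, share in split_shares().items():
        print(f"{name}: {share:.3f}")
```

If the files follow the stated ratio, the printed shares should be close to 0.7, 0.1 and 0.2.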

Citation

      
@inproceedings{shinzato-etal-2022-simple,
    title = "Simple and Effective Knowledge-Driven Query Expansion for {QA}-Based Product Attribute Extraction",
    author = "Shinzato, Keiji and Yoshinaga, Naoki and Xia, Yandi and Chen, Wei-Te",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-short.25",
    pages = "227--234",
}