- Description:
ASSET is a dataset for evaluating Sentence Simplification systems with multiple rewriting transformations, as described in "ASSET: A Dataset for Tuning and Evaluation of Sentence Simplification Models with Multiple Rewriting Transformations." The corpus is composed of 2000 validation and 359 test original sentences that were each simplified 10 times by different annotators. The corpus also contains human judgments of meaning preservation, fluency and simplicity for the outputs of several automatic text simplification systems.
Additional Documentation: Explore on Papers With Code
Source code:
tfds.datasets.asset.Builder
Versions:
1.0.0 (default): Initial release.
Download size:
3.47 MiB
Auto-cached (documentation): Yes
Supervised keys (See as_supervised doc): None
Figure (tfds.show_examples): Not supported.
Citation:
@inproceedings{alva-manchego-etal-2020-asset,
title = "{ASSET}: {A} Dataset for Tuning and Evaluation of Sentence Simplification Models with Multiple Rewriting Transformations",
author = "Alva-Manchego, Fernando and
Martin, Louis and
Bordes, Antoine and
Scarton, Carolina and
Sagot, Benoit and
Specia, Lucia",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-main.424",
pages = "4668--4679",
}
asset/simplification (default config)
Config description: A set of original sentences aligned with 10 possible simplifications for each.
Dataset size:
2.64 MiB
Splits:
Split | Examples |
---|---|
'test' | 359 |
'validation' | 2,000 |
- Feature structure:
FeaturesDict({
'original': Text(shape=(), dtype=string),
'simplifications': Sequence(Text(shape=(), dtype=string)),
})
- Feature documentation:
Feature | Class | Shape | Dtype | Description |
---|---|---|---|---|
 | FeaturesDict | | | |
original | Text | | string | |
simplifications | Sequence(Text) | (None,) | string | |
- Examples (tfds.as_dataframe):
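A minimal sketch of how records from this config might be inspected, assuming `tensorflow_datasets` is installed and the data can be downloaded. The helper names (`to_pairs`, `preview`, `iter_pairs`) are hypothetical and not part of TFDS; only `tfds.load`, `tfds.as_numpy` and `tfds.as_dataframe` are library calls.

```python
from typing import Iterator, List, Mapping, Tuple


def _to_str(value) -> str:
    # tfds.as_numpy yields bytes for Text features; decode them as UTF-8.
    return value.decode("utf-8") if isinstance(value, bytes) else str(value)


def to_pairs(example: Mapping) -> List[Tuple[str, str]]:
    """Expand one record into (original, simplification) pairs (hypothetical helper)."""
    original = _to_str(example["original"])
    return [(original, _to_str(s)) for s in example["simplifications"]]


def preview(split: str = "validation", n: int = 4):
    """Return the first n records of a split as a pandas DataFrame."""
    # Imported lazily so the helpers above work without TFDS installed.
    import tensorflow_datasets as tfds

    ds, info = tfds.load("asset/simplification", split=split, with_info=True)
    return tfds.as_dataframe(ds.take(n), info)


def iter_pairs(split: str = "test") -> Iterator[Tuple[str, str]]:
    """Yield every (original, simplification) pair in a split."""
    import tensorflow_datasets as tfds

    ds = tfds.load("asset/simplification", split=split)
    for example in tfds.as_numpy(ds):
        yield from to_pairs(example)
```

Flattening each record into pairs is one possible layout for reference-based evaluation; keeping the simplifications grouped per original is equally valid when a metric expects multiple references per sentence.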
asset/ratings
Config description: Human ratings of automatically produced text simplification.
Dataset size:
1.44 MiB
Splits:
Split | Examples |
---|---|
'full' | 4,500 |
- Feature structure:
FeaturesDict({
'aspect': ClassLabel(shape=(), dtype=int64, num_classes=3),
'original': Text(shape=(), dtype=string),
'original_sentence_id': int32,
'rating': int32,
'simplification': Text(shape=(), dtype=string),
'worker_id': int32,
})
- Feature documentation:
Feature | Class | Shape | Dtype | Description |
---|---|---|---|---|
 | FeaturesDict | | | |
aspect | ClassLabel | | int64 | |
original | Text | | string | |
original_sentence_id | Tensor | | int32 | |
rating | Tensor | | int32 | |
simplification | Text | | string | |
worker_id | Tensor | | int32 | |
- Examples (tfds.as_dataframe):
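As a rough sketch of how the ratings might be summarized, the snippet below averages `rating` per `aspect`. It assumes `tensorflow_datasets` is installed and the data can be downloaded; `mean_rating_by_aspect` and `load_mean_ratings` are hypothetical helper names, and the aspect label names are read from the dataset metadata at run time rather than assumed here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def mean_rating_by_aspect(records: Iterable[Mapping]) -> Dict[int, float]:
    """Average the 'rating' field for each integer 'aspect' id (hypothetical helper)."""
    totals = defaultdict(lambda: [0, 0])  # aspect id -> [sum of ratings, count]
    for record in records:
        entry = totals[int(record["aspect"])]
        entry[0] += int(record["rating"])
        entry[1] += 1
    return {aspect: total / count for aspect, (total, count) in totals.items()}


def load_mean_ratings() -> Dict[str, float]:
    """Load the 'full' split and return mean ratings keyed by aspect name."""
    # Imported lazily so the helper above works without TFDS installed.
    import tensorflow_datasets as tfds

    ds, info = tfds.load("asset/ratings", split="full", with_info=True)
    means = mean_rating_by_aspect(tfds.as_numpy(ds))
    # ClassLabel.int2str maps the stored int64 id back to its label name.
    to_name = info.features["aspect"].int2str
    return {to_name(aspect): mean for aspect, mean in means.items()}
```

A plain per-aspect mean ignores differences between annotators; grouping by `worker_id` first would be one way to account for that.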