- Description:
The SAMSum Corpus contains over 16k chat dialogues with manually annotated summaries.
There are three features:
- dialogue: text of the dialogue.
- summary: human-written summary of the dialogue.
- id: ID of an example.
Additional Documentation: Explore on Papers With Code
Homepage: https://arxiv.org/src/1911.12237v2/anc
Source code: tfds.datasets.samsum.Builder
Versions:
- 1.0.0 (default): No release notes.
Download size: Unknown size
Dataset size: 10.71 MiB
Manual download instructions: This dataset requires you to download the source data manually into download_config.manual_dir (defaults to ~/tensorflow_datasets/downloads/manual/): download https://arxiv.org/src/1911.12237v2/anc/corpus.7z, decompress it, and place train.json, val.json and test.json in the manual folder.
Auto-cached (documentation): Yes
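The manual placement step can be checked before the dataset is built. The sketch below assumes only the file names and default directory given in the instructions above and reports which of the three JSON files are still missing. The helper names `missing_manual_files` and `load_samsum` are hypothetical. `load_samsum` passes the directory to `tfds.load` through `tfds.download.DownloadConfig(manual_dir=...)` and needs `tensorflow_datasets` installed, so that import is done lazily.

```python
from pathlib import Path

# File names and default directory as stated in the manual download instructions.
REQUIRED_FILES = ("train.json", "val.json", "test.json")
DEFAULT_MANUAL_DIR = Path("~/tensorflow_datasets/downloads/manual").expanduser()


def missing_manual_files(manual_dir=DEFAULT_MANUAL_DIR):
    """Return the required file names that are not yet present in manual_dir."""
    manual_dir = Path(manual_dir).expanduser()
    return [name for name in REQUIRED_FILES if not (manual_dir / name).is_file()]


def load_samsum(manual_dir=DEFAULT_MANUAL_DIR, split="train"):
    """Hypothetical wrapper: build and load the dataset from a manual directory."""
    missing = missing_manual_files(manual_dir)
    if missing:
        raise FileNotFoundError(f"Missing in {manual_dir}: {', '.join(missing)}")

    import tensorflow_datasets as tfds  # imported lazily; only needed here

    config = tfds.download.DownloadConfig(manual_dir=str(manual_dir))
    return tfds.load(
        "samsum",
        split=split,
        download_and_prepare_kwargs={"download_config": config},
    )
```

Calling `missing_manual_files()` with no arguments checks the default location; an empty list means all three files are in place.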
Splits:
Split | Examples |
---|---|
'test' | 819 |
'train' | 14,732 |
'validation' | 818 |
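As a quick sanity check on the split sizes, the following sketch sums the counts from the table and prints each split's share of the corpus. The variable names are illustrative only.

```python
# Example counts per split, copied from the Splits table.
SPLIT_SIZES = {"train": 14_732, "validation": 818, "test": 819}

total = sum(SPLIT_SIZES.values())
shares = {name: count / total for name, count in SPLIT_SIZES.items()}

print(f"total examples: {total}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The total comes to 16,369 examples, consistent with the "over 16k" figure in the description, and the split is roughly 90% train, 5% validation and 5% test.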
- Feature structure:
FeaturesDict({
'dialogue': Text(shape=(), dtype=string),
'id': Text(shape=(), dtype=string),
'summary': Text(shape=(), dtype=string),
})
- Feature documentation:
Feature | Class | Shape | Dtype | Description |
---|---|---|---|---|
 | FeaturesDict | | | |
dialogue | Text | | string | |
id | Text | | string | |
summary | Text | | string | |
Supervised keys (See as_supervised doc): ('dialogue', 'summary')
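The supervised keys name which features form the (input, target) pair: the dialogue is the input and the summary is the target. The plain-Python sketch below illustrates that mapping on a single feature dictionary. The function `to_supervised` and the sample record are made up for illustration and are not part of the library or the corpus.

```python
from typing import Dict, Tuple

# Feature names and supervised keys as listed in this entry.
FEATURE_NAMES = ("dialogue", "id", "summary")
SUPERVISED_KEYS = ("dialogue", "summary")


def to_supervised(example: Dict[str, str]) -> Tuple[str, str]:
    """Map a full feature dict to an (input, target) pair."""
    missing = [name for name in FEATURE_NAMES if name not in example]
    if missing:
        raise KeyError(f"example is missing features: {missing}")
    input_key, target_key = SUPERVISED_KEYS
    return example[input_key], example[target_key]


# Made-up record for illustration only; not taken from the corpus.
sample = {
    "id": "example-0",
    "dialogue": "A: Lunch at noon?\nB: Sure, see you then.",
    "summary": "A and B will have lunch at noon.",
}
dialogue, summary = to_supervised(sample)
```

Note that the `id` feature is dropped in the supervised view, so keep the full dictionary if examples need to be traced back to the source files.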
Figure (tfds.show_examples): Not supported.
- Citation:
@article{gliwa2019samsum,
title={SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization},
author={Gliwa, Bogdan and Mochol, Iwona and Biesek, Maciej and Wawer, Aleksander},
journal={arXiv preprint arXiv:1911.12237},
year={2019}
}