Description:
Reddit dataset, where TIFU denotes the name of the subreddit /r/tifu. As defined in the publication, style "short" uses the title as the summary and "long" uses the tldr as the summary.
Features include:
- documents: post text without tldr.
- tldr: tldr line.
- title: trimmed title without tldr.
- ups: upvotes.
- score: score.
- num_comments: number of comments.
- upvote_ratio: upvote ratio.
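The "short" and "long" styles differ only in which field serves as the summary. The following is a minimal sketch of that selection, assuming an example is a plain Python dict keyed by the feature names `documents`, `title` and `tldr`. The helper name `to_pair` and the sample record are hypothetical and not part of TFDS or the dataset.

```python
from typing import Optional, Tuple

# Summary field per style, as described above: "short" -> title, "long" -> tldr.
SUMMARY_KEY = {"short": "title", "long": "tldr"}


def to_pair(example: dict, style: str = "short") -> Optional[Tuple[str, str]]:
    """Return (document, summary) for the given style, or None if either is empty.

    Hypothetical helper for illustration only; it is not a TFDS function.
    """
    if style not in SUMMARY_KEY:
        raise ValueError(f"unknown style: {style!r}")
    document = example.get("documents", "").strip()
    summary = example.get(SUMMARY_KEY[style], "").strip()
    if not document or not summary:
        return None
    return document, summary


# Made-up record, not taken from the dataset.
post = {
    "documents": "I left my keys inside the car.",
    "title": "locking myself out",
    "tldr": "",
}
short_pair = to_pair(post, "short")  # uses the title as the summary
long_pair = to_pair(post, "long")    # None, because this record has no tldr
```

Dropping pairs with an empty side is a design choice of this sketch; it explains why a post without a tldr line can contribute to the "short" style but not to the "long" one.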
Additional Documentation: Explore on Papers With Code
Homepage: https://github.com/ctr4si/MMN
Source code:
tfds.datasets.reddit_tifu.Builder
Versions:
- 1.1.0: Remove empty document and summary strings.
- 1.1.1: Add train, dev and test (80/10/10) splits which are used in PEGASUS (https://arxiv.org/abs/1912.08777) in a separate config. These were created randomly using the tfds split function and are being released to ensure that results on Reddit Tifu Long are reproducible and comparable. Also add 'id' to the datapoints.
- 1.1.2 (default): Corrected splits uploaded.
Feature structure:
FeaturesDict({
'documents': Text(shape=(), dtype=string),
'id': Text(shape=(), dtype=string),
'num_comments': float32,
'score': float32,
'title': Text(shape=(), dtype=string),
'tldr': Text(shape=(), dtype=string),
'ups': float32,
'upvote_ratio': float32,
})
Feature documentation:
Feature | Class | Shape | Dtype | Description |
---|---|---|---|---|
 | FeaturesDict | | | |
documents | Text | | string | |
id | Text | | string | |
num_comments | Tensor | | float32 | |
score | Tensor | | float32 | |
title | Text | | string | |
tldr | Text | | string | |
ups | Tensor | | float32 | |
upvote_ratio | Tensor | | float32 | |
Figure (tfds.show_examples): Not supported.
Citation:
@misc{kim2018abstractive,
title={Abstractive Summarization of Reddit Posts with Multi-level Memory Networks},
author={Byeongchang Kim and Hyunwoo Kim and Gunhee Kim},
year={2018},
eprint={1811.00783},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
reddit_tifu/short (default config)
Config description: Using title as summary.
Download size:
639.54 MiB
Dataset size:
141.46 MiB
Auto-cached (documentation): Only when shuffle_files=False (train)
Splits:
Split | Examples |
---|---|
'train' | 79,740 |
Supervised keys (See as_supervised doc): ('documents', 'title')
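As a hedged illustration of how the supervised keys are used, the sketch below loads a config as (input, target) tuples through `tfds.load` with `as_supervised=True`. The helper names `dataset_name` and `load_pairs` and the `SUPERVISED_KEYS` table are assumptions made for this example; only the config names and key pairs come from this page. The TFDS import is deferred to call time, and the first call downloads the data.

```python
# Supervised (input, target) keys per config, copied from this page.
SUPERVISED_KEYS = {
    "short": ("documents", "title"),
    "long": ("documents", "tldr"),
    "long_split": ("documents", "tldr"),
}


def dataset_name(config: str) -> str:
    """Build the TFDS name string, e.g. 'reddit_tifu/short'."""
    if config not in SUPERVISED_KEYS:
        raise ValueError(f"unknown config: {config!r}")
    return f"reddit_tifu/{config}"


def load_pairs(config: str = "short", split: str = "train"):
    """Load one split as (input, target) tuples; downloads data on first use."""
    # Imported lazily so the helpers above work without TensorFlow installed.
    import tensorflow_datasets as tfds

    return tfds.load(dataset_name(config), split=split, as_supervised=True)
```

With the "short" config, iterating over `load_pairs("short")` would yield pairs of string tensors in the order (documents, title).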
reddit_tifu/long
Config description: Using TLDR as summary.
Download size:
639.54 MiB
Dataset size:
93.10 MiB
Auto-cached (documentation): Yes
Splits:
Split | Examples |
---|---|
'train' | 42,139 |
Supervised keys (See as_supervised doc): ('documents', 'tldr')
reddit_tifu/long_split
Config description: Using TLDR as summary and returning train/test/dev splits.
Download size:
639.94 MiB
Dataset size:
93.10 MiB
Auto-cached (documentation): Yes
Splits:
Split | Examples |
---|---|
'test' | 4,214 |
'train' | 33,711 |
'validation' | 4,214 |
Supervised keys (See as_supervised doc): ('documents', 'tldr')
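The split sizes of reddit_tifu/long_split are consistent with an 80/10/10 partition of the 42,139 examples in reddit_tifu/long. The arithmetic sketch below assumes the boundaries sit at the nearest integer to 80% and 90% of the total. That rounding rule is an assumption made here to reproduce the counts, not a description of how the released splits were generated, and the function name `split_sizes` is hypothetical.

```python
def split_sizes(total: int, percents=(80, 10, 10)):
    """Sizes of consecutive slices whose boundaries are rounded to the nearest index."""
    sizes = []
    start = 0
    cumulative = 0  # running percentage, kept as an integer
    for percent in percents:
        cumulative += percent
        end = round(total * cumulative / 100)
        sizes.append(end - start)
        start = end
    return sizes


sizes = split_sizes(42_139)
print(sizes)  # [33711, 4214, 4214]
```

The three sizes sum to 42,139, matching the 'train', 'validation' and 'test' counts listed for this config.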