reddit_tifu

Description:

Reddit dataset, where TIFU denotes the name of the subreddit /r/tifu. As defined in the publication, the "short" style uses the title as the summary and the "long" style uses the tldr as the summary.

Features include:

documents: post text without tldr.
tldr: tldr line.
title: trimmed title without tldr.
ups: upvotes.
score: score.
num_comments: number of comments.
upvote_ratio: upvote ratio.
Additional Documentation: Explore on Papers With Code
Homepage: https://github.com/ctr4si/MMN
Source code: tfds.datasets.reddit_tifu.Builder
Versions:
- 1.1.0: Remove empty document and summary strings.
- 1.1.1: Add train, dev and test (80/10/10) splits, which are used in PEGASUS (https://arxiv.org/abs/1912.08777), in a separate config. These were created randomly using the tfds split function and are being released to ensure that results on Reddit Tifu Long are reproducible and comparable. Also add id to the datapoints.
- 1.1.2 (default): Corrected splits uploaded.
Feature structure:

FeaturesDict({
    'documents': Text(shape=(), dtype=string),
    'id': Text(shape=(), dtype=string),
    'num_comments': float32,
    'score': float32,
    'title': Text(shape=(), dtype=string),
    'tldr': Text(shape=(), dtype=string),
    'ups': float32,
    'upvote_ratio': float32,
})
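
As a rough illustration of how the two summary styles map onto this structure, the sketch below picks the summary field for a given config from a plain Python dict shaped like one example. The helper `to_pair`, the `SUMMARY_FIELD` table, and the sample record are hypothetical and not part of TFDS; only the feature keys and config names are taken from this page.

```python
# Hypothetical helper, not part of TFDS: choose the summary field per config.
SUMMARY_FIELD = {
    "short": "title",       # title used as summary
    "long": "tldr",         # tldr used as summary
    "long_split": "tldr",   # same as long, with train/validation/test splits
}


def to_pair(example, config="short"):
    """Return (document, summary) for one example dict."""
    return example["documents"], example[SUMMARY_FIELD[config]]


# Made-up record that uses the feature keys listed above.
sample = {
    "documents": "post body",
    "id": "abc123",
    "title": "a title",
    "tldr": "a tldr",
    "num_comments": 3.0,
    "score": 10.0,
    "ups": 10.0,
    "upvote_ratio": 0.9,
}

short_pair = to_pair(sample, "short")
long_pair = to_pair(sample, "long")
```

This mirrors the supervised keys reported for each config; it operates on plain dicts only and does not touch the real dataset.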

Feature documentation:

Feature	Class	Dtype
	FeaturesDict
documents	Text	string
id	Text	string
num_comments	Tensor	float32
score	Tensor	float32
title	Text	string
tldr	Text	string
ups	Tensor	float32
upvote_ratio	Tensor	float32

Figure (tfds.show_examples): Not supported.
Citation:

@misc{kim2018abstractive,
    title={Abstractive Summarization of Reddit Posts with Multi-level Memory Networks},
    author={Byeongchang Kim and Hyunwoo Kim and Gunhee Kim},
    year={2018},
    eprint={1811.00783},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

reddit_tifu/short (default config)

Config description: Using title as summary.
Download size: 639.54 MiB
Dataset size: 141.46 MiB
Auto-cached (documentation): Only when shuffle_files=False (train)
Splits:

Split	Examples
`'train'`	79,740

Supervised keys (See as_supervised doc): ('documents', 'title')
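
A minimal sketch of loading a config with its supervised keys, assuming `tensorflow_datasets` is installed and the download can complete. The helpers `dataset_name` and `load_pairs` are hypothetical names introduced here; `tfds.load` and its `split` and `as_supervised` arguments are the real TFDS API.

```python
KNOWN_CONFIGS = ("short", "long", "long_split")  # configs listed on this page


def dataset_name(config="short"):
    """Build the 'reddit_tifu/<config>' name string passed to tfds.load."""
    if config not in KNOWN_CONFIGS:
        raise ValueError(f"unknown reddit_tifu config: {config!r}")
    return f"reddit_tifu/{config}"


def load_pairs(config="short", split="train"):
    """Load (document, summary) tuples; triggers a download on first use."""
    import tensorflow_datasets as tfds  # imported lazily: requires TFDS

    return tfds.load(dataset_name(config), split=split, as_supervised=True)
```

With `as_supervised=True`, `load_pairs("short")` should yield `(documents, title)` tuples, and the `long` configs should yield `(documents, tldr)` tuples, following the supervised keys listed for each config.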

reddit_tifu/long

Config description: Using TLDR as summary.
Download size: 639.54 MiB
Dataset size: 93.10 MiB
Auto-cached (documentation): Yes
Splits:

Split	Examples
`'train'`	42,139

Supervised keys (See as_supervised doc): ('documents', 'tldr')

reddit_tifu/long_split

Config description: Using TLDR as summary and return train/test/dev splits.
Download size: 639.94 MiB
Dataset size: 93.10 MiB
Auto-cached (documentation): Yes
Splits:

Split	Examples
`'test'`	4,214
`'train'`	33,711
`'validation'`	4,214
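
The split sizes can be sanity-checked against the 80/10/10 partition described in the version notes. The short check below uses only the counts quoted on this page (42,139 examples in the long config) and makes no claim about how the splits were actually generated.

```python
# Counts copied from the split tables on this page.
LONG_TOTAL = 42_139
SPLITS = {"train": 33_711, "validation": 4_214, "test": 4_214}

# The three splits should add up to the size of the long config.
total = sum(SPLITS.values())

# Fraction of the long config that falls in each split.
fractions = {name: count / LONG_TOTAL for name, count in SPLITS.items()}

for name, frac in fractions.items():
    print(f"{name}: {frac:.4f}")
```

The three counts sum to the full long config, and each fraction rounds to the stated 80%, 10%, and 10%.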

Supervised keys (See as_supervised doc): ('documents', 'tldr')