- Description:
This corpus contains preprocessed posts from the Reddit dataset. The dataset consists of 3,848,330 posts with an average length of 270 words for content, and 28 words for the summary.
Features include the strings author, body, normalizedBody, content, summary, subreddit, and subreddit_id. Content is used as the document and summary is used as its summary.
Additional Documentation: Explore on Papers With Code
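The average lengths quoted above can be illustrated with a small sketch. It assumes words are counted by splitting on whitespace, which the description does not state, and it uses two hypothetical in-memory records rather than real posts from the corpus.

```python
from statistics import mean

# Hypothetical records that mimic the dataset's string features;
# these are illustrative and are not taken from the corpus.
sample_posts = [
    {"content": "I adopted a dog last year and it changed my daily routine",
     "summary": "got a dog"},
    {"content": "My laptop died during finals week",
     "summary": "laptop died at a bad time"},
]


def average_word_counts(posts):
    """Return (mean content words, mean summary words).

    Assumption: a "word" is any whitespace-separated token.
    """
    content_lengths = [len(post["content"].split()) for post in posts]
    summary_lengths = [len(post["summary"].split()) for post in posts]
    return mean(content_lengths), mean(summary_lengths)


avg_content, avg_summary = average_word_counts(sample_posts)
print(f"content: {avg_content:.1f} words, summary: {avg_summary:.1f} words")
```

Running the same function over the full train split would be one way to check the reported figures, although a different tokenization could give slightly different numbers.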
Source code:
tfds.datasets.reddit.Builder
Versions:
1.0.0 (default): No release notes.
Download size:
2.93 GiB
Dataset size:
18.09 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'train' | 3,848,330 |
- Feature structure:
FeaturesDict({
'author': string,
'body': string,
'content': string,
'id': string,
'normalizedBody': string,
'subreddit': string,
'subreddit_id': string,
'summary': string,
})
- Feature documentation:
Feature | Class | Shape | Dtype | Description |
---|---|---|---|---|
FeaturesDict | | | | |
author | Tensor | | string | |
body | Tensor | | string | |
content | Tensor | | string | |
id | Tensor | | string | |
normalizedBody | Tensor | | string | |
subreddit | Tensor | | string | |
subreddit_id | Tensor | | string | |
summary | Tensor | | string | |
Supervised keys (See as_supervised doc): ('content', 'summary')
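With these supervised keys, loading with `as_supervised=True` yields `(content, summary)` tuples instead of the full feature dictionary. The sketch below assumes `tensorflow_datasets` is installed and that the dataset is registered under the name `reddit`. The helper names `load_reddit_pairs`, `decode_pair`, and `preview` are hypothetical and not part of the library. Calling the loader triggers the 2.93 GiB download on first use.

```python
def load_reddit_pairs(split="train[:1%]"):
    """Load (content, summary) pairs as a tf.data.Dataset.

    Assumption: the dataset name is "reddit". The import is done lazily
    so that defining this function does not require TensorFlow Datasets.
    """
    import tensorflow_datasets as tfds

    return tfds.load("reddit", split=split, as_supervised=True)


def decode_pair(content, summary):
    """Decode one (content, summary) pair of UTF-8 byte strings to str."""
    return content.decode("utf-8"), summary.decode("utf-8")


def preview(n=2):
    """Print the first n summaries next to the start of their documents."""
    pairs = load_reddit_pairs().take(n).as_numpy_iterator()
    for content, summary in pairs:
        document, tldr = decode_pair(content, summary)
        print(tldr, "<-", document[:80])
```

String features arrive as byte strings when iterated as NumPy values, which is why `decode_pair` is applied before printing.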
Figure (tfds.show_examples): Not supported.
- Citation:
@inproceedings{volske-etal-2017-tl,
title = "{TL};{DR}: Mining {R}eddit to Learn Automatic Summarization",
author = {V{\"o}lske, Michael and
Potthast, Martin and
Syed, Shahbaz and
Stein, Benno},
booktitle = "Proceedings of the Workshop on New Frontiers in Summarization",
month = sep,
year = "2017",
address = "Copenhagen, Denmark",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/W17-4508",
doi = "10.18653/v1/W17-4508",
pages = "59--63",
abstract = "Recent advances in automatic text summarization have used deep neural networks to generate high-quality abstractive summaries, but the performance of these models strongly depends on large amounts of suitable training data. We propose a new method for mining social media for author-provided summaries, taking advantage of the common practice of appending a {``}TL;DR{''} to long posts. A case study using a large Reddit crawl yields the Webis-TLDR-17 dataset, complementing existing corpora primarily from the news genre. Our technique is likely applicable to other social media sites and general web crawls.",
}