- Description:
WikiHow is a new large-scale dataset using the online WikiHow (http://www.wikihow.com/) knowledge base.
There are two features:
- text: the text of the WikiHow answers.
- headline: the bold lines, used as the summary.
There are two separate versions:
- all: the concatenation of all paragraphs as the article, with the bold lines as the reference summary.
- sep: each paragraph paired with its own summary.
Download "wikihowAll.csv" and "wikihowSep.csv" from https://github.com/mahnazkoupaee/WikiHow-Dataset and place them in the manual folder (see https://www.tensorflow.org/datasets/api_docs/python/tfds/download/DownloadConfig).
Train/validation/test splits are provided by the authors.
Preprocessing is applied to remove short articles (abstract length < 0.75 article length) and clean up extra commas.
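The preprocessing rule described above can be sketched in a few lines. This is only an illustration, not the TFDS implementation: the function names are hypothetical, lengths are assumed to be measured in characters, the parenthetical is read as the condition under which an example is kept, and the comma clean-up pattern (`.,` collapsed to `.`) is an assumption.

```python
def clean_extra_commas(text: str) -> str:
    """Collapse a period followed by a stray comma (assumed clean-up rule)."""
    return text.replace(".,", ".")


def keep_example(article: str, abstract: str, ratio: float = 0.75) -> bool:
    """Keep a pair only if both parts are non-empty and the abstract is
    shorter than `ratio` times the article (character counts, assumed)."""
    if not article or not abstract:
        return False
    return len(abstract) < ratio * len(article)
```

The `ratio` parameter is exposed only to make the 0.75 threshold from the description explicit.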
Additional Documentation: Explore on Papers With Code
Source code:
tfds.summarization.Wikihow
Versions:
1.2.0 (default): No release notes.
Download size:
5.21 MiB
Manual download instructions: This dataset requires you to download the source data manually into download_config.manual_dir (defaults to ~/tensorflow_datasets/downloads/manual/): Links to files can be found on https://github.com/mahnazkoupaee/WikiHow-Dataset. Please download both wikihowAll.csv and wikihowSep.csv.
Auto-cached (documentation): No
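As a rough sketch of the manual download step, the snippet below checks that both CSV files are present before handing the folder to TFDS. The helper names `missing_manual_files` and `load_wikihow` are hypothetical; `tfds.load`, its `download_and_prepare_kwargs` argument, and `tfds.download.DownloadConfig(manual_dir=...)` are the TFDS calls it relies on, and `tensorflow_datasets` is imported lazily so the check itself needs only the standard library.

```python
from pathlib import Path

REQUIRED_FILES = ("wikihowAll.csv", "wikihowSep.csv")
DEFAULT_MANUAL_DIR = Path.home() / "tensorflow_datasets" / "downloads" / "manual"


def missing_manual_files(manual_dir=DEFAULT_MANUAL_DIR):
    """Return the required CSV names that are not present in manual_dir."""
    manual_dir = Path(manual_dir)
    return [name for name in REQUIRED_FILES if not (manual_dir / name).is_file()]


def load_wikihow(config="all", manual_dir=DEFAULT_MANUAL_DIR):
    """Load wikihow/<config>, pointing TFDS at the manual download folder."""
    missing = missing_manual_files(manual_dir)
    if missing:
        raise FileNotFoundError(f"Missing in {manual_dir}: {', '.join(missing)}")
    # Imported here so the file check works without tensorflow-datasets installed.
    import tensorflow_datasets as tfds

    download_config = tfds.download.DownloadConfig(manual_dir=str(manual_dir))
    return tfds.load(
        f"wikihow/{config}",
        download_and_prepare_kwargs={"download_config": download_config},
    )
```

Calling `load_wikihow("sep")` would select the per-paragraph config instead of the default.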
Supervised keys (see as_supervised doc): ('text', 'headline')
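To illustrate what the supervised keys mean, the sketch below maps a feature dictionary to the (input, target) pair they name. The helper `to_supervised` and the sample record are invented for illustration; with TFDS itself, passing `as_supervised=True` to `tfds.load` yields the same pairing as tensors.

```python
SUPERVISED_KEYS = ("text", "headline")


def to_supervised(example, keys=SUPERVISED_KEYS):
    """Turn a feature dict into the (input, target) pair named by keys."""
    input_key, target_key = keys
    return example[input_key], example[target_key]


# Invented record, shaped like a wikihow/all example.
example = {
    "headline": "Preheat the oven.",
    "text": "Set the oven to 180 C and wait until it is hot.",
    "title": "How to Bake",
}
pair = to_supervised(example)
```

Here the article text is the model input and the headline is the reference summary.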
Figure (tfds.show_examples): Not supported.
Citation:
@misc{koupaee2018wikihow,
title={WikiHow: A Large Scale Text Summarization Dataset},
author={Mahnaz Koupaee and William Yang Wang},
year={2018},
eprint={1810.09305},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
wikihow/all (default config)
Config description: Use the concatenation of all paragraphs as the articles and the bold lines as the reference summaries
Dataset size:
531.56 MiB
Splits:
Split | Examples |
---|---|
'test' | 5,577 |
'train' | 157,252 |
'validation' | 5,599 |
- Feature structure:
FeaturesDict({
'headline': Text(shape=(), dtype=string),
'text': Text(shape=(), dtype=string),
'title': Text(shape=(), dtype=string),
})
- Feature documentation:
| Feature | Class | Shape | Dtype | Description |
|---|---|---|---|---|
| | FeaturesDict | | | |
| headline | Text | | string | |
| text | Text | | string | |
| title | Text | | string | |
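A small typed wrapper can make the three string features easier to work with. The sketch below is an assumption-laden illustration: `WikihowAllExample` and `decode_example` are hypothetical names, and the byte handling reflects the fact that string features usually arrive as `bytes` when a dataset is iterated through `tfds.as_numpy`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WikihowAllExample:
    headline: str
    text: str
    title: str


def decode_example(record):
    """Build a WikihowAllExample from a dict whose values are bytes or str."""

    def as_str(value):
        return value.decode("utf-8") if isinstance(value, bytes) else str(value)

    return WikihowAllExample(
        headline=as_str(record["headline"]),
        text=as_str(record["text"]),
        title=as_str(record["title"]),
    )
```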
wikihow/sep
Config description: Use each paragraph and its summary.
Dataset size:
1.07 GiB
Splits:
Split | Examples |
---|---|
'test' | 37,800 |
'train' | 1,060,732 |
'validation' | 37,932 |
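For a quick sense of how the examples are distributed, the split counts listed for both configs can be turned into totals and fractions. The counts are copied from the split tables; the helper `split_fractions` is a hypothetical name used only for this sketch.

```python
SPLITS = {
    "wikihow/all": {"train": 157_252, "validation": 5_599, "test": 5_577},
    "wikihow/sep": {"train": 1_060_732, "validation": 37_932, "test": 37_800},
}


def split_fractions(counts):
    """Return the total example count and each split's share of it."""
    total = sum(counts.values())
    return total, {name: n / total for name, n in counts.items()}


for config, counts in SPLITS.items():
    total, fractions = split_fractions(counts)
    summary = ", ".join(f"{name}={frac:.1%}" for name, frac in fractions.items())
    print(f"{config}: {total:,} examples ({summary})")
```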
- Feature structure:
FeaturesDict({
'headline': Text(shape=(), dtype=string),
'overview': Text(shape=(), dtype=string),
'sectionLabel': Text(shape=(), dtype=string),
'text': Text(shape=(), dtype=string),
'title': Text(shape=(), dtype=string),
})
- Feature documentation:
| Feature | Class | Shape | Dtype | Description |
|---|---|---|---|---|
| | FeaturesDict | | | |
| headline | Text | | string | |
| overview | Text | | string | |
| sectionLabel | Text | | string | |
| text | Text | | string | |
| title | Text | | string | |
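The relationship between the per-paragraph records of this config and article-level records can be sketched by grouping on title. This is not the procedure used by TFDS or the dataset authors: it assumes that `title` identifies an article, that records arrive in paragraph order, and that a newline is an acceptable joiner. The function name `merge_paragraphs` is hypothetical.

```python
from collections import defaultdict


def merge_paragraphs(records, joiner="\n"):
    """Group per-paragraph records by title and join their text and headlines."""
    grouped = defaultdict(lambda: {"text": [], "headline": []})
    for record in records:
        grouped[record["title"]]["text"].append(record["text"])
        grouped[record["title"]]["headline"].append(record["headline"])
    # Dict insertion order keeps articles in first-seen order.
    return [
        {
            "title": title,
            "text": joiner.join(parts["text"]),
            "headline": joiner.join(parts["headline"]),
        }
        for title, parts in grouped.items()
    ]
```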