Loads the federated Stack Overflow dataset.
tff.simulation.datasets.stackoverflow.load_data(
cache_dir=None
)
Downloads and caches the dataset locally. If previously downloaded, tries to load the dataset from cache.
This dataset is derived from the Stack Overflow Data hosted by kaggle.com and available to query through Kernels using the BigQuery API: https://www.kaggle.com/stackoverflow/stackoverflow

The Stack Overflow Data is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.
The data consists of the body text of all questions and answers. The bodies were parsed into sentences, and any user with fewer than 100 sentences was expunged from the data. Minimal preprocessing was performed as follows:
- Lowercase the text,
- Unescape HTML symbols,
- Remove non-ascii symbols,
- Separate punctuation as individual tokens (except apostrophes and hyphens),
- Remove extraneous whitespace,
- Replace URLs with a special token.
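
The preprocessing steps above could be approximated as in the following sketch. It is not the pipeline that produced the dataset: the order of the steps, the URL pattern, and the placeholder `URL_TOKEN` are all assumptions made for illustration, since the exact token and regular expressions are not specified here.

```python
import html
import re
import string

# Hypothetical placeholder; the actual special token is not named in this page.
URL_TOKEN = "urltoken"

# Assumed URL pattern: anything starting with http:// or https:// up to whitespace.
_URL_RE = re.compile(r"https?://\S+")

# Every ASCII punctuation character except apostrophes and hyphens.
_PUNCT = "".join(c for c in string.punctuation if c not in "'-")
_PUNCT_RE = re.compile("([" + re.escape(_PUNCT) + "])")


def preprocess(text):
    """Apply the listed normalization steps to one sentence (illustrative only)."""
    text = text.lower()                                   # lowercase
    text = html.unescape(text)                            # unescape HTML symbols
    text = _URL_RE.sub(" " + URL_TOKEN + " ", text)       # replace URLs
    text = text.encode("ascii", "ignore").decode("ascii") # drop non-ASCII symbols
    text = _PUNCT_RE.sub(r" \1 ", text)                   # punctuation as separate tokens
    return " ".join(text.split())                         # collapse extra whitespace
```

For example, `preprocess("Well-known   FACT.")` returns `"well-known fact ."`: the hyphen stays inside its token while the period becomes a token of its own.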
In addition, the following metadata is available:
- Creation date
- Question title
- Question tags
- Question score
- Type ('question' or 'answer')
The data is divided into three sets:
- Train: Data before 2018-01-01 UTC except the held-out users. 342,477 unique users with 135,818,730 examples.
- Held-out: All examples from users with user_id % 10 == 0 (all dates). 38,758 unique users with 16,491,230 examples.
- Test: All examples after 2018-01-01 UTC except from held-out users. 204,088 unique users with 16,586,035 examples.
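
The split rule can be sketched as a small function. This is only an illustration of the rule as described above, not the code used to build the dataset; the function name `assign_split` is hypothetical, and placing an example stamped exactly at the cutoff into the test set is an assumption, because the description only says "before" and "after".

```python
from datetime import datetime, timezone

# Cutoff taken from the split description: 2018-01-01 UTC.
CUTOFF = datetime(2018, 1, 1, tzinfo=timezone.utc)


def assign_split(user_id, created):
    """Return 'held_out', 'train', or 'test' for one example (illustrative only)."""
    # Held-out users are selected by ID alone, regardless of date.
    if user_id % 10 == 0:
        return "held_out"
    # Assumption: an example created exactly at the cutoff counts as test.
    return "train" if created < CUTOFF else "test"
```

Because the train and test sets are separated by date rather than by user, the same non-held-out user can contribute examples to both.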
The tf.data.Datasets returned by tff.simulation.datasets.ClientData.create_tf_dataset_for_client will yield collections.OrderedDict objects at each iteration, with the following keys and values, in lexicographic order by key:
- 'creation_date': a tf.Tensor with dtype=tf.string and shape [] containing the date/time of the question or answer in UTC format.
- 'score': a tf.Tensor with dtype=tf.int64 and shape [] containing the score of the question.
- 'tags': a tf.Tensor with dtype=tf.string and shape [] containing the tags of the question, separated by '|' characters.
- 'title': a tf.Tensor with dtype=tf.string and shape [] containing the title of the question.
- 'tokens': a tf.Tensor with dtype=tf.string and shape [] containing the tokens of the question/answer, separated by space (' ') characters.
- 'type': a tf.Tensor with dtype=tf.string and shape [] containing either the string 'question' or 'answer'.
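
Because tags and tokens arrive as single delimited strings, a consumer usually has to split them. The sketch below shows one way to do that on an example whose tensors have already been converted to plain Python or NumPy values (for instance via `tf.data.Dataset.as_numpy_iterator()`, which yields byte strings for `tf.string`). The helper name `parse_example` is hypothetical and not part of the library.

```python
def parse_example(example):
    """Convert one example (bytes/str/int values) into Python-native fields."""

    def as_text(value):
        # tf.string values surface as bytes once converted out of tensors.
        return value.decode("utf-8") if isinstance(value, bytes) else value

    return {
        "creation_date": as_text(example["creation_date"]),
        "score": int(example["score"]),
        # Tags are separated by '|'; drop empty pieces if the field is empty.
        "tags": [t for t in as_text(example["tags"]).split("|") if t],
        "title": as_text(example["title"]),
        # Tokens are separated by single spaces.
        "tokens": as_text(example["tokens"]).split(" "),
        "type": as_text(example["type"]),
    }
```

The creation date is left as a string here, since its exact format is not spelled out beyond being in UTC.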
Args | |
---|---|
cache_dir | (Optional) Directory to cache the downloaded file. If None, caches in Keras' default cache directory. |
Returns | |
---|---|
Tuple of (train, held_out, test) where the tuple elements are tff.simulation.datasets.ClientData objects. |
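
A minimal usage sketch follows, assuming `tensorflow_federated` is installed and imported as `tff`. The helper names `peek_first_client` and `load_and_peek` are hypothetical. The import is deferred into `load_and_peek` because the first call to `load_data` downloads the dataset, which may take considerable time and disk space.

```python
import itertools


def peek_first_client(client_data, num_examples=2):
    """Return (client_id, first few examples) for the first client."""
    client_id = client_data.client_ids[0]
    dataset = client_data.create_tf_dataset_for_client(client_id)
    # islice stops early, so the client's full dataset is not read.
    return client_id, list(itertools.islice(iter(dataset), num_examples))


def load_and_peek(cache_dir=None):
    """Load the three splits and peek at one training client."""
    # Deferred import: only needed when the real dataset is requested.
    import tensorflow_federated as tff

    train, held_out, test = tff.simulation.datasets.stackoverflow.load_data(
        cache_dir=cache_dir
    )
    return peek_first_client(train)
```

Calling `load_and_peek()` with no arguments uses the default cache location; passing `cache_dir` points the download at a directory of your choice.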