This page describes common implementation gotchas when implementing a new dataset.
## Legacy `SplitGenerator` should be avoided

The old `tfds.core.SplitGenerator` API is deprecated:
```python
def _split_generator(...):
  return [
      tfds.core.SplitGenerator(name='train', gen_kwargs={'path': train_path}),
      tfds.core.SplitGenerator(name='test', gen_kwargs={'path': test_path}),
  ]
```
Should be replaced by:
```python
def _split_generator(...):
  return {
      'train': self._generate_examples(path=train_path),
      'test': self._generate_examples(path=test_path),
  }
```
Rationale: The new API is less verbose and more explicit. The old API will be removed in a future version.
## New datasets should be self-contained in a folder

When adding a dataset inside the `tensorflow_datasets/` repository, please make sure to follow the dataset-as-folder structure (all checksums, dummy data, and implementation code self-contained in a folder).
- Old datasets (bad): `<category>/<ds_name>.py`
- New datasets (good): `<category>/<ds_name>/<ds_name>.py`
Use the TFDS CLI (`tfds new`, or `gtfds new` for googlers) to generate the template.
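For example, a self-contained dataset folder typically looks like the following (a sketch; the exact files generated may vary with the TFDS version, and `my_dataset` is a placeholder name):

```
my_dataset/
    __init__.py
    my_dataset.py        # Dataset implementation
    my_dataset_test.py   # Tests
    dummy_data/          # Fake data for testing
    checksums.tsv        # URL checksums
```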
Rationale: The old structure required absolute paths for checksums and fake data, and scattered the dataset files across many places, which made it harder to implement datasets outside the TFDS repository. For consistency, the new structure should be used everywhere now.
## Description lists should be formatted as markdown

The `DatasetInfo.description` `str` is formatted as markdown. Markdown lists require an empty line before the first item:
```python
_DESCRIPTION = """
Some text.
                      # << Empty line here !!!
1. Item 1
2. Item 2
3. Item 3
                      # << Empty line here !!!
Some other text.
"""
```
Rationale: Badly formatted descriptions create visual artifacts in our catalog documentation. Without the empty lines, the text above would be rendered as:

Some text. 1. Item 1 2. Item 2 3. Item 3 Some other text
## Forgot ClassLabel names

When using `tfds.features.ClassLabel`, try to provide the human-readable labels (`str`) with `names=` or `names_file=` (instead of `num_classes=10`).
```python
features = {
    'label': tfds.features.ClassLabel(names=['dog', 'cat', ...]),
}
```
Rationale: Human-readable labels are used in many places:

- They allow yielding `str` directly in `_generate_examples`: `yield {'label': 'dog'}`
- They are exposed to users through `info.features['label'].names` (conversion methods like `.str2int('dog')` are also available)
- They are used in the visualization utils `tfds.show_examples` and `tfds.as_dataframe`
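For illustration, a minimal sketch of how those labels surface on the user side (`'my_dataset'` and the loading code are hypothetical, not part of the original example):

```python
ds, info = tfds.load('my_dataset', with_info=True)  # 'my_dataset' is a placeholder

label = info.features['label']
print(label.names)           # ['dog', 'cat', ...]
print(label.str2int('dog'))  # e.g. 0
print(label.int2str(0))      # e.g. 'dog'
```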
## Forgot image shape

When using `tfds.features.Image` or `tfds.features.Video`, if the images have a static shape, it should be explicitly specified:
```python
features = {
    'image': tfds.features.Image(shape=(256, 256, 3)),
}
```
Rationale: It allows static shape inference (e.g. `ds.element_spec['image'].shape`), which is required for batching (batching images of unknown shape would require resizing them first).
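As a quick illustration (a sketch; `'my_dataset'` is a placeholder):

```python
ds = tfds.load('my_dataset', split='train')
print(ds.element_spec['image'].shape)  # (256, 256, 3): fully static
ds = ds.batch(32)  # OK; with shape (None, None, 3) the images would
                   # have to be resized before batching
```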
## Prefer more specific types instead of `tfds.features.Tensor`

When possible, prefer the more specific types `tfds.features.ClassLabel`, `tfds.features.BBoxFeature`, ... instead of the generic `tfds.features.Tensor`.
Rationale: In addition to being more semantically correct, specific features provide additional metadata to users and are detected by tools.
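For example, a sketch of a features dict contrasting the two (field names are illustrative):

```python
features = tfds.features.FeaturesDict({
    # Generic (avoid): no semantic meaning attached to the 4 floats
    # 'bbox': tfds.features.Tensor(shape=(4,), dtype=tf.float32),
    # Specific (prefer): normalized coordinates, understood by visualization tools
    'bbox': tfds.features.BBoxFeature(),
    'label': tfds.features.ClassLabel(names=['dog', 'cat']),
})
```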
## Lazy imports in global space

Lazy imports should not be called from the global space. For example, the following is wrong:
```python
tfds.lazy_imports.apache_beam  # << Error: import beam in the global scope

def f() -> beam.Map:
  ...
```
Rationale: Using lazy imports in the global scope would import the module for all TFDS users, defeating the purpose of lazy imports.
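Instead, resolve the lazy import inside the function that needs it, so the module is only imported on first call (a minimal sketch; the function body is illustrative):

```python
def f():
  beam = tfds.lazy_imports.apache_beam  # beam is only imported when `f` is called
  return beam.Map(lambda x: x)
```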
## Dynamically computing train/test splits

If the dataset does not provide official splits, neither should TFDS. The following should be avoided:
```python
_TRAIN_TEST_RATIO = 0.7

def _split_generator():
  ids = list(range(num_examples))
  np.random.RandomState(seed).shuffle(ids)

  # Split train/test
  num_train = int(_TRAIN_TEST_RATIO * num_examples)
  train_ids = ids[:num_train]
  test_ids = ids[num_train:]
  return {
      'train': self._generate_examples(train_ids),
      'test': self._generate_examples(test_ids),
  }
```
Rationale: TFDS tries to provide datasets as close as possible to the original data. The sub-split API should be used instead, letting users dynamically create the subsplits they want:
```python
ds_train, ds_test = tfds.load(..., split=['train[:80%]', 'train[80%:]'])
```
## Python style guide
### Prefer to use pathlib API

Instead of the `tf.io.gfile` API, it is preferable to use the pathlib API. All `dl_manager` methods return pathlib-like objects compatible with GCS, S3, ...
```python
import json

path = dl_manager.download_and_extract('http://some-website/my_data.zip')
json_path = path / 'data/file.json'
json.loads(json_path.read_text())
```
Rationale: The pathlib API is a modern object-oriented file API which removes boilerplate. Using `.read_text()` / `.read_bytes()` also guarantees the files are correctly closed.
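For comparison, a sketch of the same read written with `tf.io.gfile` (assuming `json_path` from the example above):

```python
import json
import os

import tensorflow as tf

# More boilerplate, and the file is only closed thanks to the explicit `with`:
with tf.io.gfile.GFile(os.fspath(json_path), 'r') as f:
  data = json.load(f)
```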
### If the method is not using `self`, it should be a function

If a class method is not using `self`, it should be a simple function (defined outside the class).

Rationale: It makes it explicit to the reader that the function does not have side effects, nor hidden inputs/outputs:
```python
x = f(y)  # Clear inputs/outputs

x = self.f(y)  # Does f depend on additional hidden variables? Is it stateful?
```
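As a concrete sketch of the refactor (`MyDataset` and `_parse_row` are hypothetical names):

```python
def _parse_row(row: str) -> dict:
  # No `self` needed, so this is a plain module-level function.
  name, label = row.split(',')
  return {'name': name, 'label': label}

class MyDataset(tfds.core.GeneratorBasedBuilder):

  def _generate_examples(self, path):
    for i, row in enumerate(path.read_text().splitlines()):
      yield i, _parse_row(row)
```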
### Lazy imports in Python

We lazily import big modules like TensorFlow. Lazy imports defer the actual import of the module to its first usage, so users who don't need this big module never import it. We use `etils.epy.lazy_imports`.
```python
from tensorflow_datasets.core.utils.lazy_imports_utils import tensorflow as tf
# After this statement, TensorFlow is not imported yet

...

features = tfds.features.Image(dtype=tf.uint8)
# After using it (`tf.uint8`), TensorFlow is now imported
```
Under the hood, the `LazyModule` class acts as a factory that only actually imports the module when an attribute is accessed (`__getattr__`).
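A minimal sketch of the idea (not the actual TFDS implementation):

```python
import importlib

class LazyModule:
  """Factory that defers the real import until first attribute access."""

  def __init__(self, module_name: str):
    self._module_name = module_name
    self._module = None

  def __getattr__(self, name: str):
    if self._module is None:
      self._module = importlib.import_module(self._module_name)  # real import happens here
    return getattr(self._module, name)

tf = LazyModule('tensorflow')  # nothing imported yet
# tf.uint8  # <- the actual `import tensorflow` happens on this first access
```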
You can also use it conveniently with a context manager:
```python
from etils import epy

with epy.lazy_imports(error_callback=..., success_callback=...):
  import some_big_module
```