tf.data.TFRecordDataset

A Dataset comprising records from one or more TFRecord files.

Inherits From: Dataset

tf.data.TFRecordDataset(
    filenames,
    compression_type=None,
    buffer_size=None,
    num_parallel_reads=None,
    name=None
)

This dataset loads TFRecords from the files as bytes, exactly as they were written. `TFRecordDataset` does not do any parsing or decoding on its own. Parsing and decoding can be done by applying `Dataset.map` transformations after the `TFRecordDataset`.

A minimal example is given below:

import os
import tempfile

import numpy as np
import tensorflow as tf

example_path = os.path.join(tempfile.gettempdir(), "example.tfrecords")
np.random.seed(0)

# Write the records to a file.
with tf.io.TFRecordWriter(example_path) as file_writer:
  for _ in range(4):
    x, y = np.random.random(), np.random.random()

    record_bytes = tf.train.Example(features=tf.train.Features(feature={
        "x": tf.train.Feature(float_list=tf.train.FloatList(value=[x])),
        "y": tf.train.Feature(float_list=tf.train.FloatList(value=[y])),
    })).SerializeToString()
    file_writer.write(record_bytes)

# Read the data back out.
def decode_fn(record_bytes):
  return tf.io.parse_single_example(
      # Data
      record_bytes,

      # Schema
      {"x": tf.io.FixedLenFeature([], dtype=tf.float32),
       "y": tf.io.FixedLenFeature([], dtype=tf.float32)}
  )

for batch in tf.data.TFRecordDataset([example_path]).map(decode_fn):
  print("x = {x:.4f},  y = {y:.4f}".format(**batch))
x = 0.5488,  y = 0.7152
x = 0.6028,  y = 0.5449
x = 0.4237,  y = 0.6459
x = 0.4376,  y = 0.8918

Args
`filenames`	A `tf.string` tensor or `tf.data.Dataset` containing one or more filenames.
`compression_type`	(Optional.) A `tf.string` scalar evaluating to one of `""` (no compression), `"ZLIB"`, or `"GZIP"`.
`buffer_size`	(Optional.) A `tf.int64` scalar representing the number of bytes in the read buffer. If your input pipeline is I/O bottlenecked, consider setting this parameter to a value in the range 1-100 MB. If `None`, a sensible default for both local and remote file systems is used.
`num_parallel_reads`	(Optional.) A `tf.int64` scalar representing the number of files to read in parallel. If greater than one, the records of files read in parallel are output in an interleaved order. If your input pipeline is I/O bottlenecked, consider setting this parameter to a value greater than one to parallelize the I/O. If `None`, files will be read sequentially.
`name`	(Optional.) A name for the tf.data operation.
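
The optional arguments can be sketched as follows, assuming the files were written with GZIP compression. The file names are hypothetical and exist only for this illustration. Because `num_parallel_reads` is greater than one, records from different files may arrive interleaved, so the example sorts them before printing.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical file names, used only for this illustration.
tmp_dir = tempfile.mkdtemp()
paths = [os.path.join(tmp_dir, f"part-{i}.tfrecord.gz") for i in range(2)]

# Write three GZIP-compressed records to each file.
gzip_options = tf.io.TFRecordOptions(compression_type="GZIP")
for i, path in enumerate(paths):
  with tf.io.TFRecordWriter(path, options=gzip_options) as writer:
    for j in range(3):
      writer.write(f"file{i}-record{j}".encode("utf-8"))

# compression_type must match the way the files were written.
dataset = tf.data.TFRecordDataset(
    paths,
    compression_type="GZIP",
    num_parallel_reads=2)

# Sort because parallel reads interleave records across files.
records = sorted(r.decode("utf-8") for r in dataset.as_numpy_iterator())
print(records)
```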

Raises
`TypeError`	If any argument does not have the expected type.
`ValueError`	If any argument does not have the expected shape.

Attributes

element_spec

The type specification of an element of this dataset.

dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
dataset.element_spec
TensorSpec(shape=(), dtype=tf.int32, name=None)

For more information, read this guide.

as_numpy_iterator

Raises
`TypeError`	if an element contains a non-`Tensor` value.
`RuntimeError`	if eager execution is not enabled.

batch

Args
`batch_size`	A `tf.int64` scalar `tf.Tensor`, representing the number of consecutive elements of this dataset to combine in a single batch.
`drop_remainder`	(Optional.) A `tf.bool` scalar `tf.Tensor`, representing whether the last batch should be dropped in the case it has fewer than `batch_size` elements; the default behavior is not to drop the smaller batch.
`num_parallel_calls`	(Optional.) A `tf.int64` scalar `tf.Tensor`, representing the number of batches to compute asynchronously in parallel. If not specified, batches will be computed sequentially. If the value `tf.data.AUTOTUNE` is used, then the number of parallel calls is set dynamically based on available resources.
`deterministic`	(Optional.) When `num_parallel_calls` is specified, if this boolean is specified (`True` or `False`), it controls the order in which the transformation produces elements. If set to `False`, the transformation is allowed to yield elements out of order to trade determinism for performance. If not specified, the `tf.data.Options.deterministic` option (`True` by default) controls the behavior.
`name`	(Optional.) A name for the tf.data operation.
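
A small sketch of how `batch_size` and `drop_remainder` interact, using an arbitrary eight-element range as input:

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(8)

# Default: the final, smaller batch is kept.
kept = [b.tolist() for b in dataset.batch(3).as_numpy_iterator()]
print(kept)

# drop_remainder=True discards the final batch of two elements.
dropped = [
    b.tolist()
    for b in dataset.batch(3, drop_remainder=True).as_numpy_iterator()
]
print(dropped)
```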

bucket_by_sequence_length

Args
`element_length_func`	function from element in `Dataset` to `tf.int32`, determines the length of the element, which will determine the bucket it goes into.
`bucket_boundaries`	`list<int>`, upper length boundaries of the buckets.
`bucket_batch_sizes`	`list<int>`, batch size per bucket. Length should be `len(bucket_boundaries) + 1`.
`padded_shapes`	Nested structure of `tf.TensorShape` to pass to `tf.data.Dataset.padded_batch`. If not provided, will use `dataset.output_shapes`, which will result in variable length dimensions being padded out to the maximum length in each batch.
`padding_values`	Values to pad with, passed to `tf.data.Dataset.padded_batch`. Defaults to padding with 0.
`pad_to_bucket_boundary`	bool, if `False`, will pad dimensions with unknown size to maximum length in batch. If `True`, will pad dimensions with unknown size to bucket boundary minus 1 (i.e., the maximum length in each bucket), and caller must ensure that the source `Dataset` does not contain any elements with length longer than `max(bucket_boundaries)`.
`no_padding`	`bool`, indicates whether to pad the batch features (features need to be either of type `tf.sparse.SparseTensor` or of same shape).
`drop_remainder`	(Optional.) A `tf.bool` scalar `tf.Tensor`, representing whether the last batch should be dropped in the case it has fewer than `batch_size` elements; the default behavior is not to drop the smaller batch.
`name`	(Optional.) A name for the tf.data operation.
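
The bucketing arguments can be illustrated with a handful of variable-length sequences. This is a sketch with arbitrary values; the boundaries `[3, 5]` create three buckets, which is why `bucket_batch_sizes` has three entries.

```python
import tensorflow as tf

# Illustrative variable-length sequences (values are arbitrary).
elements = [[0], [1, 2, 3, 4], [5, 6, 7], [7, 8, 9, 10, 11],
            [13, 14, 15, 16, 19, 20], [21, 22]]

dataset = tf.data.Dataset.from_generator(
    lambda: elements,
    output_signature=tf.TensorSpec(shape=(None,), dtype=tf.int64))

# Buckets: length < 3, 3 <= length < 5, length >= 5; two elements per batch.
bucketed = dataset.bucket_by_sequence_length(
    element_length_func=lambda elem: tf.shape(elem)[0],
    bucket_boundaries=[3, 5],
    bucket_batch_sizes=[2, 2, 2])

# Each batch is padded only to the longest sequence it contains.
batches = list(bucketed.as_numpy_iterator())
for batch in batches:
  print(batch)
```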

cache

Args
`filename`	A `tf.string` scalar `tf.Tensor`, representing the name of a directory on the filesystem to use for caching elements in this Dataset. If a filename is not provided, the dataset will be cached in memory.
`name`	(Optional.) A name for the tf.data operation.
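
A minimal sketch of in-memory caching, where no `filename` is given. The first full pass fills the cache and later passes read from it.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(5).map(lambda x: x ** 2).cache()

# The first complete iteration populates the in-memory cache.
first_pass = [int(x) for x in dataset.as_numpy_iterator()]
# Later iterations yield the same elements from the cache.
second_pass = [int(x) for x in dataset.as_numpy_iterator()]
print(first_pass, second_pass)
```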

choose_from_datasets

Args
`datasets`	A non-empty list of `tf.data.Dataset` objects with compatible structure.
`choice_dataset`	A `tf.data.Dataset` of scalar `tf.int64` tensors between `0` and `len(datasets) - 1`.
`stop_on_empty_dataset`	If `True`, selection stops if it encounters an empty dataset. If `False`, it skips empty datasets. It is recommended to set it to `True`. Otherwise, the selected elements start off as the user intends, but may change as input datasets become empty. This can be difficult to detect since the dataset starts off looking correct. Defaults to `True`.

Raises
`TypeError`	If `datasets` or `choice_dataset` has the wrong type.
`ValueError`	If `datasets` is empty.
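
As a sketch, `choice_dataset` acts as a stream of indices into `datasets`. This assumes a TensorFlow version in which `choose_from_datasets` is available as a static method on `tf.data.Dataset`.

```python
import tensorflow as tf

datasets = [
    tf.data.Dataset.from_tensors("foo").repeat(),
    tf.data.Dataset.from_tensors("bar").repeat(),
    tf.data.Dataset.from_tensors("baz").repeat(),
]

# Indices 0, 1, 2 repeated three times select foo, bar, baz in turn.
choice_dataset = tf.data.Dataset.range(3).repeat(3)

result = tf.data.Dataset.choose_from_datasets(datasets, choice_dataset)
chosen = [v.decode("utf-8") for v in result.as_numpy_iterator()]
print(chosen)
```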

concatenate

Args
`dataset`	`Dataset` to be concatenated.
`name`	(Optional.) A name for the tf.data operation.
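
A short sketch: the elements of the argument dataset follow those of the dataset the method is called on.

```python
import tensorflow as tf

a = tf.data.Dataset.range(1, 4)  # 1, 2, 3
b = tf.data.Dataset.range(4, 8)  # 4, 5, 6, 7

combined = [int(x) for x in a.concatenate(b).as_numpy_iterator()]
print(combined)
```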

enumerate

Args
`start`	A `tf.int64` scalar `tf.Tensor`, representing the start value for enumeration.
`name`	Optional. A name for the tf.data operations used by `enumerate`.
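
A sketch of `start` shifting the counter that is paired with each element:

```python
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3]).enumerate(start=5)

# Each element becomes an (index, value) pair.
pairs = [(int(i), int(x)) for i, x in dataset.as_numpy_iterator()]
print(pairs)
```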

filter

Args
`predicate`	A function mapping a dataset element to a boolean.
`name`	(Optional.) A name for the tf.data operation.
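
A sketch with an arbitrary predicate that keeps only even values:

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10).filter(lambda x: x % 2 == 0)
evens = [int(x) for x in dataset.as_numpy_iterator()]
print(evens)
```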

flat_map

Args
`map_func`	A function mapping a dataset element to a dataset.
`name`	(Optional.) A name for the tf.data operation.
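
A sketch in which `map_func` turns each row into its own dataset and the results are flattened in order:

```python
import tensorflow as tf

rows = tf.data.Dataset.from_tensor_slices([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Each row becomes a dataset of three scalars; flat_map joins them end to end.
flat = rows.flat_map(lambda row: tf.data.Dataset.from_tensor_slices(row))
flat_values = [int(x) for x in flat.as_numpy_iterator()]
print(flat_values)
```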

from_generator

Args
`generator`	A callable object that returns an object that supports the `iter()` protocol. If `args` is not specified, `generator` must take no arguments; otherwise it must take as many arguments as there are values in `args`.
`output_types`	(Optional.) A (nested) structure of `tf.DType` objects corresponding to each component of an element yielded by `generator`.
`output_shapes`	(Optional.) A (nested) structure of `tf.TensorShape` objects corresponding to each component of an element yielded by `generator`.
`args`	(Optional.) A tuple of `tf.Tensor` objects that will be evaluated and passed to `generator` as NumPy-array arguments.
`output_signature`	(Optional.) A (nested) structure of `tf.TypeSpec` objects corresponding to each component of an element yielded by `generator`.
`name`	(Optional.) A name for the tf.data operations used by `from_generator`.
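
A sketch combining `args` and `output_signature`. The generator `count_up_to` is a hypothetical helper; because `args` values reach it as NumPy values, it converts its argument to `int` first.

```python
import tensorflow as tf

def count_up_to(stop):
  # `stop` arrives as a NumPy value, so convert it before use.
  for i in range(int(stop)):
    yield i

dataset = tf.data.Dataset.from_generator(
    count_up_to,
    args=(4,),
    output_signature=tf.TensorSpec(shape=(), dtype=tf.int32))

generated = [int(x) for x in dataset.as_numpy_iterator()]
print(generated)
```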

from_tensor_slices

Args
`tensors`	A dataset element, whose components have the same first dimension. Supported values are documented here.
`name`	(Optional.) A name for the tf.data operation.

from_tensors

Args
`tensors`	A dataset "element". Supported values are documented here.
`name`	(Optional.) A name for the tf.data operation.
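
The difference between `from_tensors` and `from_tensor_slices` is easiest to see on the same input. In this sketch the data is an arbitrary 3x2 list.

```python
import tensorflow as tf

data = [[1, 2], [3, 4], [5, 6]]

# from_tensors: the whole input is a single element of shape (3, 2).
whole = tf.data.Dataset.from_tensors(data)
# from_tensor_slices: the input is sliced along its first dimension,
# giving three elements of shape (2,).
sliced = tf.data.Dataset.from_tensor_slices(data)

whole_count = len(list(whole.as_numpy_iterator()))
sliced_count = len(list(sliced.as_numpy_iterator()))
print(whole_count, whole.element_spec.shape)
print(sliced_count, sliced.element_spec.shape)
```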

group_by_window

Args
`key_func`	A function mapping a nested structure of tensors (having shapes and types defined by `self.output_shapes` and `self.output_types`) to a scalar `tf.int64` tensor.
`reduce_func`	A function mapping a key and a dataset of up to `window_size` consecutive elements matching that key to another dataset.
`window_size`	A `tf.int64` scalar `tf.Tensor`, representing the number of consecutive elements matching the same key to combine in a single batch, which will be passed to `reduce_func`. Mutually exclusive with `window_size_func`.
`window_size_func`	A function mapping a key to a `tf.int64` scalar `tf.Tensor`, representing the number of consecutive elements matching the same key to combine in a single batch, which will be passed to `reduce_func`. Mutually exclusive with `window_size`.
`name`	(Optional.) A name for the tf.data operation.
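
A sketch that groups integers by parity. `key_func` assigns each element a key, and `reduce_func` batches each window of up to `window_size` same-key elements.

```python
import tensorflow as tf

window_size = 5
dataset = tf.data.Dataset.range(10).group_by_window(
    key_func=lambda x: x % 2,  # key 0 for even, 1 for odd
    reduce_func=lambda key, window: window.batch(window_size),
    window_size=window_size)

groups = [g.tolist() for g in dataset.as_numpy_iterator()]
print(groups)
```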

interleave

Args
`map_func`	A function that takes a dataset element and returns a `tf.data.Dataset`.
`cycle_length`	(Optional.) The number of input elements that will be processed concurrently. If not set, the tf.data runtime decides what it should be based on available CPU. If `num_parallel_calls` is set to `tf.data.AUTOTUNE`, the `cycle_length` argument identifies the maximum degree of parallelism.
`block_length`	(Optional.) The number of consecutive elements to produce from each input element before cycling to another input element. If not set, defaults to 1.
`num_parallel_calls`	(Optional.) If specified, the implementation creates a threadpool, which is used to fetch inputs from cycle elements asynchronously and in parallel. The default behavior is to fetch inputs from cycle elements synchronously with no parallelism. If the value `tf.data.AUTOTUNE` is used, then the number of parallel calls is set dynamically based on available CPU.
`deterministic`	(Optional.) When `num_parallel_calls` is specified, if this boolean is specified (`True` or `False`), it controls the order in which the transformation produces elements. If set to `False`, the transformation is allowed to yield elements out of order to trade determinism for performance. If not specified, the `tf.data.Options.deterministic` option (`True` by default) controls the behavior.
`name`	(Optional.) A name for the tf.data operation.
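
A sketch of how `cycle_length` and `block_length` shape the output order. Two input elements are open at a time, and up to four consecutive values are taken from each before moving on.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(1, 6)  # 1, 2, 3, 4, 5

# Each input element x is expanded into a dataset of six copies of x.
interleaved = dataset.interleave(
    lambda x: tf.data.Dataset.from_tensors(x).repeat(6),
    cycle_length=2,
    block_length=4)

order = [int(x) for x in interleaved.as_numpy_iterator()]
print(order)
```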

list_files

Args
`file_pattern`	A string, a list of strings, or a `tf.Tensor` of string type (scalar or vector), representing the filename glob (i.e. shell wildcard) pattern(s) that will be matched.
`shuffle`	(Optional.) If `True`, the file names will be shuffled randomly. Defaults to `True`.
`seed`	(Optional.) A `tf.int64` scalar `tf.Tensor`, representing the random seed that will be used to create the distribution. See `tf.random.set_seed` for behavior.
`name`	Optional. A name for the tf.data operations used by `list_files`.

padded_batch

Raises
`ValueError`	If a component has an unknown rank, and the `padded_shapes` argument is not set.
`TypeError`	If a component is of an unsupported type. The list of supported types is documented in https://www.tensorflow.org/guide/data#dataset_structure
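
A sketch of `padded_batch` on elements of known rank but varying length. With `padded_shapes` left unset, each batch is padded with zeros to the longest element it contains.

```python
import tensorflow as tf

# Element x is a vector of length x filled with the value x.
dataset = tf.data.Dataset.range(1, 5).map(lambda x: tf.fill([x], x))

padded = dataset.padded_batch(2)
padded_batches = [b.tolist() for b in padded.as_numpy_iterator()]
print(padded_batches)
```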

random

Args
`seed`	(Optional.) If specified, the dataset produces a deterministic sequence of values.
`name`	(Optional.) A name for the tf.data operation.

reduce

Args
`initial_state`	An element representing the initial state of the transformation.
`reduce_func`	A function that maps `(old_state, input_element)` to `new_state`. It must take two arguments and return a new element. The structure of `new_state` must match the structure of `initial_state`.
`name`	(Optional.) A name for the tf.data operation.
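
A sketch of two reductions over the same dataset. The state keeps the structure of `initial_state` (a scalar `int64` here) at every step.

```python
import numpy as np
import tensorflow as tf

dataset = tf.data.Dataset.range(5)

# Count the elements: the input element itself is ignored.
count = dataset.reduce(np.int64(0), lambda state, _: state + 1)
# Sum the elements.
total = dataset.reduce(np.int64(0), lambda state, x: state + x)
print(int(count), int(total))
```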

rejection_resample

Args
`class_func`	A function mapping an element of the input dataset to a scalar `tf.int32` tensor. Values should be in `[0, num_classes)`.
`target_dist`	A floating point type tensor, shaped `[num_classes]`.
`initial_dist`	(Optional.) A floating point type tensor, shaped `[num_classes]`. If not provided, the true class distribution is estimated live in a streaming fashion.
`seed`	(Optional.) Python integer seed for the resampler.
`name`	(Optional.) A name for the tf.data operation.

repeat

Args
`count`	(Optional.) A `tf.int64` scalar `tf.Tensor`, representing the number of times the dataset should be repeated. The default behavior (if `count` is `None` or `-1`) is for the dataset to be repeated indefinitely.
`name`	(Optional.) A name for the tf.data operation.
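
A sketch of a finite and an indefinite repeat. The indefinite case is bounded with `take` so that iteration terminates.

```python
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])

twice = [int(x) for x in dataset.repeat(2).as_numpy_iterator()]
# repeat() with no count never ends on its own, so cap it with take().
endless_prefix = [int(x) for x in dataset.repeat().take(7).as_numpy_iterator()]
print(twice, endless_prefix)
```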

Methods

apply

as_numpy_iterator

batch

bucket_by_sequence_length

cache

cardinality

choose_from_datasets

concatenate

enumerate

filter

flat_map

from_generator

from_tensor_slices

from_tensors

get_single_element

group_by_window

interleave

list_files

map

options

padded_batch

prefetch

random

range

reduce

rejection_resample

repeat

sample_from_datasets

scan

shard

shuffle

skip

snapshot

take

take_while

unbatch

unique

window

with_options

zip

__bool__

__iter__

__len__

__nonzero__

sample_from_datasets

Args
`datasets`	A non-empty list of `tf.data.Dataset` objects with compatible structure.
`weights`	(Optional.) A list or Tensor of `len(datasets)` floating-point values where `weights[i]` represents the probability to sample from `datasets[i]`, or a `tf.data.Dataset` object where each element is such a list. Defaults to a uniform distribution across `datasets`.
`seed`	(Optional.) A `tf.int64` scalar `tf.Tensor`, representing the random seed that will be used to create the distribution. See `tf.random.set_seed` for behavior.
`stop_on_empty_dataset`	If `True`, sampling stops if it encounters an empty dataset. If `False`, it skips empty datasets. It is recommended to set it to `True`. Otherwise, the distribution of samples starts off as the user intends, but may change as input datasets become empty. This can be difficult to detect since the dataset starts off looking correct. Defaults to `False` for backward compatibility.
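
A sketch of weighted sampling from two small datasets, assuming a TensorFlow version in which `sample_from_datasets` is available as a static method on `tf.data.Dataset`. The order of the samples depends on the seed, so only the multiset of values is inspected. With `stop_on_empty_dataset=False`, sampling continues until every input is exhausted.

```python
import tensorflow as tf

zeros = tf.data.Dataset.from_tensors(0).repeat(3)
ones = tf.data.Dataset.from_tensors(1).repeat(3)

sampled = tf.data.Dataset.sample_from_datasets(
    [zeros, ones],
    weights=[0.5, 0.5],
    seed=42,
    stop_on_empty_dataset=False)

# Sort because the sampling order is random.
samples = sorted(int(x) for x in sampled.as_numpy_iterator())
print(samples)
```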

scan

Args
`initial_state`	A nested structure of tensors, representing the initial state of the accumulator.
`scan_func`	A function that maps `(old_state, input_element)` to `(new_state, output_element)`. It must take two arguments and return a pair of nested structures of tensors. The `new_state` must match the structure of `initial_state`.
`name`	(Optional.) A name for the tf.data operation.
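
A sketch of a running sum, where `scan_func` returns the same value as both the new state and the output element:

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10).scan(
    initial_state=tf.constant(0, dtype=tf.int64),
    scan_func=lambda state, x: (state + x, state + x))

running_sum = [int(x) for x in dataset.as_numpy_iterator()]
print(running_sum)
```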

shard

Args
`num_shards`	A `tf.int64` scalar `tf.Tensor`, representing the number of shards operating in parallel.
`index`	A `tf.int64` scalar `tf.Tensor`, representing the worker index.
`name`	(Optional.) A name for the tf.data operation.
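
A sketch of sharding across three hypothetical workers. Each worker keeps every third element, offset by its `index`.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10)

shards = [
    [int(x) for x in dataset.shard(num_shards=3, index=i).as_numpy_iterator()]
    for i in range(3)
]
print(shards)
```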

snapshot

Args
`path`	Required. A directory to use for storing / loading the snapshot to / from.
`compression`	Optional. The type of compression to apply to the snapshot written to disk. Supported options are `GZIP`, `SNAPPY`, `AUTO` or None. Defaults to `AUTO`, which attempts to pick an appropriate compression algorithm for the dataset.
`reader_func`	Optional. A function to control how to read data from snapshot shards.
`shard_func`	Optional. A function to control how to shard data when writing a snapshot.
`name`	(Optional.) A name for the tf.data operation.

take_while

Args
`predicate`	A function that maps a nested structure of tensors (having shapes and types defined by `self.output_shapes` and `self.output_types`) to a scalar `tf.bool` tensor.
`name`	(Optional.) A name for the tf.data operation.
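
A sketch showing that iteration stops at the first element for which `predicate` is false, even if later elements would satisfy it:

```python
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices([1, 2, 9, 3, 4])

# 9 fails the predicate, so 3 and 4 are never produced.
prefix = [int(x) for x in dataset.take_while(lambda x: x < 5).as_numpy_iterator()]
print(prefix)
```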

window

Args
`size`	A `tf.int64` scalar `tf.Tensor`, representing the number of elements of the input dataset to combine into a window. Must be positive.
`shift`	(Optional.) A `tf.int64` scalar `tf.Tensor`, representing the number of input elements by which the window moves in each iteration. Defaults to `size`. Must be positive.
`stride`	(Optional.) A `tf.int64` scalar `tf.Tensor`, representing the stride of the input elements in the sliding window. Must be positive. The default value of 1 means "retain every input element".
`drop_remainder`	(Optional.) A `tf.bool` scalar `tf.Tensor`, representing whether the last windows should be dropped if their size is smaller than `size`.
`name`	(Optional.) A name for the tf.data operation.
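
A sketch of overlapping windows. Each window is itself a dataset, so the example batches inside `flat_map` to turn every window into a single tensor.

```python
import tensorflow as tf

# Windows of three elements that advance by one element each time.
windows = tf.data.Dataset.range(7).window(3, shift=1, drop_remainder=True)

# Flatten the dataset of windows into a dataset of tensors.
flattened = windows.flat_map(lambda w: w.batch(3))
window_values = [w.tolist() for w in flattened.as_numpy_iterator()]
print(window_values)
```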

with_options

Args
`options`	A `tf.data.Options` that identifies the options to use.
`name`	(Optional.) A name for the tf.data operation.
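
A sketch of attaching options to a dataset, assuming a TensorFlow version in which `tf.data.Options` exposes the `deterministic` attribute:

```python
import tensorflow as tf

options = tf.data.Options()
options.deterministic = False  # allow out-of-order elements for speed

dataset = tf.data.Dataset.range(4).with_options(options)

# The options travel with the dataset and can be read back.
is_deterministic = dataset.options().deterministic
print(is_deterministic)
```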

zip

Args
`datasets`	A (nested) structure of datasets.
`name`	(Optional.) A name for the tf.data operation.
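
A sketch of zipping two datasets passed as a tuple, which is the nested structure mirrored by each output element:

```python
import tensorflow as tf

a = tf.data.Dataset.range(1, 4)  # 1, 2, 3
b = tf.data.Dataset.range(4, 7)  # 4, 5, 6

zipped = tf.data.Dataset.zip((a, b))
zipped_pairs = [(int(x), int(y)) for x, y in zipped.as_numpy_iterator()]
print(zipped_pairs)
```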