TFDS and determinism

This document explains:

The TFDS guarantees on determinism
The order in which TFDS reads examples
Various caveats and gotchas

Setup

Datasets

Some context is needed to understand how TFDS reads the data.

During generation, TFDS writes the original data into standardized .tfrecord files. For big datasets, multiple .tfrecord files are created, each containing multiple examples. Each of these .tfrecord files is called a shard.

This guide uses imagenet, which has 1024 shards:

import re
import tensorflow_datasets as tfds

imagenet = tfds.builder('imagenet2012')

num_shards = imagenet.info.splits['train'].num_shards
num_examples = imagenet.info.splits['train'].num_examples
print(f'imagenet has {num_shards} shards ({num_examples} examples)')

imagenet has 1024 shards (1281167 examples)

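Finding the dataset example ids

You can skip to the next section if you only want to know about determinism.

Each dataset example is uniquely identified by an id (e.g. 'imagenet2012-train.tfrecord-01023-of-01024__32'). You can recover this id by setting read_config.add_tfds_id = True, which adds a 'tfds_id' key to the dict yielded by the tf.data.Dataset.

In this tutorial, we define a small util that prints the example ids of the dataset (converted to integers to be more human-readable).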

def load_dataset(builder, **as_dataset_kwargs):
  """Load the dataset with the tfds_id."""
  read_config = as_dataset_kwargs.pop('read_config', tfds.ReadConfig())
  read_config.add_tfds_id = True  # Set `True` to return the 'tfds_id' key
  return builder.as_dataset(read_config=read_config, **as_dataset_kwargs)

def print_ex_ids(
    builder,
    *,
    take: int,
    skip: int = None,
    **as_dataset_kwargs,
) -> None:
  """Print the example ids from the given dataset split."""
  ds = load_dataset(builder, **as_dataset_kwargs)
  if skip:
    ds = ds.skip(skip)
  ds = ds.take(take)
  exs = [ex['tfds_id'].numpy().decode('utf-8') for ex in ds]
  exs = [id_to_int(tfds_id, builder=builder) for tfds_id in exs]
  print(exs)

def id_to_int(tfds_id: str, builder) -> int:
  """Convert the tfds_id into a more human-readable global example index."""
  match = re.match(r'\w+-(\w+)\.\w+-(\d+)-of-\d+__(\d+)', tfds_id)
  split_name, shard_id, ex_id = match.groups()
  split_info = builder.info.splits[split_name]
  return sum(split_info.shard_lengths[:int(shard_id)]) + int(ex_id)
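
To make the id conversion easier to follow without downloading a dataset, here is a minimal standalone sketch of the same idea. The helper names parse_tfds_id and global_index, the 'demo' dataset name and the shard lengths are hypothetical and exist only for illustration; only the id layout comes from the example id shown above.

```python
import re

# Layout taken from the example id above:
#   <dataset>-<split>.<suffix>-<shard>-of-<num_shards>__<index in shard>
_TFDS_ID_RE = re.compile(r'\w+-(\w+)\.\w+-(\d+)-of-(\d+)__(\d+)')


def parse_tfds_id(tfds_id):
    """Split a tfds_id into (split_name, shard_id, num_shards, ex_id)."""
    match = _TFDS_ID_RE.fullmatch(tfds_id)
    if match is None:
        raise ValueError(f'Unexpected tfds_id format: {tfds_id!r}')
    split_name, shard_id, num_shards, ex_id = match.groups()
    return split_name, int(shard_id), int(num_shards), int(ex_id)


def global_index(tfds_id, shard_lengths):
    """Global example index: examples in earlier shards + index in shard."""
    _, shard_id, _, ex_id = parse_tfds_id(tfds_id)
    return sum(shard_lengths[:shard_id]) + ex_id


# Hypothetical split with three shards of 10, 12 and 11 examples.
print(parse_tfds_id('imagenet2012-train.tfrecord-01023-of-01024__32'))
print(global_index('demo-train.tfrecord-00002-of-00003__5', [10, 12, 11]))
```

The second call adds the lengths of the two earlier shards (10 + 12) to the position inside the shard (5), which is the same arithmetic id_to_int performs with split_info.shard_lengths.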

Determinism when reading

This section explains the deterministic guarantees of tfds.load.

With `shuffle_files=False` (default)

By default, TFDS yields examples deterministically (shuffle_files=False).

# Same as: imagenet.as_dataset(split='train').take(20)
print_ex_ids(imagenet, split='train', take=20)
print_ex_ids(imagenet, split='train', take=20)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1251, 1252, 1253, 1254]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1251, 1252, 1253, 1254]

For performance, TFDS reads multiple shards at the same time using tf.data.Dataset.interleave. In this example, TFDS switches to shard 2 after reading 16 examples (..., 14, 15, 1251, 1252, ...). More on interleave below.

Similarly, the subsplit API is also deterministic:

print_ex_ids(imagenet, split='train[67%:84%]', take=20)
print_ex_ids(imagenet, split='train[67%:84%]', take=20)

[858382, 858383, 858384, 858385, 858386, 858387, 858388, 858389, 858390, 858391, 858392, 858393, 858394, 858395, 858396, 858397, 859533, 859534, 859535, 859536]
[858382, 858383, 858384, 858385, 858386, 858387, 858388, 858389, 858390, 858391, 858392, 858393, 858394, 858395, 858396, 858397, 859533, 859534, 859535, 859536]

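If you are training for more than one epoch, the above setup is not recommended, because every epoch reads the shards in the same order (so randomness is limited to the ds = ds.shuffle(buffer) buffer size).

With `shuffle_files=True`

With shuffle_files=True, shards are shuffled for each epoch, so reading is no longer deterministic.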

print_ex_ids(imagenet, split='train', shuffle_files=True, take=20)
print_ex_ids(imagenet, split='train', shuffle_files=True, take=20)

[568017, 329050, 329051, 329052, 329053, 329054, 329056, 329055, 568019, 568020, 568021, 568022, 568023, 568018, 568025, 568024, 568026, 568028, 568030, 568031]
[43790, 43791, 43792, 43793, 43796, 43794, 43797, 43798, 43795, 43799, 43800, 43801, 43802, 43803, 43804, 43805, 43806, 43807, 43809, 43810]

Note: Setting shuffle_files=True also disables deterministic in tf.data.Options to give a performance boost. So even small datasets that only have a single shard (like mnist) become non-deterministic.

See the recipe below to get deterministic file shuffling.

Determinism caveat: interleave args

Changing read_config.interleave_cycle_length or read_config.interleave_block_length changes the order of the examples.

TFDS relies on tf.data.Dataset.interleave to load only a few shards at once, which improves performance and reduces memory usage.

The example order is only guaranteed to be the same for fixed values of the interleave args. See the interleave documentation to understand what cycle_length and block_length correspond to.

cycle_length=16, block_length=16 (default, same as above):

print_ex_ids(imagenet, split='train', take=20)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1251, 1252, 1253, 1254]

cycle_length=3, block_length=2:

read_config = tfds.ReadConfig(
    interleave_cycle_length=3,
    interleave_block_length=2,
)
print_ex_ids(imagenet, split='train', read_config=read_config, take=20)

[0, 1, 1251, 1252, 2502, 2503, 2, 3, 1253, 1254, 2504, 2505, 4, 5, 1255, 1256, 2506, 2507, 6, 7]

In the second example, the dataset reads 2 examples (block_length=2) in a shard, then switches to the next shard. Every 2 * 3 examples (cycle_length=3), it goes back to the first shard (shard0-ex0, shard0-ex1, shard1-ex0, shard1-ex1, shard2-ex0, shard2-ex1, shard0-ex2, shard0-ex3, shard1-ex2, shard1-ex3, shard2-ex2, ...).
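
The reading pattern described above can be reproduced with a small pure-Python simulation. This is a simplified sketch of round-robin interleaving, not the tf.data implementation: the function name interleave_order is hypothetical, and the shard lengths are assumed to be 1251 examples each (which matches the ids printed above for the first shards).

```python
from itertools import islice


def interleave_order(shard_lengths, cycle_length, block_length):
    """Yield global example ids in a simplified interleave order.

    `cycle_length` shards are open at once; `block_length` examples are read
    from each open shard in turn. An exhausted shard is replaced by the next
    unopened one. Edge cases of tf.data are deliberately not modelled.
    """
    # Global id of the first example of each shard.
    offsets = [0]
    for length in shard_lengths[:-1]:
        offsets.append(offsets[-1] + length)
    # Lazily created iterators over the global ids of each shard.
    pending = (
        iter(range(offset, offset + length))
        for offset, length in zip(offsets, shard_lengths)
    )
    slots = list(islice(pending, cycle_length))
    while slots:
        next_slots = []
        for shard_iter in slots:
            block = list(islice(shard_iter, block_length))
            yield from block
            if len(block) == block_length:
                next_slots.append(shard_iter)  # Shard may still have data.
            else:
                replacement = next(pending, None)  # Open the next shard.
                if replacement is not None:
                    next_slots.append(replacement)
        slots = next_slots


# Assumed shard lengths: 16 shards of 1251 examples each.
shards = [1251] * 16
print(list(islice(interleave_order(shards, 3, 2), 20)))
print(list(islice(interleave_order(shards, 16, 16), 20)))
```

Under these assumptions, the first call yields the same first 20 ids as the cycle_length=3, block_length=2 run, and the second yields the same ids as the default run.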

Subsplit and example order

Each example has an id 0, 1, ..., num_examples-1. The subsplit API selects a slice of examples (e.g. train[:x] selects 0, 1, ..., x-1).

However, within the subsplit, examples are not read in increasing id order (because of shards and interleave).

More specifically, ds.take(x) and split='train[:x]' are not equivalent!

This can be seen easily in the interleave example above, where examples come from different shards.

print_ex_ids(imagenet, split='train', take=25)  # tfds.load(..., split='train').take(25)
print_ex_ids(imagenet, split='train[:25]', take=-1)  # tfds.load(..., split='train[:25]')

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1251, 1252, 1253, 1254, 1255, 1256, 1257, 1258, 1259]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]

After the first 16 (block_length) examples, .take(25) switches to the next shard, while train[:25] continues reading examples from the first shard.
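
The difference can also be worked out by hand. The sketch below assumes equal-sized shards of 1251 examples and the default cycle_length=16, block_length=16, and is only valid while none of the open shards is exhausted; interleaved_id is a hypothetical helper, not a TFDS function.

```python
def interleaved_id(position, shard_length, cycle_length=16, block_length=16):
    """Global example id found at `position` of the interleaved stream.

    Simplifying assumptions: every open shard holds `shard_length` examples
    and none of them has been exhausted yet.
    """
    round_index, within_round = divmod(position, cycle_length * block_length)
    shard, within_block = divmod(within_round, block_length)
    return shard * shard_length + round_index * block_length + within_block


# .take(25) on the full split: 16 ids from shard 0, then 9 ids from shard 1.
take_25 = [interleaved_id(p, shard_length=1251) for p in range(25)]
# split='train[:25]': only ids 0..24 are selected, and all live in shard 0.
split_25 = list(range(25))

print(take_25)
print(split_25)
```

Both lists have 25 entries, but only the first 16 ids are shared, which is why the two ways of selecting examples cannot be swapped freely.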

Recipes

Get deterministic file shuffling

There are 2 ways to get deterministic shuffling:

Setting the shuffle_seed. Note: This requires changing the seed at each epoch, otherwise shards will be read in the same order in every epoch.

read_config = tfds.ReadConfig(
    shuffle_seed=32,
)

# Deterministic order, different from the default shuffle_files=False above
print_ex_ids(imagenet, split='train', shuffle_files=True, read_config=read_config, take=22)
print_ex_ids(imagenet, split='train', shuffle_files=True, read_config=read_config, take=22)

[176411, 176412, 176413, 176414, 176415, 176416, 176417, 176418, 176419, 176420, 176421, 176422, 176423, 176424, 176425, 176426, 710647, 710648, 710649, 710650, 710651, 710652]
[176411, 176412, 176413, 176414, 176415, 176416, 176417, 176418, 176419, 176420, 176421, 176422, 176423, 176424, 176425, 176426, 710647, 710648, 710649, 710650, 710651, 710652]
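
To illustrate why the seed has to change between epochs, here is a pure-Python sketch. It does not reproduce the shuffling algorithm TFDS uses internally; shard_order is a hypothetical helper that only shows the principle of deriving a reproducible, epoch-dependent shard order from a base seed.

```python
import random


def shard_order(num_shards, base_seed, epoch):
    """Reproducible shard permutation that differs from epoch to epoch."""
    rng = random.Random(base_seed + epoch)  # One seed per epoch.
    order = list(range(num_shards))
    rng.shuffle(order)
    return order


# The same (seed, epoch) pair always gives the same order ...
print(shard_order(1024, base_seed=32, epoch=0)[:5])
print(shard_order(1024, base_seed=32, epoch=0)[:5])
# ... while another epoch gives a different, equally reproducible order.
print(shard_order(1024, base_seed=32, epoch=1)[:5])
```

With TFDS, the analogous approach would be to build a fresh tfds.ReadConfig with a shuffle_seed derived from the epoch number (for example base_seed + epoch) each time the dataset is created for a new epoch.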

Using experimental_interleave_sort_fn: This gives full control over which shards are read and in which order, rather than relying on the ds.shuffle order.

def _reverse_order(file_instructions):
  return list(reversed(file_instructions))

read_config = tfds.ReadConfig(
    experimental_interleave_sort_fn=_reverse_order,
)

# Last shard (01023-of-01024) is read first
print_ex_ids(imagenet, split='train', read_config=read_config, take=5)

[1279916, 1279917, 1279918, 1279919, 1279920]

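Get a deterministic preemptible pipeline

This one is more complicated. There is no easy, satisfactory solution.

1. Without ds.shuffle and with deterministic shuffling, in theory it should be possible to count the examples that have been read and deduce which examples have been read within each shard (as a function of cycle_length, block_length and shard order). The skip and take for each shard could then be injected through experimental_interleave_sort_fn.
2. With ds.shuffle, it is likely impossible without replaying the full training pipeline. It would require saving the ds.shuffle buffer state to deduce which examples have been read. Examples could be non-contiguous (e.g. shard5_ex2 and shard5_ex4 read, but not shard5_ex3).
3. With ds.shuffle, one way would be to save all shard ids/example ids that were read (deduced from tfds_id), then deduce the file instructions from that.

The simplest case for 1. is to make .skip(x).take(y) match train[x:x+y]. It requires:

Setting cycle_length=1 (so shards are read sequentially)
Setting shuffle_files=False
Not using ds.shuffle

It should only be used on huge datasets where the training is only 1 epoch. Examples would be read in the default shuffle order.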

read_config = tfds.ReadConfig(
    interleave_cycle_length=1,  # Read shards sequentially
)

print_ex_ids(imagenet, split='train', read_config=read_config, skip=40, take=22)
# If the job gets pre-empted, using the subsplit API will skip at most `len(shard0)`
print_ex_ids(imagenet, split='train[40:]', read_config=read_config, take=22)

[40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61]
[40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61]
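
As a sketch of how a restarted job could use this, the hypothetical helpers below turn a checkpointed step count into the subsplit string to resume from. They assume the conditions listed above (cycle_length=1, shuffle_files=False, no ds.shuffle) and that every step consumed exactly batch_size examples; neither function is part of TFDS.

```python
def examples_seen_at(step, batch_size):
    """Number of examples consumed after `step` full batches (assumption)."""
    return step * batch_size


def resume_split(split_name, examples_seen):
    """Subsplit string that starts right after the examples already read."""
    if examples_seen < 0:
        raise ValueError('examples_seen must be non-negative')
    return f'{split_name}[{examples_seen}:]'


# A job pre-empted after 5 steps with batches of 8 has consumed 40 examples.
print(resume_split('train', examples_seen_at(step=5, batch_size=8)))
```

The resulting string can be passed as the split argument, as in the train[40:] call above.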

Find which shards/examples are read for a given subsplit

With tfds.core.DatasetInfo, you have direct access to the read instructions.

imagenet.info.splits['train[44%:45%]'].file_instructions

[FileInstruction(filename='imagenet2012-train.tfrecord-00450-of-01024', skip=700, take=-1, num_examples=551),
 FileInstruction(filename='imagenet2012-train.tfrecord-00451-of-01024', skip=0, take=-1, num_examples=1251),
 FileInstruction(filename='imagenet2012-train.tfrecord-00452-of-01024', skip=0, take=-1, num_examples=1251),
 FileInstruction(filename='imagenet2012-train.tfrecord-00453-of-01024', skip=0, take=-1, num_examples=1251),
 FileInstruction(filename='imagenet2012-train.tfrecord-00454-of-01024', skip=0, take=-1, num_examples=1252),
 FileInstruction(filename='imagenet2012-train.tfrecord-00455-of-01024', skip=0, take=-1, num_examples=1251),
 FileInstruction(filename='imagenet2012-train.tfrecord-00456-of-01024', skip=0, take=-1, num_examples=1251),
 FileInstruction(filename='imagenet2012-train.tfrecord-00457-of-01024', skip=0, take=-1, num_examples=1251),
 FileInstruction(filename='imagenet2012-train.tfrecord-00458-of-01024', skip=0, take=-1, num_examples=1251),
 FileInstruction(filename='imagenet2012-train.tfrecord-00459-of-01024', skip=0, take=-1, num_examples=1251),
 FileInstruction(filename='imagenet2012-train.tfrecord-00460-of-01024', skip=0, take=1001, num_examples=1001)]
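
The mapping from a slice of example ids to per-shard skip/take values can be sketched in a few lines. This is an illustration of the arithmetic, not the TFDS implementation: slice_to_instructions is a hypothetical helper, the shard lengths are made up, and it reports explicit take counts where the output above uses take=-1 for "read to the end of the shard".

```python
def slice_to_instructions(shard_lengths, start, stop):
    """Return (shard_index, skip, take) tuples covering ids in [start, stop)."""
    instructions = []
    shard_start = 0
    for index, length in enumerate(shard_lengths):
        shard_stop = shard_start + length
        # Overlap between the requested slice and this shard's id range.
        lo = max(start, shard_start)
        hi = min(stop, shard_stop)
        if lo < hi:
            instructions.append((index, lo - shard_start, hi - lo))
        shard_start = shard_stop
    return instructions


# Hypothetical split with shards of 10, 12 and 11 examples; read ids 7..24.
print(slice_to_instructions([10, 12, 11], start=7, stop=25))
```

As in the real output, only the first and last shards of the slice are read partially; every shard in between is read in full.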