TFDS and determinism

This document explains:

The determinism guarantees that TFDS provides
The order in which TFDS reads examples
Various caveats and gotchas

Setup

Datasets

Some context is needed to understand how TFDS reads the data.

During generation, TFDS writes the original data into standardized .tfrecord files. For big datasets, multiple .tfrecord files are created, each containing multiple examples. We call each .tfrecord file a shard.

This guide uses imagenet, which has 1024 shards:

import re
import tensorflow_datasets as tfds

imagenet = tfds.builder('imagenet2012')

num_shards = imagenet.info.splits['train'].num_shards
num_examples = imagenet.info.splits['train'].num_examples
print(f'imagenet has {num_shards} shards ({num_examples} examples)')

imagenet has 1024 shards (1281167 examples)

Finding the dataset example ids

You can skip to the next section if you only want to know about determinism.

Each dataset example is uniquely identified by an id (e.g. 'imagenet2012-train.tfrecord-01023-of-01024__32'). You can recover this id by setting read_config.add_tfds_id = True, which adds a 'tfds_id' key to the dict yielded by the tf.data.Dataset.

In this tutorial, we define a small util which prints the example ids of the dataset (converted to integers to be more human-readable):

def load_dataset(builder, **as_dataset_kwargs):
  """Load the dataset with the tfds_id."""
  read_config = as_dataset_kwargs.pop('read_config', tfds.ReadConfig())
  read_config.add_tfds_id = True  # Set `True` to return the 'tfds_id' key
  return builder.as_dataset(read_config=read_config, **as_dataset_kwargs)

def print_ex_ids(
    builder,
    *,
    take: int,
    skip: int = 0,
    **as_dataset_kwargs,
) -> None:
  """Print the example ids from the given dataset split."""
  ds = load_dataset(builder, **as_dataset_kwargs)
  if skip:
    ds = ds.skip(skip)
  ds = ds.take(take)
  exs = [ex['tfds_id'].numpy().decode('utf-8') for ex in ds]
  exs = [id_to_int(tfds_id, builder=builder) for tfds_id in exs]
  print(exs)

def id_to_int(tfds_id: str, builder) -> int:
  """Convert the tfds_id to a more human-readable global example index."""
  match = re.match(r'\w+-(\w+)\.\w+-(\d+)-of-\d+__(\d+)', tfds_id)
  split_name, shard_id, ex_id = match.groups()
  split_info = builder.info.splits[split_name]
  return sum(split_info.shard_lengths[:int(shard_id)]) + int(ex_id)
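
To make the conversion concrete, here is a minimal standalone sketch of the same arithmetic that does not need a downloaded dataset. The `shard_lengths` list is a made-up stand-in for `builder.info.splits[...].shard_lengths`, and `tfds_id_to_global_index` is a hypothetical name used only for this illustration.

```python
import re

def tfds_id_to_global_index(tfds_id, shard_lengths):
    """Map '<name>-<split>.tfrecord-<shard>-of-<total>__<pos>' to a global index."""
    match = re.match(r'\w+-(\w+)\.\w+-(\d+)-of-\d+__(\d+)', tfds_id)
    _split_name, shard_id, ex_id = match.groups()
    # Global index = number of examples in all earlier shards + position in this shard.
    return sum(shard_lengths[:int(shard_id)]) + int(ex_id)

# Assumption: illustrative lengths for the first three shards only.
shard_lengths = [1251, 1251, 1252]
example_index = tfds_id_to_global_index(
    'imagenet2012-train.tfrecord-00002-of-01024__5', shard_lengths)
print(example_index)  # 1251 + 1251 + 5 = 2507
```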

Determinism when reading

This section explains the determinism guarantees of tfds.load.

With `shuffle_files=False` (default)

By default (shuffle_files=False), TFDS yields examples deterministically:

# Same as: imagenet.as_dataset(split='train').take(20)
print_ex_ids(imagenet, split='train', take=20)
print_ex_ids(imagenet, split='train', take=20)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1251, 1252, 1253, 1254]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1251, 1252, 1253, 1254]

For performance, TFDS reads multiple shards at the same time using tf.data.Dataset.interleave. In this example, TFDS switches to the second shard after reading 16 examples (..., 14, 15, 1251, 1252, ...). More on interleave below.

Similarly, the subsplit API is also deterministic:

print_ex_ids(imagenet, split='train[67%:84%]', take=20)
print_ex_ids(imagenet, split='train[67%:84%]', take=20)

[858382, 858383, 858384, 858385, 858386, 858387, 858388, 858389, 858390, 858391, 858392, 858393, 858394, 858395, 858396, 858397, 859533, 859534, 859535, 859536]
[858382, 858383, 858384, 858385, 858386, 858387, 858388, 858389, 858390, 858391, 858392, 858393, 858394, 858395, 858396, 858397, 859533, 859534, 859535, 859536]

If you are training for more than one epoch, the above setup is not recommended, because every epoch reads the shards in the same order (so randomness is limited to the buffer size of ds = ds.shuffle(buffer)).

With `shuffle_files=True`

With shuffle_files=True, shards are shuffled for each epoch, so reading is no longer deterministic.

print_ex_ids(imagenet, split='train', shuffle_files=True, take=20)
print_ex_ids(imagenet, split='train', shuffle_files=True, take=20)

[568017, 329050, 329051, 329052, 329053, 329054, 329056, 329055, 568019, 568020, 568021, 568022, 568023, 568018, 568025, 568024, 568026, 568028, 568030, 568031]
[43790, 43791, 43792, 43793, 43796, 43794, 43797, 43798, 43795, 43799, 43800, 43801, 43802, 43803, 43804, 43805, 43806, 43807, 43809, 43810]

See the recipe below to get deterministic file shuffling.

Determinism caveat: interleave args

Changing read_config.interleave_cycle_length or read_config.interleave_block_length will change the example order.

TFDS relies on tf.data.Dataset.interleave to load only a few shards at once, which improves performance and reduces memory usage.

The example order is only guaranteed to be the same for fixed values of the interleave args. See the interleave doc to understand what cycle_length and block_length correspond to.

cycle_length=16, block_length=16 (default, same as above):

print_ex_ids(imagenet, split='train', take=20)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1251, 1252, 1253, 1254]

cycle_length=3, block_length=2:

read_config = tfds.ReadConfig(
    interleave_cycle_length=3,
    interleave_block_length=2,
)
print_ex_ids(imagenet, split='train', read_config=read_config, take=20)

[0, 1, 1251, 1252, 2502, 2503, 2, 3, 1253, 1254, 2504, 2505, 4, 5, 1255, 1256, 2506, 2507, 6, 7]

In the second example, the dataset reads 2 examples (block_length=2) from a shard, then switches to the next shard. Every 2 * 3 examples (cycle_length=3), it goes back to the first shard (shard0-ex0, shard0-ex1, shard1-ex0, shard1-ex1, shard2-ex0, shard2-ex1, shard0-ex2, shard0-ex3, shard1-ex2, shard1-ex3, shard2-ex2, ...).
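
The round-robin pattern described above can be reproduced with a small pure-Python model. This is only a sketch of the reading order, not how tf.data is implemented: `interleave_order` is a hypothetical helper, every shard is assumed to hold 1251 examples, and the handling of exhausted shards is simplified and may differ from tf.data.

```python
from itertools import islice

def interleave_order(shard_lengths, cycle_length, block_length):
    """Yield global example ids in a simplified round-robin interleave order."""
    # Global id of the first example of each shard.
    offsets, total = [], 0
    for length in shard_lengths:
        offsets.append(total)
        total += length
    # Lazily create one iterator of global ids per shard.
    pending = iter(
        iter(range(offset, offset + length))
        for offset, length in zip(offsets, shard_lengths)
    )
    slots = list(islice(pending, cycle_length))  # Shards currently open.
    while slots:
        next_slots = []
        for shard_iter in slots:
            block = list(islice(shard_iter, block_length))
            yield from block
            if len(block) == block_length:
                next_slots.append(shard_iter)  # Shard may still have examples.
            else:
                # Shard exhausted: open the next unread shard, if any.
                replacement = next(pending, None)
                if replacement is not None:
                    next_slots.append(replacement)
        slots = next_slots

# Assumption: all shards have 1251 examples.
first_20 = list(islice(interleave_order([1251] * 3, 3, 2), 20))
default_20 = list(islice(interleave_order([1251] * 16, 16, 16), 20))
print(first_20)
print(default_20)
```

Under these assumptions the two printed lists match the outputs shown above for cycle_length=3, block_length=2 and for the default settings.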

Subsplit and example order

Each example has an id 0, 1, ..., num_examples-1. The subsplit API selects a slice of examples (e.g. train[:x] selects 0, 1, ..., x-1).

However, within the subsplit, examples are not read in increasing id order (because of shards and interleave).

In particular, ds.take(x) and split='train[:x]' are not equivalent!

This is easy to see in the interleave example above, where examples come from different shards.

print_ex_ids(imagenet, split='train', take=25)  # tfds.load(..., split='train').take(25)
print_ex_ids(imagenet, split='train[:25]', take=-1)  # tfds.load(..., split='train[:25]')

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1251, 1252, 1253, 1254, 1255, 1256, 1257, 1258, 1259]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]

After the first 16 (block_length) examples, .take(25) switches to the next shard, while train[:25] continues reading examples from the first shard.
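
The difference can be worked out by hand. The sketch below assumes the default block_length of 16 and that the first shard holds 1251 examples (so the second shard starts at id 1251); the variable names are illustrative only.

```python
block_length = 16     # Default interleave block length.
shard1_start = 1251   # Assumption: shard 0 holds 1251 examples.
num_taken = 25

# .take(25): one full block from shard 0, then the rest from shard 1.
take_ids = list(range(block_length)) + list(
    range(shard1_start, shard1_start + (num_taken - block_length)))

# train[:25]: the 25 lowest ids, all of which live in shard 0.
subsplit_ids = list(range(num_taken))

# Ids that .take(25) returns but train[:25] does not.
only_in_take = sorted(set(take_ids) - set(subsplit_ids))
print(only_in_take)
```

Both selections contain 25 examples, but 9 of them differ.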

Recipes

Get deterministic file shuffling

There are 2 ways to get deterministic shuffling:

Setting the shuffle_seed. Note: this requires changing the seed at each epoch, otherwise shards will be read in the same order in every epoch.

read_config = tfds.ReadConfig(
    shuffle_seed=32,
)

# Deterministic order, different from the default shuffle_files=False above
print_ex_ids(imagenet, split='train', shuffle_files=True, read_config=read_config, take=22)
print_ex_ids(imagenet, split='train', shuffle_files=True, read_config=read_config, take=22)

[176411, 176412, 176413, 176414, 176415, 176416, 176417, 176418, 176419, 176420, 176421, 176422, 176423, 176424, 176425, 176426, 710647, 710648, 710649, 710650, 710651, 710652]
[176411, 176412, 176413, 176414, 176415, 176416, 176417, 176418, 176419, 176420, 176421, 176422, 176423, 176424, 176425, 176426, 710647, 710648, 710649, 710650, 710651, 710652]
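
One way to vary the seed per epoch while staying reproducible is to derive it from a base seed and the epoch number. The sketch below illustrates the idea with Python's `random` module on a list of shard indices; it is an analogy, not TFDS's internal shuffling, and `shard_order_for_epoch` and `base_seed` are hypothetical names. With TFDS, the derived value would be passed as `tfds.ReadConfig(shuffle_seed=base_seed + epoch)`.

```python
import random

def shard_order_for_epoch(num_shards, base_seed, epoch):
    """Return a shard order that is fixed for a given (base_seed, epoch) pair."""
    rng = random.Random(base_seed + epoch)  # Different seed for each epoch.
    order = list(range(num_shards))
    rng.shuffle(order)
    return order

epoch0 = shard_order_for_epoch(1024, base_seed=32, epoch=0)
epoch0_again = shard_order_for_epoch(1024, base_seed=32, epoch=0)
epoch1 = shard_order_for_epoch(1024, base_seed=32, epoch=1)

print(epoch0 == epoch0_again)  # Same epoch and seed: identical order.
print(epoch0 == epoch1)        # Different epoch: a different order.
```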

Using experimental_interleave_sort_fn: this gives full control over which shards are read and in which order, rather than relying on the ds.shuffle order.

def _reverse_order(file_instructions):
  return list(reversed(file_instructions))

read_config = tfds.ReadConfig(
    experimental_interleave_sort_fn=_reverse_order,
)

# Last shard (01023-of-01024) is read first
print_ex_ids(imagenet, split='train', read_config=read_config, take=5)

[1279916, 1279917, 1279918, 1279919, 1279920]
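
Other orderings follow the same pattern as `_reverse_order`. As a sketch, the function below rotates the list so that reading starts from a chosen shard. It is tested here on a stand-in `FileInstruction` namedtuple (an assumption that mirrors the fields printed by `file_instructions`), and `make_start_at_shard` is a hypothetical helper, not part of TFDS.

```python
from collections import namedtuple

# Assumption: stand-in with the same fields as the printed FileInstruction.
FileInstruction = namedtuple(
    'FileInstruction', ['filename', 'skip', 'take', 'num_examples'])

def make_start_at_shard(first_shard):
    """Build a sort function that reads shards starting from `first_shard`."""
    def _sort_fn(file_instructions):
        instructions = list(file_instructions)
        if not instructions:
            return instructions
        k = first_shard % len(instructions)
        # Rotate: shards k, k+1, ..., then wrap around to 0, ..., k-1.
        return instructions[k:] + instructions[:k]
    return _sort_fn

demo = [
    FileInstruction(f'demo-train.tfrecord-{i:05d}-of-00004', 0, -1, 10)
    for i in range(4)
]
rotated = make_start_at_shard(2)(demo)
print([ins.filename for ins in rotated])
```

The resulting function would be passed to `tfds.ReadConfig(experimental_interleave_sort_fn=...)` in the same way as `_reverse_order` above.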

Get deterministic preemptable pipeline

This one is more complicated. There is no easy, satisfactory solution.

1. Without ds.shuffle and with deterministic shuffling, it should in theory be possible to count the examples which have been read and deduce which examples have been read within each shard (as a function of cycle_length, block_length and shard order). The skip and take for each shard could then be injected through experimental_interleave_sort_fn.
2. With ds.shuffle, this is likely impossible without replaying the full training pipeline. It would require saving the ds.shuffle buffer state to deduce which examples have been read. The examples read could be non-contiguous (e.g. shard5_ex2 and shard5_ex4 read, but not shard5_ex3).
3. With ds.shuffle, one way would be to save all the shard ids / example ids that have been read (deduced from tfds_id), then deduce the file instructions from that.

The simplest case for 1. is to make .skip(x).take(y) match train[x:x+y]. This requires:

Set cycle_length=1 (so shards are read sequentially)
Set shuffle_files=False
Do not use ds.shuffle

It should only be used on huge datasets where the training is only 1 epoch. Examples are read in the default shuffle order.

read_config = tfds.ReadConfig(
    interleave_cycle_length=1,  # Read shards sequentially
)

print_ex_ids(imagenet, split='train', read_config=read_config, skip=40, take=22)
# If the job gets pre-empted, using the subsplit API will skip at most `len(shard0)`
print_ex_ids(imagenet, split='train[40:]', read_config=read_config, take=22)

[40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61]
[40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61]
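
When resuming after a pre-emption, the subsplit string can be built from the number of examples already consumed. The helper below is a minimal sketch: `resume_split` and `examples_seen` are hypothetical names (the counter would come from your own checkpoint), and the result is only meaningful under the three conditions listed above.

```python
def resume_split(split_name, examples_seen):
    """Build a subsplit string that skips the examples already consumed."""
    if examples_seen < 0:
        raise ValueError('examples_seen must be non-negative')
    return f'{split_name}[{examples_seen}:]'

# Assumption: 40 examples were consumed before the job was pre-empted.
split = resume_split('train', 40)
print(split)  # train[40:]
```

The returned string can be passed as `split=` together with a `read_config` that sets `interleave_cycle_length=1`, as in the example above.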

Find which shards/examples are read for a given subsplit

With tfds.core.DatasetInfo, you have direct access to the read instructions.

imagenet.info.splits['train[44%:45%]'].file_instructions

[FileInstruction(filename='imagenet2012-train.tfrecord-00450-of-01024', skip=700, take=-1, num_examples=551),
 FileInstruction(filename='imagenet2012-train.tfrecord-00451-of-01024', skip=0, take=-1, num_examples=1251),
 FileInstruction(filename='imagenet2012-train.tfrecord-00452-of-01024', skip=0, take=-1, num_examples=1251),
 FileInstruction(filename='imagenet2012-train.tfrecord-00453-of-01024', skip=0, take=-1, num_examples=1251),
 FileInstruction(filename='imagenet2012-train.tfrecord-00454-of-01024', skip=0, take=-1, num_examples=1252),
 FileInstruction(filename='imagenet2012-train.tfrecord-00455-of-01024', skip=0, take=-1, num_examples=1251),
 FileInstruction(filename='imagenet2012-train.tfrecord-00456-of-01024', skip=0, take=-1, num_examples=1251),
 FileInstruction(filename='imagenet2012-train.tfrecord-00457-of-01024', skip=0, take=-1, num_examples=1251),
 FileInstruction(filename='imagenet2012-train.tfrecord-00458-of-01024', skip=0, take=-1, num_examples=1251),
 FileInstruction(filename='imagenet2012-train.tfrecord-00459-of-01024', skip=0, take=-1, num_examples=1251),
 FileInstruction(filename='imagenet2012-train.tfrecord-00460-of-01024', skip=0, take=1001, num_examples=1001)]
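
The mapping from a slice of example ids to per-shard instructions can be sketched in plain Python. This is an illustration of the idea, not TFDS's implementation: `slice_to_instructions` is a hypothetical helper, the shard lengths are made up, and it reports an explicit count per shard where TFDS prints take=-1 for "read to the end of the shard".

```python
def slice_to_instructions(shard_lengths, start, stop):
    """Return (shard_index, skip, num_examples) for the global slice [start, stop)."""
    instructions = []
    shard_start = 0
    for shard_index, length in enumerate(shard_lengths):
        shard_stop = shard_start + length
        # Overlap between the requested slice and this shard's id range.
        lo = max(start, shard_start)
        hi = min(stop, shard_stop)
        if lo < hi:
            instructions.append((shard_index, lo - shard_start, hi - lo))
        shard_start = shard_stop
    return instructions

# Assumption: three tiny shards of 4 examples each.
demo_instructions = slice_to_instructions([4, 4, 4], start=3, stop=9)
print(demo_instructions)  # [(0, 3, 1), (1, 0, 4), (2, 0, 1)]
```

As in the output above, only the first and last shards of the slice are read partially; the shards in between are read in full.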