TFRecord and tf.train.Example

The TFRecord format is a simple format for storing a sequence of binary records.

Protocol buffers are a cross-platform, cross-language library for efficient serialization of structured data.

Protocol messages are defined by .proto files, which are often the easiest way to understand a message type.

The tf.train.Example message (or protobuf) is a flexible message type that represents a {"string": value} mapping. It is designed for use with TensorFlow and is used throughout higher-level APIs such as TFX.

This notebook demonstrates how to create, parse, and use the tf.train.Example message, and then serialize, write, and read tf.train.Example messages to and from .tfrecord files.

Note: In general, you should shard your data across multiple files so that you can parallelize I/O (within a single host or across multiple hosts). The rule of thumb is to have at least 10 times as many files as there will be hosts reading data. At the same time, each file should be large enough (at least 10 MB+ and ideally 100 MB+) so that you can benefit from I/O prefetching. For example, say you have X GB of data and you plan to train on up to N hosts. Ideally, you should shard the data into ~10*N files, as long as ~X/(10*N) is 10 MB+ (and ideally 100 MB+). If it is less than that, you might need to create fewer shards to trade off parallelism benefits against I/O prefetching benefits.
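
The rule of thumb in the note can be turned into a small calculation. The following is only a sketch: the helper name plan_shards and its parameters are hypothetical, and the factor of 10 files per host and the 10 MB floor are taken from the note above.

```python
def plan_shards(total_bytes, num_hosts,
                files_per_host=10, min_shard_bytes=10 * 1024**2):
    """Suggest a shard count following the rule of thumb in the note.

    Hypothetical helper: aims for files_per_host * num_hosts shards, but
    suggests fewer when that would push shards below min_shard_bytes.
    """
    target = files_per_host * num_hosts
    # Largest shard count that still keeps every shard at min_shard_bytes or more.
    max_by_size = max(1, total_bytes // min_shard_bytes)
    return min(target, max_by_size)


GB = 1024**3
# 50 GB read by 8 hosts: the full 10 * N shards fit comfortably.
print(plan_shards(50 * GB, num_hosts=8))
# 1 GB read by 32 hosts: 320 shards would be about 3 MB each, so fewer are suggested.
print(plan_shards(1 * GB, num_hosts=32))
```

Raising min_shard_bytes to 100 * 1024**2 applies the stricter "ideally 100 MB+" guidance instead.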

Setup

import tensorflow as tf

import numpy as np
import IPython.display as display

`tf.train.Example`

Data types for `tf.train.Example`

Fundamentally, a tf.train.Example is a {"string": tf.train.Feature} mapping.

The tf.train.Feature message type can accept one of the following three types (see the .proto file for reference). Most other generic types can be coerced into one of these:

tf.train.BytesList (the following types can be coerced)
- string
- byte
tf.train.FloatList (the following types can be coerced)
- float (float32)
- double (float64)
tf.train.Int64List (the following types can be coerced)
- bool
- enum
- int32
- uint32
- int64
- uint64
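
As a rough illustration of the coercion table above, the sketch below maps plain Python values to the name of the list type that would hold them. The function feature_kind is a hypothetical helper written for this example, not part of TensorFlow, and it only covers built-in Python types.

```python
def feature_kind(value):
    """Return the tf.train.Feature list field that fits a plain Python value."""
    # bool is listed next to int because Python's bool is a subclass of int.
    if isinstance(value, (bool, int)):
        return 'int64_list'
    if isinstance(value, float):
        return 'float_list'
    if isinstance(value, (bytes, str)):
        # A str would have to be encoded (for example as UTF-8) before storing.
        return 'bytes_list'
    raise TypeError(f'No tf.train.Feature list type for {type(value).__name__}')


for sample in (True, 7, 2.5, b'cat', 'dog'):
    print(repr(sample), '->', feature_kind(sample))
```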

To convert a standard TensorFlow type to a tf.train.Example-compatible tf.train.Feature, you can use the shortcut functions below. Note that each function takes a scalar input value and returns a tf.train.Feature containing one of the three list types above:

# The following functions can be used to convert a value to a type compatible
# with tf.train.Example.

def _bytes_feature(value):
  """Returns a bytes_list from a string / byte."""
  if isinstance(value, type(tf.constant(0))):
    value = value.numpy() # BytesList won't unpack a string from an EagerTensor.
  return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
  """Returns a float_list from a float / double."""
  return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
  """Returns an int64_list from a bool / enum / int / uint."""
  return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

Below are some examples of how these functions work. Note the varying input types and the standardized output types. If the input type for a function does not match one of the coercible types stated above, the function will raise an exception (e.g. _int64_feature(1.0) will error out because 1.0 is a float, so it should be used with the _float_feature function instead):

print(_bytes_feature(b'test_string'))
print(_bytes_feature(u'test_bytes'.encode('utf-8')))

print(_float_feature(np.exp(1)))

print(_int64_feature(True))
print(_int64_feature(1))

bytes_list {
  value: "test_string"
}

bytes_list {
  value: "test_bytes"
}

float_list {
  value: 2.7182817459106445
}

int64_list {
  value: 1
}

int64_list {
  value: 1
}

All proto messages can be serialized to a binary string using the .SerializeToString method:

feature = _float_feature(np.exp(1))

feature.SerializeToString()

b'\x12\x06\n\x04T\xf8-@'
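
The serialized bytes can be read by hand, which also explains why the float printed earlier is 2.7182817459106445 rather than the full double-precision value of e: a FloatList stores 32-bit floats. The sketch below uses only the Python standard library. The field numbers named in the comments (float_list as field 2 of Feature, value as field 1 of FloatList) are taken from the feature.proto definition and are worth double-checking there.

```python
import math
import struct

serialized = b'\x12\x06\n\x04T\xf8-@'  # Output of feature.SerializeToString() above.

# Each tag byte encodes (field_number << 3) | wire_type; wire type 2 is length-delimited.
outer_tag, outer_len = serialized[0], serialized[1]
inner_tag, inner_len = serialized[2], serialized[3]
print(outer_tag >> 3, outer_tag & 0x7, outer_len)  # Feature.float_list header
print(inner_tag >> 3, inner_tag & 0x7, inner_len)  # FloatList.value header (packed)

# The last four bytes are a little-endian IEEE 754 float32.
(value,) = struct.unpack('<f', serialized[4:])
print(value)

# Packing e as a float32 reproduces the same four payload bytes.
print(struct.pack('<f', math.e) == serialized[4:])
```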

Creating a `tf.train.Example` message

Suppose you want to create a tf.train.Example message from existing data. In practice, the dataset may come from anywhere, but the procedure for creating the tf.train.Example message from a single observation will be the same:

1. Within each observation, each value needs to be converted to a tf.train.Feature containing one of the 3 compatible types, using one of the functions above.
2. You create a map (dictionary) from the feature name string to the encoded feature value produced in step 1.
3. The map produced in step 2 is converted to a Features message.

In this notebook, you will create a dataset using NumPy.

This dataset will have 4 features:

- a boolean feature, False or True with equal probability
- an integer feature chosen uniformly at random from [0, 5), that is, 0 through 4
- a string feature generated from a string table by using the integer feature as an index
- a float feature from a standard normal distribution

Consider a sample consisting of 10,000 independently and identically distributed observations from each of the above distributions:

# The number of observations in the dataset.
n_observations = int(1e4)

# Boolean feature, encoded as False or True.
feature0 = np.random.choice([False, True], n_observations)

# Integer feature, random from 0 to 4.
feature1 = np.random.randint(0, 5, n_observations)

# String feature.
strings = np.array([b'cat', b'dog', b'chicken', b'horse', b'goat'])
feature2 = strings[feature1]

# Float feature, from a standard normal distribution.
feature3 = np.random.randn(n_observations)

Each of these features can be coerced into a tf.train.Example-compatible type using one of _bytes_feature, _float_feature, _int64_feature. You can then create a tf.train.Example message from these encoded features:

def serialize_example(feature0, feature1, feature2, feature3):
  """
  Creates a tf.train.Example message ready to be written to a file.
  """
  # Create a dictionary mapping the feature name to the tf.train.Example-compatible
  # data type.
  feature = {
      'feature0': _int64_feature(feature0),
      'feature1': _int64_feature(feature1),
      'feature2': _bytes_feature(feature2),
      'feature3': _float_feature(feature3),
  }

  # Create a Features message using tf.train.Example.

  example_proto = tf.train.Example(features=tf.train.Features(feature=feature))
  return example_proto.SerializeToString()

For example, suppose you have a single observation from the dataset, [False, 4, b'goat', 0.9876]. You can create and print the tf.train.Example message for this observation using the serialize_example() function defined above. Each single observation will be written as a Features message as described above. Note that the tf.train.Example message is just a wrapper around the Features message:

# This is an example observation from the dataset.

example_observation = []

serialized_example = serialize_example(False, 4, b'goat', 0.9876)
serialized_example

b'\nR\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x00\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04[\xd3|?'

To decode the message, use the tf.train.Example.FromString method.

example_proto = tf.train.Example.FromString(serialized_example)
example_proto

features {
  feature {
    key: "feature0"
    value {
      int64_list {
        value: 0
      }
    }
  }
  feature {
    key: "feature1"
    value {
      int64_list {
        value: 4
      }
    }
  }
  feature {
    key: "feature2"
    value {
      bytes_list {
        value: "goat"
      }
    }
  }
  feature {
    key: "feature3"
    value {
      float_list {
        value: 0.9876000285148621
      }
    }
  }
}
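
To see that a tf.train.Example really is a thin wrapper around a map of named features, the serialized bytes from this section can be walked with a few lines of standard-library Python. This is a minimal sketch, not a general protobuf parser: the helper names are made up for this example, and it assumes every field is length-delimited and numeric lists are packed, which holds for the message shown here.

```python
import struct

KIND_NAMES = {1: 'bytes_list', 2: 'float_list', 3: 'int64_list'}  # Feature oneof fields.


def read_varint(buf, pos):
    """Decode one base-128 varint starting at pos; return (value, next_pos)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def read_fields(buf):
    """Yield (field_number, payload) for each length-delimited field in buf."""
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        if tag & 0x7 != 2:
            raise ValueError('this sketch only handles length-delimited fields')
        length, pos = read_varint(buf, pos)
        yield tag >> 3, buf[pos:pos + length]
        pos += length


def decode_values(kind, payload):
    """Decode one `value` field of a BytesList, FloatList, or Int64List."""
    if kind == 'bytes_list':
        return [payload]
    if kind == 'float_list':
        return [v for (v,) in struct.iter_unpack('<f', payload)]
    values, pos = [], 0
    while pos < len(payload):
        v, pos = read_varint(payload, pos)
        values.append(v - (1 << 64) if v >= 1 << 63 else v)  # Two's complement.
    return values


def decode_example(serialized):
    """Return {feature name: (kind, values)} for a serialized tf.train.Example."""
    result = {}
    for _, features in read_fields(serialized):       # Example.features
        for _, entry in read_fields(features):        # Features.feature map entries
            fields = dict(read_fields(entry))         # 1 = key, 2 = value (a Feature)
            for kind_number, typed_list in read_fields(fields[2]):
                kind = KIND_NAMES[kind_number]
                values = []
                for _, payload in read_fields(typed_list):
                    values.extend(decode_values(kind, payload))
                result[fields[1].decode('utf-8')] = (kind, values)
    return result


# The bytes printed for serialize_example(False, 4, b'goat', 0.9876) above.
serialized_example = (
    b'\nR\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat'
    b'\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04'
    b'\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x00'
    b'\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04[\xd3|?')

for name, (kind, values) in sorted(decode_example(serialized_example).items()):
    print(name, kind, values)
```

In real code, tf.train.Example.FromString remains the right tool; the point here is only to show the nesting of Example, Features, and Feature.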

TFRecords format details

A TFRecord file contains a sequence of records. The file can only be read sequentially.

Each record contains a byte string for the data payload, plus the data length, and CRC-32C (32-bit CRC using the Castagnoli polynomial) hashes for integrity checking.

Each record is stored in the following format:

uint64 length
uint32 masked_crc32_of_length
byte   data[length]
uint32 masked_crc32_of_data

The records are concatenated together to produce the file. The CRCs are the CRC-32C checksums described above, and the mask of a CRC is:

masked_crc = ((crc >> 15) | (crc << 17)) + 0xa282ead8ul
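
The record layout and the masking formula can be exercised with a short standard-library sketch. It is meant as an illustration under stated assumptions rather than a replacement for tf.io.TFRecordWriter: the function names are hypothetical, the CRC-32C routine is a plain table-driven implementation written for this example, and the little-endian byte order of the integer fields is an assumption that the layout above does not state.

```python
import struct


def _build_crc32c_table():
    """Build the lookup table for CRC-32C (reflected Castagnoli polynomial)."""
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC32C_TABLE = _build_crc32c_table()


def crc32c(data):
    """Compute the CRC-32C checksum of a bytes object."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC32C_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc(data):
    """Apply the mask from the formula above, truncated to 32 bits."""
    crc = crc32c(data)
    return (((crc >> 15) | (crc << 17)) + 0xA282EAD8) & 0xFFFFFFFF


def frame_record(payload):
    """Wrap one payload as: length, masked CRC of length, data, masked CRC of data."""
    length = struct.pack('<Q', len(payload))
    return (length + struct.pack('<I', masked_crc(length))
            + payload + struct.pack('<I', masked_crc(payload)))


def read_records(blob):
    """Yield payloads from concatenated records, verifying both checksums."""
    pos = 0
    while pos < len(blob):
        length_bytes = blob[pos:pos + 8]
        (length,) = struct.unpack('<Q', length_bytes)
        (length_crc,) = struct.unpack('<I', blob[pos + 8:pos + 12])
        if length_crc != masked_crc(length_bytes):
            raise ValueError('corrupted length field')
        payload = blob[pos + 12:pos + 12 + length]
        (data_crc,) = struct.unpack('<I', blob[pos + 12 + length:pos + 16 + length])
        if data_crc != masked_crc(payload):
            raise ValueError('corrupted payload')
        yield payload
        pos += 16 + length


blob = frame_record(b'goat') + frame_record(b'chicken')
print(len(blob), list(read_records(blob)))
```

Because each record starts with its own length, a reader has to walk the file from the beginning, which is why the file can only be read sequentially.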

TFRecord files using `tf.data`

The tf.data module also provides tools for reading and writing data in TensorFlow.

Writing a TFRecord file

The easiest way to get the data into a dataset is to use the from_tensor_slices method.

Applied to an array, it returns a dataset of scalars:

tf.data.Dataset.from_tensor_slices(feature1)

<TensorSliceDataset element_spec=TensorSpec(shape=(), dtype=tf.int64, name=None)>

Applied to a tuple of arrays, it returns a dataset of tuples:

features_dataset = tf.data.Dataset.from_tensor_slices((feature0, feature1, feature2, feature3))
features_dataset

<TensorSliceDataset element_spec=(TensorSpec(shape=(), dtype=tf.bool, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None), TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.float64, name=None))>

# Use `take(1)` to only pull one example from the dataset.
for f0,f1,f2,f3 in features_dataset.take(1):
  print(f0)
  print(f1)
  print(f2)
  print(f3)

tf.Tensor(False, shape=(), dtype=bool)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(b'goat', shape=(), dtype=string)
tf.Tensor(0.5251196235602504, shape=(), dtype=float64)

Use the tf.data.Dataset.map method to apply a function to each element of a Dataset.

The mapped function must operate in TensorFlow graph mode: it must operate on and return tf.Tensors. A non-tensor function, like serialize_example, can be wrapped with tf.py_function to make it compatible.

Using tf.py_function requires you to specify the shape and type information that is otherwise unavailable:

def tf_serialize_example(f0,f1,f2,f3):
  tf_string = tf.py_function(
    serialize_example,
    (f0, f1, f2, f3),  # Pass these args to the above function.
    tf.string)      # The return type is `tf.string`.
  return tf.reshape(tf_string, ()) # The result is a scalar.

tf_serialize_example(f0, f1, f2, f3)

<tf.Tensor: shape=(), dtype=string, numpy=b'\nR\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04=n\x06?\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x00\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04'>

Apply this function to each element in the dataset:

serialized_features_dataset = features_dataset.map(tf_serialize_example)
serialized_features_dataset

<MapDataset element_spec=TensorSpec(shape=(), dtype=tf.string, name=None)>

def generator():
  for features in features_dataset:
    yield serialize_example(*features)

serialized_features_dataset = tf.data.Dataset.from_generator(
    generator, output_types=tf.string, output_shapes=())

serialized_features_dataset

<FlatMapDataset element_spec=TensorSpec(shape=(), dtype=tf.string, name=None)>

And write them to a TFRecord file:

filename = 'test.tfrecord'
writer = tf.data.experimental.TFRecordWriter(filename)
writer.write(serialized_features_dataset)

WARNING:tensorflow:From /tmp/ipykernel_25215/3575438268.py:2: TFRecordWriter.__init__ (from tensorflow.python.data.experimental.ops.writers) is deprecated and will be removed in a future version.
Instructions for updating:
To write TFRecords to disk, use `tf.io.TFRecordWriter`. To save and load the contents of a dataset, use `tf.data.experimental.save` and `tf.data.experimental.load`

Reading a TFRecord file

You can also read the TFRecord file using the tf.data.TFRecordDataset class.

More information on consuming TFRecord files using tf.data can be found in the tf.data: Build TensorFlow input pipelines guide.

Using TFRecordDatasets can be useful for standardizing input data and optimizing performance.

filenames = [filename]
raw_dataset = tf.data.TFRecordDataset(filenames)
raw_dataset

<TFRecordDatasetV2 element_spec=TensorSpec(shape=(), dtype=tf.string, name=None)>

At this point the dataset contains serialized tf.train.Example messages. When iterated over, it returns these as scalar string tensors.

Use the .take method to show only the first 10 records.

for raw_record in raw_dataset.take(10):
  print(repr(raw_record))

<tf.Tensor: shape=(), dtype=string, numpy=b'\nR\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x00\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04=n\x06?'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nR\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04\x9d\xfa\x98\xbe\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nQ\n\x13\n\x08feature2\x12\x07\n\x05\n\x03dog\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x01\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04a\xc0r?\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nQ\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x00\n\x13\n\x08feature2\x12\x07\n\x05\n\x03cat\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04\x92Q(?'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nR\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04>\xc0\xe5>\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nU\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04I!\xde\xbe\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x02\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x00\n\x17\n\x08feature2\x12\x0b\n\t\n\x07chicken'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nQ\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x00\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04\xe0\x1a\xab\xbf\n\x13\n\x08feature2\x12\x07\n\x05\n\x03cat'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nQ\n\x13\n\x08feature2\x12\x07\n\x05\n\x03cat\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04\x87\xb2\xd7?\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x00'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nR\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04n\xe19>\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nR\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04\x1as\xd9\xbf\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat'>

These tensors can be parsed using the function below. Note that feature_description is necessary here because tf.data.Datasets use graph execution and need this description to build their shape and type signature:

# Create a description of the features.
feature_description = {
    'feature0': tf.io.FixedLenFeature([], tf.int64, default_value=0),
    'feature1': tf.io.FixedLenFeature([], tf.int64, default_value=0),
    'feature2': tf.io.FixedLenFeature([], tf.string, default_value=''),
    'feature3': tf.io.FixedLenFeature([], tf.float32, default_value=0.0),
}

def _parse_function(example_proto):
  # Parse the input `tf.train.Example` proto using the dictionary above.
  return tf.io.parse_single_example(example_proto, feature_description)

Alternatively, use tf.io.parse_example to parse the whole batch at once. Apply the _parse_function defined above to each item in the dataset using the tf.data.Dataset.map method:

parsed_dataset = raw_dataset.map(_parse_function)
parsed_dataset

<MapDataset element_spec={'feature0': TensorSpec(shape=(), dtype=tf.int64, name=None), 'feature1': TensorSpec(shape=(), dtype=tf.int64, name=None), 'feature2': TensorSpec(shape=(), dtype=tf.string, name=None), 'feature3': TensorSpec(shape=(), dtype=tf.float32, name=None)}>

Use eager execution to display the observations in the dataset. There are 10,000 observations in this dataset, but you will only display the first 10. The data is displayed as a dictionary of features. Each item is a tf.Tensor, and the numpy element of this tensor displays the value of the feature:

for parsed_record in parsed_dataset.take(10):
  print(repr(parsed_record))

{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=4>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'goat'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=0.5251196>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=4>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'goat'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=-0.29878703>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'dog'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=0.94824797>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'cat'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=0.65749466>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=4>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'goat'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=0.44873232>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=2>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'chicken'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=-0.4338477>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'cat'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=-1.3367577>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'cat'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=1.6851357>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=4>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'goat'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=0.18152401>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=4>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'goat'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=-1.6988251>}

Here, the tf.io.parse_single_example call inside _parse_function unpacks the tf.train.Example fields into standard tensors.
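
For comparison, parsing a whole batch in one call with tf.io.parse_example might look like the sketch below. It is self-contained and deliberately small: the single feature name 'x' and the helper make_serialized are hypothetical and stand in for the four features used in this notebook.

```python
import tensorflow as tf


def make_serialized(value):
    """Serialize a tf.train.Example holding one int64 feature named 'x'."""
    feature = {'x': tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))}
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()


# A batch is a 1-D string tensor of serialized messages.
batch = tf.constant([make_serialized(v) for v in (1, 2, 3)])

# One call parses every message; each output tensor gains a leading batch dimension.
parsed = tf.io.parse_example(batch, {'x': tf.io.FixedLenFeature([], tf.int64)})
print(parsed['x'].numpy())
```

In a tf.data pipeline the same idea would typically be applied after batching, for example by mapping a function that calls tf.io.parse_example over raw_dataset.batch(...).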

TFRecord files in Python

The tf.io module also contains pure-Python functions for reading and writing TFRecord files.

Writing a TFRecord file

Next, write the 10,000 observations to the file test.tfrecord. Each observation is converted to a tf.train.Example message, then written to file. You can then verify that the file test.tfrecord has been created:

# Write the `tf.train.Example` observations to the file.
with tf.io.TFRecordWriter(filename) as writer:
  for i in range(n_observations):
    example = serialize_example(feature0[i], feature1[i], feature2[i], feature3[i])
    writer.write(example)

!du -sh {filename}

984K    test.tfrecord

Reading a TFRecord file

These serialized tensors can be easily parsed using tf.train.Example.ParseFromString:

filenames = [filename]
raw_dataset = tf.data.TFRecordDataset(filenames)
raw_dataset

<TFRecordDatasetV2 element_spec=TensorSpec(shape=(), dtype=tf.string, name=None)>

for raw_record in raw_dataset.take(1):
  example = tf.train.Example()
  example.ParseFromString(raw_record.numpy())
  print(example)

features {
  feature {
    key: "feature0"
    value {
      int64_list {
        value: 0
      }
    }
  }
  feature {
    key: "feature1"
    value {
      int64_list {
        value: 4
      }
    }
  }
  feature {
    key: "feature2"
    value {
      bytes_list {
        value: "goat"
      }
    }
  }
  feature {
    key: "feature3"
    value {
      float_list {
        value: 0.5251196026802063
      }
    }
  }
}

That returns a tf.train.Example proto, which is awkward to use as is, but it is fundamentally a representation of a:

Dict[str,
     Union[List[float],
           List[int],
           List[str]]]

The following code manually converts the Example to a dictionary of NumPy arrays, without using TensorFlow Ops. Refer to the PROTO file for details.

result = {}
# example.features.feature is the dictionary
for key, feature in example.features.feature.items():
  # The values are the Feature objects which contain a `kind` which contains:
  # one of three fields: bytes_list, float_list, int64_list

  kind = feature.WhichOneof('kind')
  result[key] = np.array(getattr(feature, kind).value)

result

{'feature3': array([0.5251196]),
 'feature1': array([4]),
 'feature0': array([0]),
 'feature2': array([b'goat'], dtype='|S4')}

Walkthrough: Reading and writing image data

This is an end-to-end example of how to read and write image data using TFRecords. Using an image as input data, you will write the data as a TFRecord file, then read the file back and display the image.

This can be useful if, for example, you want to use several models on the same input dataset. Instead of storing the image data raw, it can be preprocessed into the TFRecords format, and that can be used in all further processing and modelling.

First, let's download an image of a cat in the snow and a photo of the Williamsburg Bridge, NYC under construction.

Fetch the images

cat_in_snow  = tf.keras.utils.get_file(
    '320px-Felis_catus-cat_on_snow.jpg',
    'https://storage.googleapis.com/download.tensorflow.org/example_images/320px-Felis_catus-cat_on_snow.jpg')

williamsburg_bridge = tf.keras.utils.get_file(
    '194px-New_East_River_Bridge_from_Brooklyn_det.4a09796u.jpg',
    'https://storage.googleapis.com/download.tensorflow.org/example_images/194px-New_East_River_Bridge_from_Brooklyn_det.4a09796u.jpg')

Downloading data from https://storage.googleapis.com/download.tensorflow.org/example_images/320px-Felis_catus-cat_on_snow.jpg
24576/17858 [=========================================] - 0s 0us/step
32768/17858 [=======================================================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/download.tensorflow.org/example_images/194px-New_East_River_Bridge_from_Brooklyn_det.4a09796u.jpg
16384/15477 [===============================] - 0s 0us/step
24576/15477 [===============================================] - 0s 0us/step

display.display(display.Image(filename=cat_in_snow))
display.display(display.HTML('Image cc-by: <a href="https://commons.wikimedia.org/wiki/File:Felis_catus-cat_on_snow.jpg">Von.grzanka</a>'))

JPEG

display.display(display.Image(filename=williamsburg_bridge))
display.display(display.HTML('<a href="https://commons.wikimedia.org/wiki/File:New_East_River_Bridge_from_Brooklyn_det.4a09796u.jpg">From Wikimedia</a>'))

JPEG

Write the TFRecord file

As before, encode the features as types compatible with tf.train.Example. This stores the raw image string feature, as well as the height, width, depth, and an arbitrary label feature. The latter is used when you write the file to distinguish between the cat image and the bridge image. Use 0 for the cat image and 1 for the bridge image:

image_labels = {
    cat_in_snow : 0,
    williamsburg_bridge : 1,
}

# This is an example, just using the cat image.
image_string = open(cat_in_snow, 'rb').read()

label = image_labels[cat_in_snow]

# Create a dictionary with features that may be relevant.
def image_example(image_string, label):
  image_shape = tf.io.decode_jpeg(image_string).shape

  feature = {
      'height': _int64_feature(image_shape[0]),
      'width': _int64_feature(image_shape[1]),
      'depth': _int64_feature(image_shape[2]),
      'label': _int64_feature(label),
      'image_raw': _bytes_feature(image_string),
  }

  return tf.train.Example(features=tf.train.Features(feature=feature))

for line in str(image_example(image_string, label)).split('\n')[:15]:
  print(line)
print('...')

features {
  feature {
    key: "depth"
    value {
      int64_list {
        value: 3
      }
    }
  }
  feature {
    key: "height"
    value {
      int64_list {
        value: 213
      }
...

Notice that all of the features are now stored in the tf.train.Example message. Next, functionalize the code above and write the example messages to a file named images.tfrecords:

# Write the raw image files to `images.tfrecords`.
# First, process the two images into `tf.train.Example` messages.
# Then, write to a `.tfrecords` file.
record_file = 'images.tfrecords'
with tf.io.TFRecordWriter(record_file) as writer:
  for filename, label in image_labels.items():
    image_string = open(filename, 'rb').read()
    tf_example = image_example(image_string, label)
    writer.write(tf_example.SerializeToString())

!du -sh {record_file}

36K images.tfrecords

Read the TFRecord file

You now have the file images.tfrecords and can iterate over the records in it to read back what you wrote. Given that in this example you will only reproduce the image, the only feature you need is the raw image string. Extract it using the getters described above, namely example.features.feature['image_raw'].bytes_list.value[0]. You can also use the labels to determine which record is the cat and which one is the bridge:

raw_image_dataset = tf.data.TFRecordDataset('images.tfrecords')

# Create a dictionary describing the features.
image_feature_description = {
    'height': tf.io.FixedLenFeature([], tf.int64),
    'width': tf.io.FixedLenFeature([], tf.int64),
    'depth': tf.io.FixedLenFeature([], tf.int64),
    'label': tf.io.FixedLenFeature([], tf.int64),
    'image_raw': tf.io.FixedLenFeature([], tf.string),
}

def _parse_image_function(example_proto):
  # Parse the input tf.train.Example proto using the dictionary above.
  return tf.io.parse_single_example(example_proto, image_feature_description)

parsed_image_dataset = raw_image_dataset.map(_parse_image_function)
parsed_image_dataset

<MapDataset element_spec={'depth': TensorSpec(shape=(), dtype=tf.int64, name=None), 'height': TensorSpec(shape=(), dtype=tf.int64, name=None), 'image_raw': TensorSpec(shape=(), dtype=tf.string, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None), 'width': TensorSpec(shape=(), dtype=tf.int64, name=None)}>

Recover the images from the TFRecord file:

for image_features in parsed_image_dataset:
  image_raw = image_features['image_raw'].numpy()
  display.display(display.Image(data=image_raw))

JPEG
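
The proto getter path mentioned earlier, example.features.feature['image_raw'].bytes_list.value[0], can also be used directly instead of tf.io.parse_single_example. The sketch below is a hedged, self-contained illustration: the placeholder bytes are not a real JPEG, and the file name demo.tfrecords is hypothetical, chosen so that the files written in this walkthrough are left untouched.

```python
import os
import tempfile

import tensorflow as tf

fake_image = b'\xff\xd8not-a-real-jpeg\xff\xd9'  # Placeholder bytes, not a decodable image.
feature = {
    'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[0])),
    'image_raw': tf.train.Feature(bytes_list=tf.train.BytesList(value=[fake_image])),
}
example = tf.train.Example(features=tf.train.Features(feature=feature))

path = os.path.join(tempfile.mkdtemp(), 'demo.tfrecords')  # Hypothetical file name.
with tf.io.TFRecordWriter(path) as writer:
    writer.write(example.SerializeToString())

recovered = []
for raw_record in tf.data.TFRecordDataset(path):
    parsed = tf.train.Example()
    parsed.ParseFromString(raw_record.numpy())
    # Walk the proto: Features map -> Feature -> typed list -> first value.
    label = parsed.features.feature['label'].int64_list.value[0]
    image_raw = parsed.features.feature['image_raw'].bytes_list.value[0]
    recovered.append((label, image_raw))

print(recovered[0][0], len(recovered[0][1]))
```

This route runs eagerly in Python, so it suits inspection and debugging; the tf.io.parse_single_example approach above is the one that composes with tf.data pipelines.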
