TFRecord and tf.train.Example

The TFRecord format is a simple format for storing a sequence of binary records.

Protocol buffers are a cross-platform, cross-language library for efficient serialization of structured data.

Protocol messages are defined by .proto files; reading these is often the easiest way to understand a message type.

The tf.train.Example message (or protobuf) is a flexible message type that represents a {"string": value} mapping. It is designed for use with TensorFlow and is used throughout the higher-level APIs such as TFX.

This notebook demonstrates how to create, parse, and use the tf.train.Example message, and then serialize, write, and read tf.train.Example messages to and from .tfrecord files.

Note: In general, you should shard your data across multiple files so that you can parallelize I/O (within a single host or across multiple hosts). The rule of thumb is to have at least 10 times as many files as there will be hosts reading data. At the same time, each file should be large enough (at least 10 MB+ and ideally 100 MB+) so that you can benefit from I/O prefetching. For example, say you have X GB of data and you plan to train on up to N hosts. Ideally, you should shard the data into ~10*N files, as long as ~X/(10*N) is 10 MB+ (and ideally 100 MB+). If it is less than that, you might need to create fewer shards to trade off parallelism benefits against I/O prefetching benefits.
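The sizing rule in the note can be turned into a small calculation. The sketch below is only an illustration: the helper name `suggest_num_shards` and its parameters are hypothetical and not part of TensorFlow, and the factor of 10 and the 100 MB floor are simply the rule-of-thumb values quoted above.

```python
def suggest_num_shards(total_bytes, num_hosts,
                       files_per_host=10,
                       min_shard_bytes=100 * 1024**2):
    """Suggest a shard count from the rule of thumb in the note above.

    Hypothetical helper: aim for ~10 files per reading host, but never let
    a shard drop below the minimum size.
    """
    target = files_per_host * num_hosts
    # Largest shard count that still keeps every shard at or above the minimum.
    max_by_size = max(1, total_bytes // min_shard_bytes)
    return min(target, max_by_size)


# 50 GiB read by 4 hosts: 40 shards of 1.25 GiB each.
print(suggest_num_shards(50 * 1024**3, num_hosts=4))
# 1 GiB read by 4 hosts: 40 shards would be ~25 MiB each, so use fewer shards.
print(suggest_num_shards(1 * 1024**3, num_hosts=4))
```

The second call shows the trade-off described in the note: with a small dataset, the size floor wins and the shard count drops below 10 per host.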

Setup

import tensorflow as tf

import numpy as np
import IPython.display as display

`tf.train.Example`

Data types for `tf.train.Example`

Fundamentally, a tf.train.Example is a {"string": tf.train.Feature} mapping.

The tf.train.Feature message type can accept one of the following three types (see the .proto file for reference). Most other generic types can be coerced into one of these:

tf.train.BytesList (the following types can be coerced)
- string
- byte
tf.train.FloatList (the following types can be coerced)
- float (float32)
- double (float64)
tf.train.Int64List (the following types can be coerced)
- bool
- enum
- int32
- uint32
- int64
- uint64
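As a rough illustration of the table above, the sketch below maps plain Python scalars to the name of the matching list field. The function `feature_kind` is a hypothetical helper written for this example; it does not call TensorFlow and only mirrors the coercion table for built-in Python types.

```python
def feature_kind(value):
    """Return the name of the tf.train.Feature list field that fits `value`.

    Hypothetical helper for illustration only; it mirrors the table above
    for plain Python scalars and does not touch TensorFlow.
    """
    # A str would still need to be encoded to bytes before being stored.
    if isinstance(value, (bytes, str)):
        return 'bytes_list'
    # bool is a subclass of int in Python, so this branch covers both.
    if isinstance(value, int):
        return 'int64_list'
    if isinstance(value, float):
        return 'float_list'
    raise TypeError(f'no tf.train.Feature list type for {type(value).__name__}')


for v in (b'cat', True, 7, 2.5):
    print(repr(v), '->', feature_kind(v))
```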

To convert a standard TensorFlow type to a tf.train.Example-compatible tf.train.Feature, you can use the shortcut functions below. Note that each function takes a scalar input value and returns a tf.train.Feature containing one of the three list types above:

# The following functions can be used to convert a value to a type compatible
# with tf.train.Example.

def _bytes_feature(value):
  """Returns a bytes_list from a string / byte."""
  if isinstance(value, type(tf.constant(0))):
    value = value.numpy() # BytesList won't unpack a string from an EagerTensor.
  return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
  """Returns a float_list from a float / double."""
  return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
  """Returns an int64_list from a bool / enum / int / uint."""
  return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

Below are some examples of how these functions work. Note the varying input types and the standardized output types. If the input type for a function does not match one of the coercible types stated above, the function raises an exception (for example, _int64_feature(1.0) fails because 1.0 is a float, so it should be passed to _float_feature instead):

print(_bytes_feature(b'test_string'))
print(_bytes_feature(u'test_bytes'.encode('utf-8')))

print(_float_feature(np.exp(1)))

print(_int64_feature(True))
print(_int64_feature(1))

bytes_list {
  value: "test_string"
}

bytes_list {
  value: "test_bytes"
}

float_list {
  value: 2.7182817459106445
}

int64_list {
  value: 1
}

int64_list {
  value: 1
}

All proto messages can be serialized to a binary string using the .SerializeToString method:

feature = _float_feature(np.exp(1))

feature.SerializeToString()

b'\x12\x06\n\x04T\xf8-@'
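The serialized bytes above can be unpacked by hand to see what the protobuf encoding contains. This is a minimal sketch using only the standard library; it assumes every length fits in a single byte (true for this short message) and hard-codes the byte string printed above rather than calling TensorFlow.

```python
import struct

serialized = b'\x12\x06\n\x04T\xf8-@'

# Byte 0 is a field tag: 0x12 = (field number 2 << 3) | wire type 2 (length-delimited).
field_number, wire_type = serialized[0] >> 3, serialized[0] & 0x07
# Byte 1 is the length of the embedded FloatList message (6 bytes).
float_list = serialized[2:2 + serialized[1]]
# Inside FloatList: tag 0x0a (field 1, length-delimited), then a 4-byte packed payload.
payload = float_list[2:2 + float_list[1]]
# Packed floats are little-endian IEEE 754 single precision.
(value,) = struct.unpack('<f', payload)

print(field_number, wire_type, value)
```

Field number 2 corresponds to float_list in the Feature message, and the decoded value is np.exp(1) rounded to float32, which is why the printed feature earlier shows 2.7182817459106445 rather than the full double-precision value.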

Creating a `tf.train.Example` message

Suppose you want to create a tf.train.Example message from existing data. In practice, the dataset may come from anywhere, but the procedure for creating the tf.train.Example message from a single observation is the same:

1. Within each observation, each value needs to be converted to a tf.train.Feature containing one of the 3 compatible types, using one of the functions above.
2. You create a map (dictionary) from the feature name string to the encoded feature value produced in step 1.
3. The map produced in step 2 is converted to a Features message.

In this notebook, you will create a dataset using NumPy.

This dataset will have 4 features:

- a boolean feature, False or True with equal probability
- an integer feature chosen uniformly at random from [0, 5), that is, 0 through 4
- a string feature generated from a string table by using the integer feature as an index
- a float feature from a standard normal distribution

Consider a sample consisting of 10,000 independently and identically distributed observations from each of the above distributions:

# The number of observations in the dataset.
n_observations = int(1e4)

# Boolean feature, encoded as False or True.
feature0 = np.random.choice([False, True], n_observations)

# Integer feature, random from 0 to 4.
feature1 = np.random.randint(0, 5, n_observations)

# String feature.
strings = np.array([b'cat', b'dog', b'chicken', b'horse', b'goat'])
feature2 = strings[feature1]

# Float feature, from a standard normal distribution.
feature3 = np.random.randn(n_observations)

Each of these features can be coerced into a tf.train.Example-compatible type using one of _bytes_feature, _float_feature, or _int64_feature. You can then create a tf.train.Example message from these encoded features:

def serialize_example(feature0, feature1, feature2, feature3):
  """
  Creates a tf.train.Example message ready to be written to a file.
  """
  # Create a dictionary mapping the feature name to the tf.train.Example-compatible
  # data type.
  feature = {
      'feature0': _int64_feature(feature0),
      'feature1': _int64_feature(feature1),
      'feature2': _bytes_feature(feature2),
      'feature3': _float_feature(feature3),
  }

  # Create a Features message using tf.train.Example.

  example_proto = tf.train.Example(features=tf.train.Features(feature=feature))
  return example_proto.SerializeToString()

For example, suppose you have a single observation from the dataset, [False, 4, b'goat', 0.9876]. You can create and print the tf.train.Example message for this observation using serialize_example(). Each single observation is written as a Features message as described above. Note that the tf.train.Example message is just a wrapper around the Features message:

# This is an example observation from the dataset.

serialized_example = serialize_example(False, 4, b'goat', 0.9876)
serialized_example

b'\nR\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x00\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04[\xd3|?'

To decode the message, use the tf.train.Example.FromString method.

example_proto = tf.train.Example.FromString(serialized_example)
example_proto

features {
  feature {
    key: "feature0"
    value {
      int64_list {
        value: 0
      }
    }
  }
  feature {
    key: "feature1"
    value {
      int64_list {
        value: 4
      }
    }
  }
  feature {
    key: "feature2"
    value {
      bytes_list {
        value: "goat"
      }
    }
  }
  feature {
    key: "feature3"
    value {
      float_list {
        value: 0.9876000285148621
      }
    }
  }
}
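To make the "wrapper around Features" structure concrete, the serialized bytes shown earlier can also be walked without TensorFlow. The sketch below is a deliberately small protobuf reader: `read_varint` and `length_delimited_fields` are hypothetical helpers written for this example, and they only handle length-delimited fields, which is all that is needed to reach the feature names.

```python
def read_varint(buf, pos):
    """Decode a protobuf varint starting at buf[pos]; return (value, next_pos)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def length_delimited_fields(buf):
    """Yield (field_number, payload) pairs, assuming every field is length-delimited."""
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        assert tag & 0x07 == 2, 'this sketch only handles wire type 2'
        length, pos = read_varint(buf, pos)
        yield tag >> 3, buf[pos:pos + length]
        pos += length


# The serialized example printed above, split across lines for readability.
serialized_example = (
    b'\nR\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat'
    b'\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04'
    b'\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x00'
    b'\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04[\xd3|?')

# Example has a single field 1 holding the Features message ...
(_, features), = length_delimited_fields(serialized_example)
# ... whose repeated field 1 is a map entry: key (field 1) and Feature (field 2).
keys = []
for _, entry in length_delimited_fields(features):
    parts = dict(length_delimited_fields(entry))
    keys.append(parts[1].decode('utf-8'))

print(keys)
```

The keys come out in the order they were written to the byte string, which is not sorted; map entry order in a serialized protobuf should not be relied on.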

TFRecords format details

A TFRecord file contains a sequence of records. The file can only be read sequentially.

Each record contains a byte string for the data payload, plus the data length, and CRC-32C (32-bit CRC using the Castagnoli polynomial) hashes for integrity checking.

Each record is stored in the following format:

uint64 length
uint32 masked_crc32_of_length
byte   data[length]
uint32 masked_crc32_of_data
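The layout above is simple enough to read with the standard library. The following sketch iterates over the payloads in a bytes object laid out this way; `iter_tfrecords` and `fake_record` are hypothetical names used only here. It assumes little-endian integers and skips both CRC fields instead of verifying them, so it is not a substitute for TensorFlow's reader.

```python
import struct


def iter_tfrecords(buf):
    """Yield each data payload from a bytes object in TFRecord layout.

    Sketch only: both CRC fields are skipped rather than verified, and
    little-endian integers are assumed.
    """
    pos = 0
    while pos < len(buf):
        (length,) = struct.unpack_from('<Q', buf, pos)  # uint64 length
        pos += 8 + 4                                    # skip masked_crc32_of_length
        yield buf[pos:pos + length]                     # byte data[length]
        pos += length + 4                               # skip masked_crc32_of_data


def fake_record(data):
    # Builds a record with zeroed CRC fields, just to exercise the reader.
    return struct.pack('<Q', len(data)) + b'\x00' * 4 + data + b'\x00' * 4


blob = fake_record(b'first') + fake_record(b'second record')
print(list(iter_tfrecords(blob)))
```

Each record carries 16 bytes of framing on top of its payload. The zeroed CRCs here are placeholders; a real reader that checks integrity would reject them.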

The records are concatenated together to produce the file. The CRCs are the CRC-32C hashes described above, and the mask of a CRC is:

masked_crc = ((crc >> 15) | (crc << 17)) + 0xa282ead8ul
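The mask is a 32-bit rotate followed by an addition that wraps around, which needs explicit masking in Python because its integers are unbounded. The sketch below pairs it with a slow, dependency-free CRC-32C; the function names are chosen for this example and are not TensorFlow APIs.

```python
def crc32c(data):
    """Bitwise CRC-32C (Castagnoli); slow but dependency-free."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # 0x82F63B78 is the Castagnoli polynomial in reflected form.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def masked_crc32c(data):
    """Apply the rotate-and-add mask from the formula above, in 32-bit arithmetic."""
    crc = crc32c(data)
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + 0xA282EAD8) & 0xFFFFFFFF


def unmask(masked):
    """Invert the mask to recover the raw CRC."""
    rotated = (masked - 0xA282EAD8) & 0xFFFFFFFF
    return ((rotated >> 17) | (rotated << 15)) & 0xFFFFFFFF


print(hex(crc32c(b'123456789')))  # standard CRC-32C check value: 0xe3069283
print(unmask(masked_crc32c(b'goat')) == crc32c(b'goat'))
```

For real workloads a table-driven or hardware-accelerated CRC-32C would be preferable; the bitwise loop is used here only to keep the example self-contained.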

Note: There is no requirement to use tf.train.Example in TFRecord files. tf.train.Example is just a method of serializing dictionaries to byte strings. Any byte string that can be decoded in TensorFlow can be stored in a TFRecord file. Examples include lines of text, JSON (using tf.io.decode_json_example), encoded image data, or serialized tf.Tensors (using tf.io.serialize_tensor / tf.io.parse_tensor). See the tf.io module for more options.

TFRecord files using `tf.data`

The tf.data module also provides tools for reading and writing data in TensorFlow.

Writing a TFRecord file

The easiest way to get the data into a dataset is to use the from_tensor_slices method.

Applied to an array, it returns a dataset of scalars:

tf.data.Dataset.from_tensor_slices(feature1)

<TensorSliceDataset element_spec=TensorSpec(shape=(), dtype=tf.int64, name=None)>

Applied to a tuple of arrays, it returns a dataset of tuples:

features_dataset = tf.data.Dataset.from_tensor_slices((feature0, feature1, feature2, feature3))
features_dataset

<TensorSliceDataset element_spec=(TensorSpec(shape=(), dtype=tf.bool, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None), TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.float64, name=None))>

# Use `take(1)` to only pull one example from the dataset.
for f0,f1,f2,f3 in features_dataset.take(1):
  print(f0)
  print(f1)
  print(f2)
  print(f3)

tf.Tensor(False, shape=(), dtype=bool)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(b'goat', shape=(), dtype=string)
tf.Tensor(0.5251196235602504, shape=(), dtype=float64)

Use the tf.data.Dataset.map method to apply a function to each element of a Dataset.

The mapped function must operate in TensorFlow graph mode: it must operate on and return tf.Tensors. A non-tensor function, like serialize_example, can be wrapped with tf.py_function to make it compatible.

Using tf.py_function requires you to specify the shape and type information that is otherwise unavailable:

def tf_serialize_example(f0,f1,f2,f3):
  tf_string = tf.py_function(
    serialize_example,
    (f0, f1, f2, f3),  # Pass these args to the above function.
    tf.string)      # The return type is `tf.string`.
  return tf.reshape(tf_string, ()) # The result is a scalar.

tf_serialize_example(f0, f1, f2, f3)

<tf.Tensor: shape=(), dtype=string, numpy=b'\nR\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04=n\x06?\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x00\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04'>

Apply this function to each element in the dataset:

serialized_features_dataset = features_dataset.map(tf_serialize_example)
serialized_features_dataset

<MapDataset element_spec=TensorSpec(shape=(), dtype=tf.string, name=None)>

def generator():
  for features in features_dataset:
    yield serialize_example(*features)

serialized_features_dataset = tf.data.Dataset.from_generator(
    generator, output_types=tf.string, output_shapes=())

serialized_features_dataset

<FlatMapDataset element_spec=TensorSpec(shape=(), dtype=tf.string, name=None)>

And write them to a TFRecord file:

filename = 'test.tfrecord'
writer = tf.data.experimental.TFRecordWriter(filename)
writer.write(serialized_features_dataset)

WARNING:tensorflow:From /tmp/ipykernel_25215/3575438268.py:2: TFRecordWriter.__init__ (from tensorflow.python.data.experimental.ops.writers) is deprecated and will be removed in a future version.
Instructions for updating:
To write TFRecords to disk, use `tf.io.TFRecordWriter`. To save and load the contents of a dataset, use `tf.data.experimental.save` and `tf.data.experimental.load`

Reading a TFRecord file

You can also read the TFRecord file using the tf.data.TFRecordDataset class.

More information on consuming TFRecord files using tf.data can be found in the tf.data: Build TensorFlow input pipelines guide.

Using TFRecordDatasets can be useful for standardizing input data and optimizing performance.

filenames = [filename]
raw_dataset = tf.data.TFRecordDataset(filenames)
raw_dataset

<TFRecordDatasetV2 element_spec=TensorSpec(shape=(), dtype=tf.string, name=None)>

At this point the dataset contains serialized tf.train.Example messages. When iterated over, it returns these as scalar string tensors.

Use the .take method to show only the first 10 records.

for raw_record in raw_dataset.take(10):
  print(repr(raw_record))

<tf.Tensor: shape=(), dtype=string, numpy=b'\nR\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x00\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04=n\x06?'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nR\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04\x9d\xfa\x98\xbe\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nQ\n\x13\n\x08feature2\x12\x07\n\x05\n\x03dog\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x01\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04a\xc0r?\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nQ\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x00\n\x13\n\x08feature2\x12\x07\n\x05\n\x03cat\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04\x92Q(?'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nR\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04>\xc0\xe5>\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nU\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04I!\xde\xbe\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x02\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x00\n\x17\n\x08feature2\x12\x0b\n\t\n\x07chicken'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nQ\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x00\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04\xe0\x1a\xab\xbf\n\x13\n\x08feature2\x12\x07\n\x05\n\x03cat'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nQ\n\x13\n\x08feature2\x12\x07\n\x05\n\x03cat\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04\x87\xb2\xd7?\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x00'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nR\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04n\xe19>\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nR\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04\x1as\xd9\xbf\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat'>

These tensors can be parsed using the function below. Note that the feature_description is necessary here because tf.data.Datasets use graph execution and need this description to build their shape and type signature:

# Create a description of the features.
feature_description = {
    'feature0': tf.io.FixedLenFeature([], tf.int64, default_value=0),
    'feature1': tf.io.FixedLenFeature([], tf.int64, default_value=0),
    'feature2': tf.io.FixedLenFeature([], tf.string, default_value=''),
    'feature3': tf.io.FixedLenFeature([], tf.float32, default_value=0.0),
}

def _parse_function(example_proto):
  # Parse the input `tf.train.Example` proto using the dictionary above.
  return tf.io.parse_single_example(example_proto, feature_description)

Alternatively, use tf.io.parse_example to parse a whole batch at once. Apply this function to each item in the dataset using the tf.data.Dataset.map method:

parsed_dataset = raw_dataset.map(_parse_function)
parsed_dataset

<MapDataset element_spec={'feature0': TensorSpec(shape=(), dtype=tf.int64, name=None), 'feature1': TensorSpec(shape=(), dtype=tf.int64, name=None), 'feature2': TensorSpec(shape=(), dtype=tf.string, name=None), 'feature3': TensorSpec(shape=(), dtype=tf.float32, name=None)}>

Use eager execution to display the observations in the dataset. There are 10,000 observations in this dataset, but you will only display the first 10. The data is displayed as a dictionary of features. Each item is a tf.Tensor, and the numpy element of this tensor displays the value of the feature:

for parsed_record in parsed_dataset.take(10):
  print(repr(parsed_record))

{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=4>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'goat'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=0.5251196>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=4>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'goat'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=-0.29878703>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'dog'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=0.94824797>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'cat'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=0.65749466>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=4>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'goat'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=0.44873232>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=2>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'chicken'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=-0.4338477>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'cat'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=-1.3367577>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'cat'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=1.6851357>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=4>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'goat'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=0.18152401>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=4>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'goat'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=-1.6988251>}

Here, the tf.io.parse_single_example call inside _parse_function unpacks the tf.train.Example fields into standard tensors.

TFRecord files in Python

The tf.io module also contains pure-Python functions for reading and writing TFRecord files.

Writing a TFRecord file

Next, write the 10,000 observations to the file test.tfrecord. Each observation is converted to a tf.train.Example message, then written to file. You can then verify that the file test.tfrecord has been created:

# Write the `tf.train.Example` observations to the file.
with tf.io.TFRecordWriter(filename) as writer:
  for i in range(n_observations):
    example = serialize_example(feature0[i], feature1[i], feature2[i], feature3[i])
    writer.write(example)

!du -sh {filename}

984K    test.tfrecord
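The reported size can be sanity-checked with a back-of-the-envelope estimate. The sketch below assumes the 16 bytes of per-record framing from the format section and a serialized Example of 80 bytes plus the string length, a figure read off the length prefixes in the serialized records shown earlier (for example, the b'goat' record is 84 bytes). It also assumes the five strings are equally likely, so the result is an expectation rather than an exact size.

```python
# Per-record framing: uint64 length + uint32 CRC of length + uint32 CRC of data.
FRAMING_BYTES = 8 + 4 + 4
# Serialized Example size is 80 bytes plus the string length (assumption taken
# from the serialized records printed earlier, e.g. b'goat' -> 84 bytes).
EXAMPLE_BASE_BYTES = 80

strings = [b'cat', b'dog', b'chicken', b'horse', b'goat']
n_observations = 10_000

# Each string is equally likely, so scale the per-string total by n / 5.
per_string_total = sum(FRAMING_BYTES + EXAMPLE_BASE_BYTES + len(s) for s in strings)
expected_bytes = per_string_total * n_observations // len(strings)

print(expected_bytes, 'bytes, about', round(expected_bytes / 1024), 'KiB')
```

The estimate lands close to the figure reported by du, which counts whole filesystem blocks and therefore tends to read slightly higher than the raw byte count.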

Reading a TFRecord file

These serialized tensors can be easily parsed using tf.train.Example.ParseFromString:

filenames = [filename]
raw_dataset = tf.data.TFRecordDataset(filenames)
raw_dataset

<TFRecordDatasetV2 element_spec=TensorSpec(shape=(), dtype=tf.string, name=None)>

for raw_record in raw_dataset.take(1):
  example = tf.train.Example()
  example.ParseFromString(raw_record.numpy())
  print(example)

features {
  feature {
    key: "feature0"
    value {
      int64_list {
        value: 0
      }
    }
  }
  feature {
    key: "feature1"
    value {
      int64_list {
        value: 4
      }
    }
  }
  feature {
    key: "feature2"
    value {
      bytes_list {
        value: "goat"
      }
    }
  }
  feature {
    key: "feature3"
    value {
      float_list {
        value: 0.5251196026802063
      }
    }
  }
}

That returns a tf.train.Example proto, which is difficult to use as is, but it is fundamentally a representation of a:

Dict[str,
     Union[List[float],
           List[int],
           List[str]]]

The following code manually converts the Example to a dictionary of NumPy arrays, without using TensorFlow Ops. Refer to the PROTO file for details.

result = {}
# example.features.feature is the dictionary
for key, feature in example.features.feature.items():
  # The values are the Feature objects which contain a `kind` which contains:
  # one of three fields: bytes_list, float_list, int64_list

  kind = feature.WhichOneof('kind')
  result[key] = np.array(getattr(feature, kind).value)

result

{'feature3': array([0.5251196]),
 'feature1': array([4]),
 'feature0': array([0]),
 'feature2': array([b'goat'], dtype='|S4')}

Walkthrough: Reading and writing image data

This is an end-to-end example of how to read and write image data using TFRecords. Using an image as input data, you will write the data as a TFRecord file, then read the file back and display the image.

This can be useful if, for example, you want to use several models on the same input dataset. Instead of storing the image data raw, it can be preprocessed into the TFRecords format, and that can be used in all further processing and modelling.

First, download this image of a cat in the snow and this photo of the Williamsburg Bridge, NYC, under construction.

Fetch the images

cat_in_snow  = tf.keras.utils.get_file(
    '320px-Felis_catus-cat_on_snow.jpg',
    'https://storage.googleapis.com/download.tensorflow.org/example_images/320px-Felis_catus-cat_on_snow.jpg')

williamsburg_bridge = tf.keras.utils.get_file(
    '194px-New_East_River_Bridge_from_Brooklyn_det.4a09796u.jpg',
    'https://storage.googleapis.com/download.tensorflow.org/example_images/194px-New_East_River_Bridge_from_Brooklyn_det.4a09796u.jpg')

Downloading data from https://storage.googleapis.com/download.tensorflow.org/example_images/320px-Felis_catus-cat_on_snow.jpg
24576/17858 [=========================================] - 0s 0us/step
32768/17858 [=======================================================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/download.tensorflow.org/example_images/194px-New_East_River_Bridge_from_Brooklyn_det.4a09796u.jpg
16384/15477 [===============================] - 0s 0us/step
24576/15477 [===============================================] - 0s 0us/step

display.display(display.Image(filename=cat_in_snow))
display.display(display.HTML('Image cc-by: <a href="https://commons.wikimedia.org/wiki/File:Felis_catus-cat_on_snow.jpg">Von.grzanka</a>'))

(Image output: the cat in the snow)

display.display(display.Image(filename=williamsburg_bridge))
display.display(display.HTML('<a href="https://commons.wikimedia.org/wiki/File:New_East_River_Bridge_from_Brooklyn_det.4a09796u.jpg">From Wikimedia</a>'))

(Image output: the Williamsburg Bridge under construction)

Write the TFRecord file

As before, encode the features as types compatible with tf.train.Example. This stores the raw image string feature, as well as the height, width, depth, and an arbitrary label feature. The label is used when you write the file to distinguish between the cat image and the bridge image. Use 0 for the cat image and 1 for the bridge image:

image_labels = {
    cat_in_snow : 0,
    williamsburg_bridge : 1,
}

# This is an example, just using the cat image.
image_string = open(cat_in_snow, 'rb').read()

label = image_labels[cat_in_snow]

# Create a dictionary with features that may be relevant.
def image_example(image_string, label):
  image_shape = tf.io.decode_jpeg(image_string).shape

  feature = {
      'height': _int64_feature(image_shape[0]),
      'width': _int64_feature(image_shape[1]),
      'depth': _int64_feature(image_shape[2]),
      'label': _int64_feature(label),
      'image_raw': _bytes_feature(image_string),
  }

  return tf.train.Example(features=tf.train.Features(feature=feature))

for line in str(image_example(image_string, label)).split('\n')[:15]:
  print(line)
print('...')

features {
  feature {
    key: "depth"
    value {
      int64_list {
        value: 3
      }
    }
  }
  feature {
    key: "height"
    value {
      int64_list {
        value: 213
      }
...

Notice that all of the features are now stored in the tf.train.Example message. Next, functionalize the code above and write the example messages to a file named images.tfrecords:

# Write the raw image files to `images.tfrecords`.
# First, process the two images into `tf.train.Example` messages.
# Then, write to a `.tfrecords` file.
record_file = 'images.tfrecords'
with tf.io.TFRecordWriter(record_file) as writer:
  for filename, label in image_labels.items():
    image_string = open(filename, 'rb').read()
    tf_example = image_example(image_string, label)
    writer.write(tf_example.SerializeToString())

!du -sh {record_file}

36K images.tfrecords

Read the TFRecord file

You now have the file images.tfrecords and can iterate over the records in it to read back what you wrote. Since this example only reproduces the image, the only feature you need is the raw image string. Extract it using the getters described above, namely example.features.feature['image_raw'].bytes_list.value[0]. You can also use the labels to determine which record is the cat and which one is the bridge:

raw_image_dataset = tf.data.TFRecordDataset('images.tfrecords')

# Create a dictionary describing the features.
image_feature_description = {
    'height': tf.io.FixedLenFeature([], tf.int64),
    'width': tf.io.FixedLenFeature([], tf.int64),
    'depth': tf.io.FixedLenFeature([], tf.int64),
    'label': tf.io.FixedLenFeature([], tf.int64),
    'image_raw': tf.io.FixedLenFeature([], tf.string),
}

def _parse_image_function(example_proto):
  # Parse the input tf.train.Example proto using the dictionary above.
  return tf.io.parse_single_example(example_proto, image_feature_description)

parsed_image_dataset = raw_image_dataset.map(_parse_image_function)
parsed_image_dataset

<MapDataset element_spec={'depth': TensorSpec(shape=(), dtype=tf.int64, name=None), 'height': TensorSpec(shape=(), dtype=tf.int64, name=None), 'image_raw': TensorSpec(shape=(), dtype=tf.string, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None), 'width': TensorSpec(shape=(), dtype=tf.int64, name=None)}>

Recover the images from the TFRecord file:

for image_features in parsed_image_dataset:
  image_raw = image_features['image_raw'].numpy()
  display.display(display.Image(data=image_raw))

(Image output: the images recovered from the TFRecord file)
