TFDS and determinism

This document explains:

The TFDS guarantees on determinism
In which order TFDS reads examples
Various caveats and gotchas

Setup

Datasets

Some context is needed to understand how TFDS reads the data.

During generation, TFDS writes the original data into standardized .tfrecord files. For big datasets, multiple .tfrecord files are created, each containing multiple examples. We call each .tfrecord file a shard.

This guide uses imagenet, which has 1024 shards:

import re
import tensorflow_datasets as tfds

imagenet = tfds.builder('imagenet2012')

num_shards = imagenet.info.splits['train'].num_shards
num_examples = imagenet.info.splits['train'].num_examples
print(f'imagenet has {num_shards} shards ({num_examples} examples)')

imagenet has 1024 shards (1281167 examples)

Finding the dataset example ids

You can skip to the following section if you only want to know about determinism.

Each dataset example is uniquely identified by an id (e.g. 'imagenet2012-train.tfrecord-01023-of-01024__32'). You can recover this id by passing read_config.add_tfds_id = True, which adds a 'tfds_id' key to the dict yielded by the tf.data.Dataset.

In this tutorial, we define a small util that prints the example ids of the dataset (converted to integers to be more human-readable):

def load_dataset(builder, **as_dataset_kwargs):
  """Load the dataset with the tfds_id."""
  read_config = as_dataset_kwargs.pop('read_config', tfds.ReadConfig())
  read_config.add_tfds_id = True  # Set `True` to return the 'tfds_id' key
  return builder.as_dataset(read_config=read_config, **as_dataset_kwargs)

def print_ex_ids(
    builder,
    *,
    take: int,
    skip: int = 0,
    **as_dataset_kwargs,
) -> None:
  """Print the example ids from the given dataset split."""
  ds = load_dataset(builder, **as_dataset_kwargs)
  if skip:
    ds = ds.skip(skip)
  ds = ds.take(take)
  exs = [ex['tfds_id'].numpy().decode('utf-8') for ex in ds]
  exs = [id_to_int(tfds_id, builder=builder) for tfds_id in exs]
  print(exs)

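def id_to_int(tfds_id: str, builder) -> int:
  """Convert the tfds_id to a global integer index (more human-readable)."""
  # The `.` before the file suffix is escaped so it only matches a literal dot.
  match = re.match(r'\w+-(\w+)\.\w+-(\d+)-of-\d+__(\d+)', tfds_id)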
  split_name, shard_id, ex_id = match.groups()
  split_info = builder.info.splits[split_name]
  return sum(split_info.shard_lengths[:int(shard_id)]) + int(ex_id)
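
As a minimal, TFDS-free sketch of the conversion that id_to_int performs, the snippet below parses the id format shown above and maps it to a global index. The names parse_tfds_id and to_global_index are hypothetical helpers, and the shard_lengths list is made up for illustration; in the real utility the lengths come from builder.info.splits[split_name].shard_lengths.

```python
import re

# Hypothetical shard lengths, for illustration only (not the real imagenet values).
shard_lengths = [1251, 1251, 1252]


def parse_tfds_id(tfds_id):
    """Split a tfds_id into (split_name, shard_index, index_within_shard)."""
    match = re.match(r'\w+-(\w+)\.\w+-(\d+)-of-\d+__(\d+)', tfds_id)
    if match is None:
        raise ValueError(f'Unexpected tfds_id format: {tfds_id!r}')
    split_name, shard_id, ex_id = match.groups()
    return split_name, int(shard_id), int(ex_id)


def to_global_index(shard_id, ex_id, lengths):
    """Count the examples in all previous shards, then add the in-shard offset."""
    return sum(lengths[:shard_id]) + ex_id


print(parse_tfds_id('imagenet2012-train.tfrecord-01023-of-01024__32'))
# -> ('train', 1023, 32)
print(to_global_index(2, 5, shard_lengths))
# -> 2507
```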

Determinism when reading

This section explains the determinism guarantees of tfds.load.

With `shuffle_files=False` (default)

By default, TFDS yields examples deterministically (shuffle_files=False):

# Same as: imagenet.as_dataset(split='train').take(20)
print_ex_ids(imagenet, split='train', take=20)
print_ex_ids(imagenet, split='train', take=20)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1251, 1252, 1253, 1254]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1251, 1252, 1253, 1254]

For performance, TFDS reads multiple shards at the same time using tf.data.Dataset.interleave. In this example, TFDS switches to the second shard after reading 16 examples (..., 14, 15, 1251, 1252, ...). More on interleave below.

Similarly, the subsplit API is also deterministic:

print_ex_ids(imagenet, split='train[67%:84%]', take=20)
print_ex_ids(imagenet, split='train[67%:84%]', take=20)

[858382, 858383, 858384, 858385, 858386, 858387, 858388, 858389, 858390, 858391, 858392, 858393, 858394, 858395, 858396, 858397, 859533, 859534, 859535, 859536]
[858382, 858383, 858384, 858385, 858386, 858387, 858388, 858389, 858390, 858391, 858392, 858393, 858394, 858395, 858396, 858397, 859533, 859534, 859535, 859536]

If you're training for more than one epoch, the above setup is not recommended, as all epochs will read the shards in the same order (so randomness is limited to the buffer size of ds = ds.shuffle(buffer)).

With `shuffle_files=True`

With shuffle_files=True, shards are shuffled for each epoch, so reading is no longer deterministic.

print_ex_ids(imagenet, split='train', shuffle_files=True, take=20)
print_ex_ids(imagenet, split='train', shuffle_files=True, take=20)

[568017, 329050, 329051, 329052, 329053, 329054, 329056, 329055, 568019, 568020, 568021, 568022, 568023, 568018, 568025, 568024, 568026, 568028, 568030, 568031]
[43790, 43791, 43792, 43793, 43796, 43794, 43797, 43798, 43795, 43799, 43800, 43801, 43802, 43803, 43804, 43805, 43806, 43807, 43809, 43810]

See the recipe below to get deterministic file shuffling.

Determinism caveat: interleave args

Changing read_config.interleave_cycle_length or read_config.interleave_block_length will change the example order.

TFDS relies on tf.data.Dataset.interleave to load only a few shards at once, improving performance and reducing memory usage.

The example order is only guaranteed to be the same for fixed values of the interleave args. See the interleave doc to understand what cycle_length and block_length correspond to.

cycle_length=16, block_length=16 (default, same as above):

print_ex_ids(imagenet, split='train', take=20)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1251, 1252, 1253, 1254]

cycle_length=3, block_length=2:

read_config = tfds.ReadConfig(
    interleave_cycle_length=3,
    interleave_block_length=2,
)
print_ex_ids(imagenet, split='train', read_config=read_config, take=20)

[0, 1, 1251, 1252, 2502, 2503, 2, 3, 1253, 1254, 2504, 2505, 4, 5, 1255, 1256, 2506, 2507, 6, 7]

In the second example, we see that the dataset reads 2 (block_length=2) examples from a shard, then switches to the next shard. Every 2 * 3 (cycle_length=3) examples, it goes back to the first shard (shard0-ex0, shard0-ex1, shard1-ex0, shard1-ex1, shard2-ex0, shard2-ex1, shard0-ex2, shard0-ex3, shard1-ex2, shard1-ex3, shard2-ex2,...).
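
The round-robin pattern described above can be reproduced with a small pure-Python simulation. This is only a sketch of the ordering, not the tf.data implementation: simulate_interleave is a hypothetical helper, every shard is assumed to hold 1251 examples, and the handling of exhausted shards is simplified and may not match tf.data exactly.

```python
from itertools import islice


def simulate_interleave(shards, cycle_length, block_length):
    """Yield items in the order a deterministic interleave would visit them."""
    pending = iter(shards)  # Shards that have not been opened yet.
    # Open the first `cycle_length` shards.
    slots = [iter(shard) for shard in islice(pending, cycle_length)]
    while slots:
        next_slots = []
        for it in slots:
            # Read up to `block_length` consecutive examples from this shard.
            block = list(islice(it, block_length))
            yield from block
            if len(block) == block_length:
                next_slots.append(it)  # The shard may still have examples.
            else:
                # Shard exhausted: open the next unread shard, if any.
                shard = next(pending, None)
                if shard is not None:
                    next_slots.append(iter(shard))
        slots = next_slots


# Assumption: 4 shards of 1251 examples each, with consecutive global ids.
shards = [range(i * 1251, (i + 1) * 1251) for i in range(4)]
print(list(islice(simulate_interleave(shards, cycle_length=3, block_length=2), 20)))
# -> [0, 1, 1251, 1252, 2502, 2503, 2, 3, 1253, 1254, 2504, 2505, 4, 5, 1255, 1256, 2506, 2507, 6, 7]
```

Under these assumptions the printed ids match the cycle_length=3, block_length=2 output shown above.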

Subsplit and example order

Each example has an id 0, 1, ..., num_examples-1. The subsplit API selects a slice of examples (e.g. train[:x] selects 0, 1, ..., x-1).

However, within the subsplit, examples are not read in increasing id order (due to shards and interleave).

More specifically, ds.take(x) and split='train[:x]' are not equivalent!

This can be seen easily in the interleave example above, where examples come from different shards.

print_ex_ids(imagenet, split='train', take=25)  # tfds.load(..., split='train').take(25)
print_ex_ids(imagenet, split='train[:25]', take=-1)  # tfds.load(..., split='train[:25]')

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1251, 1252, 1253, 1254, 1255, 1256, 1257, 1258, 1259]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]

After 16 (block_length) examples, .take(25) switches to the next shard, while train[:25] continues reading examples from the first shard.

Recipes

Get deterministic file shuffling

There are 2 ways to have deterministic shuffling:

Setting the shuffle_seed. Note: This requires changing the seed at each epoch, otherwise shards will be read in the same order between epochs.

read_config = tfds.ReadConfig(
    shuffle_seed=32,
)

# Deterministic order, different from the default shuffle_files=False above
print_ex_ids(imagenet, split='train', shuffle_files=True, read_config=read_config, take=22)
print_ex_ids(imagenet, split='train', shuffle_files=True, read_config=read_config, take=22)

[176411, 176412, 176413, 176414, 176415, 176416, 176417, 176418, 176419, 176420, 176421, 176422, 176423, 176424, 176425, 176426, 710647, 710648, 710649, 710650, 710651, 710652]
[176411, 176412, 176413, 176414, 176415, 176416, 176417, 176418, 176419, 176420, 176421, 176422, 176423, 176424, 176425, 176426, 710647, 710648, 710649, 710650, 710651, 710652]
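
The note about changing the seed at each epoch can be illustrated without TFDS. The sketch below uses Python's random module as a stand-in for the shard shuffle; it is not how TFDS shuffles internally, and shard_order and base_seed are hypothetical names. With TFDS, the analogous approach would presumably be to build a new tfds.ReadConfig(shuffle_seed=base_seed + epoch) for each epoch.

```python
import random


def shard_order(num_shards, seed):
    """Return a shard order that is fully determined by `seed`."""
    order = list(range(num_shards))
    random.Random(seed).shuffle(order)
    return order


base_seed = 32
num_shards = 1024

for epoch in range(2):
    # Deriving the seed from the epoch keeps each epoch reproducible
    # while still giving every epoch its own shard order.
    order = shard_order(num_shards, seed=base_seed + epoch)
    print(f'epoch {epoch}: first shards read = {order[:5]}')
```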

Using experimental_interleave_sort_fn: This gives full control over which shards are read and in which order, rather than relying on the ds.shuffle order.

def _reverse_order(file_instructions):
  return list(reversed(file_instructions))

read_config = tfds.ReadConfig(
    experimental_interleave_sort_fn=_reverse_order,
)

# Last shard (01023-of-01024) is read first
print_ex_ids(imagenet, split='train', read_config=read_config, take=5)

[1279916, 1279917, 1279918, 1279919, 1279920]

Get deterministic preemptable pipeline

This one is more complicated. There is no easy, satisfactory solution.

1. Without ds.shuffle and with deterministic shuffling, in theory it should be possible to count the examples which have been read and deduce which examples have been read within each shard (as a function of cycle_length, block_length and shard order). Then the skip, take for each shard could be injected through experimental_interleave_sort_fn.
2. With ds.shuffle, it's likely impossible without replaying the full training pipeline. It would require saving the ds.shuffle buffer state to deduce which examples have been read. Examples could be non-contiguous (e.g. shard5_ex2, shard5_ex4 read but not shard5_ex3).
3. With ds.shuffle, one way would be to save all the shard ids / example ids read (deduced from tfds_id), then deduce the file instructions from that.

The simplest case for 1. is to have .skip(x).take(y) match train[x:x+y]. It requires:

Set cycle_length=1 (so shards are read sequentially)
Set shuffle_files=False
Do not use ds.shuffle

It should only be used on huge datasets where training runs for only 1 epoch. Examples would be read in the default shuffle order.

read_config = tfds.ReadConfig(
    interleave_cycle_length=1,  # Read shards sequentially
)

print_ex_ids(imagenet, split='train', read_config=read_config, skip=40, take=22)
# If the job gets pre-empted, using the subsplit API will skip at most `len(shard0)`
print_ex_ids(imagenet, split='train[40:]', read_config=read_config, take=22)

[40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61]
[40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61]
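
For this sequential case (cycle_length=1, no shuffling), the position to resume from follows directly from the number of examples already consumed. The sketch below is a hypothetical helper, not part of TFDS, and the shard lengths are made up; it only shows the bookkeeping that train[x:] does implicitly.

```python
def resume_point(shard_lengths, num_examples_read):
    """Return (shard_index, offset_in_shard) of the next example to read.

    Assumes shards are read one after another, in order, without shuffling.
    """
    remaining = num_examples_read
    for shard_index, length in enumerate(shard_lengths):
        if remaining < length:
            return shard_index, remaining
        remaining -= length
    # Every example has been consumed.
    return len(shard_lengths), 0


# Hypothetical shard lengths, for illustration only.
lengths = [1251, 1251, 1252]
print(resume_point(lengths, 40))    # -> (0, 40)
print(resume_point(lengths, 2507))  # -> (2, 5)
```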

Find which shards/examples are read for a given subsplit

With the tfds.core.DatasetInfo, you have direct access to the read instructions.

imagenet.info.splits['train[44%:45%]'].file_instructions

[FileInstruction(filename='imagenet2012-train.tfrecord-00450-of-01024', skip=700, take=-1, num_examples=551),
 FileInstruction(filename='imagenet2012-train.tfrecord-00451-of-01024', skip=0, take=-1, num_examples=1251),
 FileInstruction(filename='imagenet2012-train.tfrecord-00452-of-01024', skip=0, take=-1, num_examples=1251),
 FileInstruction(filename='imagenet2012-train.tfrecord-00453-of-01024', skip=0, take=-1, num_examples=1251),
 FileInstruction(filename='imagenet2012-train.tfrecord-00454-of-01024', skip=0, take=-1, num_examples=1252),
 FileInstruction(filename='imagenet2012-train.tfrecord-00455-of-01024', skip=0, take=-1, num_examples=1251),
 FileInstruction(filename='imagenet2012-train.tfrecord-00456-of-01024', skip=0, take=-1, num_examples=1251),
 FileInstruction(filename='imagenet2012-train.tfrecord-00457-of-01024', skip=0, take=-1, num_examples=1251),
 FileInstruction(filename='imagenet2012-train.tfrecord-00458-of-01024', skip=0, take=-1, num_examples=1251),
 FileInstruction(filename='imagenet2012-train.tfrecord-00459-of-01024', skip=0, take=-1, num_examples=1251),
 FileInstruction(filename='imagenet2012-train.tfrecord-00460-of-01024', skip=0, take=1001, num_examples=1001)]
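
To make the skip / take fields above more concrete, here is a rough pure-Python sketch of how an absolute slice [start, stop) could be mapped onto per-shard instructions. slice_to_instructions is a hypothetical helper and the shard lengths are invented; it does not model how TFDS rounds percent boundaries, and it returns explicit counts where TFDS uses take=-1 to mean "until the end of the shard".

```python
def slice_to_instructions(shard_lengths, start, stop):
    """Return (shard_index, skip, take) tuples covering examples [start, stop)."""
    instructions = []
    shard_start = 0  # Global id of the first example in the current shard.
    for shard_index, length in enumerate(shard_lengths):
        shard_stop = shard_start + length
        # Overlap between the requested slice and this shard.
        lo = max(start, shard_start)
        hi = min(stop, shard_stop)
        if lo < hi:
            instructions.append((shard_index, lo - shard_start, hi - lo))
        shard_start = shard_stop
    return instructions


# Three hypothetical shards of 4 examples each; select examples 2..8.
print(slice_to_instructions([4, 4, 4], start=2, stop=9))
# -> [(0, 2, 2), (1, 0, 4), (2, 0, 1)]
```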