TensorFlow Veri Kümeleri

TFDS, TensorFlow, Jax ve diğer Makine Öğrenimi çerçeveleriyle kullanım için kullanıma hazır veri kümeleri koleksiyonu sağlar.

Verileri deterministik olarak indirmeyi ve hazırlamayı ve bir tf.data.Dataset (veya np.array ) oluşturmayı işler.

TensorFlow.org'da görüntüleyin Google Colab'da çalıştırın Kaynağı GitHub'da görüntüleyin Not defterini indir

Kurulum

TFDS iki paket halinde bulunur:

  • pip install tensorflow-datasets : Birkaç ayda bir yayınlanan kararlı sürüm.
  • pip install tfds-nightly : Her gün yayınlanan, veri kümelerinin son sürümlerini içerir.

Bu ortak çalışma tfds-nightly kullanır:

pip install -q tfds-nightly tensorflow matplotlib
tutucu1 l10n-yer
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

import tensorflow_datasets as tfds

Kullanılabilir veri kümelerini bulun

Tüm veri kümesi oluşturucuları, tfds.core.DatasetBuilder alt sınıfıdır. Mevcut inşaatçıların listesini almak için tfds.list_builders() kullanın veya kataloğumuza bakın.

tfds.list_builders()
tutucu3 l10n-yer
['abstract_reasoning',
 'accentdb',
 'aeslc',
 'aflw2k3d',
 'ag_news_subset',
 'ai2_arc',
 'ai2_arc_with_ir',
 'amazon_us_reviews',
 'anli',
 'arc',
 'asset',
 'assin2',
 'bair_robot_pushing_small',
 'bccd',
 'beans',
 'bee_dataset',
 'big_patent',
 'bigearthnet',
 'billsum',
 'binarized_mnist',
 'binary_alpha_digits',
 'blimp',
 'booksum',
 'bool_q',
 'c4',
 'caltech101',
 'caltech_birds2010',
 'caltech_birds2011',
 'cardiotox',
 'cars196',
 'cassava',
 'cats_vs_dogs',
 'celeb_a',
 'celeb_a_hq',
 'cfq',
 'cherry_blossoms',
 'chexpert',
 'cifar10',
 'cifar100',
 'cifar10_1',
 'cifar10_corrupted',
 'citrus_leaves',
 'cityscapes',
 'civil_comments',
 'clevr',
 'clic',
 'clinc_oos',
 'cmaterdb',
 'cnn_dailymail',
 'coco',
 'coco_captions',
 'coil100',
 'colorectal_histology',
 'colorectal_histology_large',
 'common_voice',
 'coqa',
 'cos_e',
 'cosmos_qa',
 'covid19',
 'covid19sum',
 'crema_d',
 'cs_restaurants',
 'curated_breast_imaging_ddsm',
 'cycle_gan',
 'd4rl_adroit_door',
 'd4rl_adroit_hammer',
 'd4rl_adroit_pen',
 'd4rl_adroit_relocate',
 'd4rl_antmaze',
 'd4rl_mujoco_ant',
 'd4rl_mujoco_halfcheetah',
 'd4rl_mujoco_hopper',
 'd4rl_mujoco_walker2d',
 'dart',
 'davis',
 'deep_weeds',
 'definite_pronoun_resolution',
 'dementiabank',
 'diabetic_retinopathy_detection',
 'diamonds',
 'div2k',
 'dmlab',
 'doc_nli',
 'dolphin_number_word',
 'domainnet',
 'downsampled_imagenet',
 'drop',
 'dsprites',
 'dtd',
 'duke_ultrasound',
 'e2e_cleaned',
 'efron_morris75',
 'emnist',
 'eraser_multi_rc',
 'esnli',
 'eurosat',
 'fashion_mnist',
 'flic',
 'flores',
 'food101',
 'forest_fires',
 'fuss',
 'gap',
 'geirhos_conflict_stimuli',
 'gem',
 'genomics_ood',
 'german_credit_numeric',
 'gigaword',
 'glue',
 'goemotions',
 'gov_report',
 'gpt3',
 'gref',
 'groove',
 'grounded_scan',
 'gsm8k',
 'gtzan',
 'gtzan_music_speech',
 'hellaswag',
 'higgs',
 'horses_or_humans',
 'howell',
 'i_naturalist2017',
 'i_naturalist2018',
 'imagenet2012',
 'imagenet2012_corrupted',
 'imagenet2012_multilabel',
 'imagenet2012_real',
 'imagenet2012_subset',
 'imagenet_a',
 'imagenet_lt',
 'imagenet_r',
 'imagenet_resized',
 'imagenet_sketch',
 'imagenet_v2',
 'imagenette',
 'imagewang',
 'imdb_reviews',
 'irc_disentanglement',
 'iris',
 'istella',
 'kddcup99',
 'kitti',
 'kmnist',
 'lambada',
 'lfw',
 'librispeech',
 'librispeech_lm',
 'libritts',
 'ljspeech',
 'lm1b',
 'locomotion',
 'lost_and_found',
 'lsun',
 'lvis',
 'malaria',
 'math_dataset',
 'math_qa',
 'mctaco',
 'mlqa',
 'mnist',
 'mnist_corrupted',
 'movie_lens',
 'movie_rationales',
 'movielens',
 'moving_mnist',
 'mslr_web',
 'multi_news',
 'multi_nli',
 'multi_nli_mismatch',
 'natural_questions',
 'natural_questions_open',
 'newsroom',
 'nsynth',
 'nyu_depth_v2',
 'ogbg_molpcba',
 'omniglot',
 'open_images_challenge2019_detection',
 'open_images_v4',
 'openbookqa',
 'opinion_abstracts',
 'opinosis',
 'opus',
 'oxford_flowers102',
 'oxford_iiit_pet',
 'para_crawl',
 'pass',
 'patch_camelyon',
 'paws_wiki',
 'paws_x_wiki',
 'penguins',
 'pet_finder',
 'pg19',
 'piqa',
 'places365_small',
 'plant_leaves',
 'plant_village',
 'plantae_k',
 'protein_net',
 'qa4mre',
 'qasc',
 'quac',
 'quality',
 'quickdraw_bitmap',
 'race',
 'radon',
 'reddit',
 'reddit_disentanglement',
 'reddit_tifu',
 'ref_coco',
 'resisc45',
 'rlu_atari',
 'rlu_atari_checkpoints',
 'rlu_atari_checkpoints_ordered',
 'rlu_dmlab_explore_object_rewards_few',
 'rlu_dmlab_explore_object_rewards_many',
 'rlu_dmlab_rooms_select_nonmatching_object',
 'rlu_dmlab_rooms_watermaze',
 'rlu_dmlab_seekavoid_arena01',
 'rlu_rwrl',
 'robomimic_ph',
 'robonet',
 'robosuite_panda_pick_place_can',
 'rock_paper_scissors',
 'rock_you',
 's3o4d',
 'salient_span_wikipedia',
 'samsum',
 'savee',
 'scan',
 'scene_parse150',
 'schema_guided_dialogue',
 'scicite',
 'scientific_papers',
 'scrolls',
 'sentiment140',
 'shapes3d',
 'siscore',
 'smallnorb',
 'smartwatch_gestures',
 'snli',
 'so2sat',
 'speech_commands',
 'spoken_digit',
 'squad',
 'squad_question_generation',
 'stanford_dogs',
 'stanford_online_products',
 'star_cfq',
 'starcraft_video',
 'stl10',
 'story_cloze',
 'summscreen',
 'sun397',
 'super_glue',
 'svhn_cropped',
 'symmetric_solids',
 'tao',
 'ted_hrlr_translate',
 'ted_multi_translate',
 'tedlium',
 'tf_flowers',
 'the300w_lp',
 'tiny_shakespeare',
 'titanic',
 'trec',
 'trivia_qa',
 'tydi_qa',
 'uc_merced',
 'ucf101',
 'vctk',
 'visual_domain_decathlon',
 'voc',
 'voxceleb',
 'voxforge',
 'waymo_open_dataset',
 'web_nlg',
 'web_questions',
 'wider_face',
 'wiki40b',
 'wiki_auto',
 'wiki_bio',
 'wiki_table_questions',
 'wiki_table_text',
 'wikiann',
 'wikihow',
 'wikipedia',
 'wikipedia_toxicity_subtypes',
 'wine_quality',
 'winogrande',
 'wit',
 'wit_kaggle',
 'wmt13_translate',
 'wmt14_translate',
 'wmt15_translate',
 'wmt16_translate',
 'wmt17_translate',
 'wmt18_translate',
 'wmt19_translate',
 'wmt_t2t_translate',
 'wmt_translate',
 'wordnet',
 'wsc273',
 'xnli',
 'xquad',
 'xsum',
 'xtreme_pawsx',
 'xtreme_xnli',
 'yelp_polarity_reviews',
 'yes_no',
 'youtube_vis',
 'huggingface:acronym_identification',
 'huggingface:ade_corpus_v2',
 'huggingface:adversarial_qa',
 'huggingface:aeslc',
 'huggingface:afrikaans_ner_corpus',
 'huggingface:ag_news',
 'huggingface:ai2_arc',
 'huggingface:air_dialogue',
 'huggingface:ajgt_twitter_ar',
 'huggingface:allegro_reviews',
 'huggingface:allocine',
 'huggingface:alt',
 'huggingface:amazon_polarity',
 'huggingface:amazon_reviews_multi',
 'huggingface:amazon_us_reviews',
 'huggingface:ambig_qa',
 'huggingface:americas_nli',
 'huggingface:ami',
 'huggingface:amttl',
 'huggingface:anli',
 'huggingface:app_reviews',
 'huggingface:aqua_rat',
 'huggingface:aquamuse',
 'huggingface:ar_cov19',
 'huggingface:ar_res_reviews',
 'huggingface:ar_sarcasm',
 'huggingface:arabic_billion_words',
 'huggingface:arabic_pos_dialect',
 'huggingface:arabic_speech_corpus',
 'huggingface:arcd',
 'huggingface:arsentd_lev',
 'huggingface:art',
 'huggingface:arxiv_dataset',
 'huggingface:ascent_kb',
 'huggingface:aslg_pc12',
 'huggingface:asnq',
 'huggingface:asset',
 'huggingface:assin',
 'huggingface:assin2',
 'huggingface:atomic',
 'huggingface:autshumato',
 'huggingface:babi_qa',
 'huggingface:banking77',
 'huggingface:bbaw_egyptian',
 'huggingface:bbc_hindi_nli',
 'huggingface:bc2gm_corpus',
 'huggingface:beans',
 'huggingface:best2009',
 'huggingface:bianet',
 'huggingface:bible_para',
 'huggingface:big_patent',
 'huggingface:billsum',
 'huggingface:bing_coronavirus_query_set',
 'huggingface:biomrc',
 'huggingface:biosses',
 'huggingface:blbooksgenre',
 'huggingface:blended_skill_talk',
 'huggingface:blimp',
 'huggingface:blog_authorship_corpus',
 'huggingface:bn_hate_speech',
 'huggingface:bookcorpus',
 'huggingface:bookcorpusopen',
 'huggingface:boolq',
 'huggingface:bprec',
 'huggingface:break_data',
 'huggingface:brwac',
 'huggingface:bsd_ja_en',
 'huggingface:bswac',
 'huggingface:c3',
 'huggingface:c4',
 'huggingface:cail2018',
 'huggingface:caner',
 'huggingface:capes',
 'huggingface:casino',
 'huggingface:catalonia_independence',
 'huggingface:cats_vs_dogs',
 'huggingface:cawac',
 'huggingface:cbt',
 'huggingface:cc100',
 'huggingface:cc_news',
 'huggingface:ccaligned_multilingual',
 'huggingface:cdsc',
 'huggingface:cdt',
 'huggingface:cedr',
 'huggingface:cfq',
 'huggingface:chr_en',
 'huggingface:cifar10',
 'huggingface:cifar100',
 'huggingface:circa',
 'huggingface:civil_comments',
 'huggingface:clickbait_news_bg',
 'huggingface:climate_fever',
 'huggingface:clinc_oos',
 'huggingface:clue',
 'huggingface:cmrc2018',
 'huggingface:cmu_hinglish_dog',
 'huggingface:cnn_dailymail',
 'huggingface:coached_conv_pref',
 'huggingface:coarse_discourse',
 'huggingface:codah',
 'huggingface:code_search_net',
 'huggingface:code_x_glue_cc_clone_detection_big_clone_bench',
 'huggingface:code_x_glue_cc_clone_detection_poj104',
 'huggingface:code_x_glue_cc_cloze_testing_all',
 'huggingface:code_x_glue_cc_cloze_testing_maxmin',
 'huggingface:code_x_glue_cc_code_completion_line',
 'huggingface:code_x_glue_cc_code_completion_token',
 'huggingface:code_x_glue_cc_code_refinement',
 'huggingface:code_x_glue_cc_code_to_code_trans',
 'huggingface:code_x_glue_cc_defect_detection',
 'huggingface:code_x_glue_ct_code_to_text',
 'huggingface:code_x_glue_tc_nl_code_search_adv',
 'huggingface:code_x_glue_tc_text_to_code',
 'huggingface:code_x_glue_tt_text_to_text',
 'huggingface:com_qa',
 'huggingface:common_gen',
 'huggingface:common_language',
 'huggingface:common_voice',
 'huggingface:commonsense_qa',
 'huggingface:competition_math',
 'huggingface:compguesswhat',
 'huggingface:conceptnet5',
 'huggingface:conll2000',
 'huggingface:conll2002',
 'huggingface:conll2003',
 'huggingface:conllpp',
 'huggingface:conv_ai',
 'huggingface:conv_ai_2',
 'huggingface:conv_ai_3',
 'huggingface:conv_questions',
 'huggingface:coqa',
 'huggingface:cord19',
 'huggingface:cornell_movie_dialog',
 'huggingface:cos_e',
 'huggingface:cosmos_qa',
 'huggingface:counter',
 'huggingface:covid_qa_castorini',
 'huggingface:covid_qa_deepset',
 'huggingface:covid_qa_ucsd',
 'huggingface:covid_tweets_japanese',
 'huggingface:covost2',
 'huggingface:craigslist_bargains',
 'huggingface:crawl_domain',
 'huggingface:crd3',
 'huggingface:crime_and_punish',
 'huggingface:crows_pairs',
 'huggingface:cryptonite',
 'huggingface:cs_restaurants',
 'huggingface:cuad',
 'huggingface:curiosity_dialogs',
 'huggingface:daily_dialog',
 'huggingface:dane',
 'huggingface:danish_political_comments',
 'huggingface:dart',
 'huggingface:datacommons_factcheck',
 'huggingface:dbpedia_14',
 'huggingface:dbrd',
 'huggingface:deal_or_no_dialog',
 'huggingface:definite_pronoun_resolution',
 'huggingface:dengue_filipino',
 'huggingface:dialog_re',
 'huggingface:diplomacy_detection',
 'huggingface:disaster_response_messages',
 'huggingface:discofuse',
 'huggingface:discovery',
 'huggingface:disfl_qa',
 'huggingface:doc2dial',
 'huggingface:docred',
 'huggingface:doqa',
 'huggingface:dream',
 'huggingface:drop',
 'huggingface:duorc',
 'huggingface:dutch_social',
 'huggingface:dyk',
 'huggingface:e2e_nlg',
 'huggingface:e2e_nlg_cleaned',
 'huggingface:ecb',
 'huggingface:ecthr_cases',
 'huggingface:eduge',
 'huggingface:ehealth_kd',
 'huggingface:eitb_parcc',
 'huggingface:eli5',
 'huggingface:eli5_category',
 'huggingface:emea',
 'huggingface:emo',
 'huggingface:emotion',
 'huggingface:emotone_ar',
 'huggingface:empathetic_dialogues',
 'huggingface:enriched_web_nlg',
 'huggingface:eraser_multi_rc',
 'huggingface:esnli',
 'huggingface:eth_py150_open',
 'huggingface:ethos',
 'huggingface:eu_regulatory_ir',
 'huggingface:eurlex',
 'huggingface:euronews',
 'huggingface:europa_eac_tm',
 'huggingface:europa_ecdc_tm',
 'huggingface:europarl_bilingual',
 'huggingface:event2Mind',
 'huggingface:evidence_infer_treatment',
 'huggingface:exams',
 'huggingface:factckbr',
 'huggingface:fake_news_english',
 'huggingface:fake_news_filipino',
 'huggingface:farsi_news',
 'huggingface:fashion_mnist',
 'huggingface:fever',
 'huggingface:few_rel',
 'huggingface:financial_phrasebank',
 'huggingface:finer',
 'huggingface:flores',
 'huggingface:flue',
 'huggingface:food101',
 'huggingface:fquad',
 'huggingface:freebase_qa',
 'huggingface:gap',
 'huggingface:gem',
 'huggingface:generated_reviews_enth',
 'huggingface:generics_kb',
 'huggingface:german_legal_entity_recognition',
 'huggingface:germaner',
 'huggingface:germeval_14',
 'huggingface:giga_fren',
 'huggingface:gigaword',
 'huggingface:glucose',
 'huggingface:glue',
 'huggingface:gnad10',
 'huggingface:go_emotions',
 'huggingface:gooaq',
 'huggingface:google_wellformed_query',
 'huggingface:grail_qa',
 'huggingface:great_code',
 'huggingface:greek_legal_code',
 'huggingface:guardian_authorship',
 'huggingface:gutenberg_time',
 'huggingface:hans',
 'huggingface:hansards',
 'huggingface:hard',
 'huggingface:harem',
 'huggingface:has_part',
 'huggingface:hate_offensive',
 'huggingface:hate_speech18',
 'huggingface:hate_speech_filipino',
 'huggingface:hate_speech_offensive',
 'huggingface:hate_speech_pl',
 'huggingface:hate_speech_portuguese',
 'huggingface:hatexplain',
 'huggingface:hausa_voa_ner',
 'huggingface:hausa_voa_topics',
 'huggingface:hda_nli_hindi',
 'huggingface:head_qa',
 'huggingface:health_fact',
 'huggingface:hebrew_projectbenyehuda',
 'huggingface:hebrew_sentiment',
 'huggingface:hebrew_this_world',
 'huggingface:hellaswag',
 'huggingface:hendrycks_test',
 'huggingface:hind_encorp',
 'huggingface:hindi_discourse',
 'huggingface:hippocorpus',
 'huggingface:hkcancor',
 'huggingface:hlgd',
 'huggingface:hope_edi',
 'huggingface:hotpot_qa',
 'huggingface:hover',
 'huggingface:hrenwac_para',
 'huggingface:hrwac',
 'huggingface:humicroedit',
 'huggingface:hybrid_qa',
 'huggingface:hyperpartisan_news_detection',
 'huggingface:iapp_wiki_qa_squad',
 'huggingface:id_clickbait',
 'huggingface:id_liputan6',
 'huggingface:id_nergrit_corpus',
 'huggingface:id_newspapers_2018',
 'huggingface:id_panl_bppt',
 'huggingface:id_puisi',
 'huggingface:igbo_english_machine_translation',
 'huggingface:igbo_monolingual',
 'huggingface:igbo_ner',
 'huggingface:ilist',
 'huggingface:imdb',
 'huggingface:imdb_urdu_reviews',
 'huggingface:imppres',
 'huggingface:indic_glue',
 'huggingface:indonli',
 'huggingface:indonlu',
 'huggingface:inquisitive_qg',
 'huggingface:interpress_news_category_tr',
 'huggingface:interpress_news_category_tr_lite',
 'huggingface:irc_disentangle',
 'huggingface:isixhosa_ner_corpus',
 'huggingface:isizulu_ner_corpus',
 'huggingface:iwslt2017',
 'huggingface:jeopardy',
 'huggingface:jfleg',
 'huggingface:jigsaw_toxicity_pred',
 'huggingface:jigsaw_unintended_bias',
 'huggingface:jnlpba',
 'huggingface:journalists_questions',
 'huggingface:kan_hope',
 'huggingface:kannada_news',
 'huggingface:kd_conv',
 'huggingface:kde4',
 'huggingface:kelm',
 'huggingface:kilt_tasks',
 'huggingface:kilt_wikipedia',
 'huggingface:kinnews_kirnews',
 'huggingface:klue',
 'huggingface:kor_3i4k',
 'huggingface:kor_hate',
 'huggingface:kor_ner',
 'huggingface:kor_nli',
 'huggingface:kor_nlu',
 'huggingface:kor_qpair',
 'huggingface:kor_sae',
 'huggingface:kor_sarcasm',
 'huggingface:labr',
 'huggingface:lama',
 'huggingface:lambada',
 'huggingface:large_spanish_corpus',
 'huggingface:laroseda',
 'huggingface:lc_quad',
 'huggingface:lener_br',
 'huggingface:lex_glue',
 'huggingface:liar',
 'huggingface:librispeech_asr',
 'huggingface:librispeech_lm',
 'huggingface:limit',
 'huggingface:lince',
 'huggingface:linnaeus',
 'huggingface:liveqa',
 'huggingface:lj_speech',
 'huggingface:lm1b',
 'huggingface:lst20',
 'huggingface:m_lama',
 'huggingface:mac_morpho',
 'huggingface:makhzan',
 'huggingface:masakhaner',
 'huggingface:math_dataset',
 'huggingface:math_qa',
 'huggingface:matinf',
 'huggingface:mbpp',
 'huggingface:mc4',
 'huggingface:mc_taco',
 'huggingface:md_gender_bias',
 'huggingface:mdd',
 'huggingface:med_hop',
 'huggingface:medal',
 'huggingface:medical_dialog',
 'huggingface:medical_questions_pairs',
 'huggingface:menyo20k_mt',
 'huggingface:meta_woz',
 'huggingface:metooma',
 'huggingface:metrec',
 'huggingface:miam',
 'huggingface:mkb',
 'huggingface:mkqa',
 'huggingface:mlqa',
 'huggingface:mlsum',
 'huggingface:mnist',
 'huggingface:mocha',
 'huggingface:moroco',
 'huggingface:movie_rationales',
 'huggingface:mrqa',
 'huggingface:ms_marco',
 'huggingface:ms_terms',
 'huggingface:msr_genomics_kbcomp',
 'huggingface:msr_sqa',
 'huggingface:msr_text_compression',
 'huggingface:msr_zhen_translation_parity',
 'huggingface:msra_ner',
 'huggingface:mt_eng_vietnamese',
 'huggingface:muchocine',
 'huggingface:multi_booked',
 'huggingface:multi_eurlex',
 'huggingface:multi_news',
 'huggingface:multi_nli',
 'huggingface:multi_nli_mismatch',
 'huggingface:multi_para_crawl',
 'huggingface:multi_re_qa',
 'huggingface:multi_woz_v22',
 'huggingface:multi_x_science_sum',
 'huggingface:multidoc2dial',
 'huggingface:multilingual_librispeech',
 'huggingface:mutual_friends',
 'huggingface:mwsc',
 'huggingface:myanmar_news',
 'huggingface:narrativeqa',
 'huggingface:narrativeqa_manual',
 'huggingface:natural_questions',
 'huggingface:ncbi_disease',
 'huggingface:nchlt',
 'huggingface:ncslgr',
 'huggingface:nell',
 'huggingface:neural_code_search',
 'huggingface:news_commentary',
 'huggingface:newsgroup',
 'huggingface:newsph',
 'huggingface:newsph_nli',
 'huggingface:newspop',
 'huggingface:newsqa',
 'huggingface:newsroom',
 'huggingface:nkjp-ner',
 'huggingface:nli_tr',
 'huggingface:nlu_evaluation_data',
 'huggingface:norec',
 'huggingface:norne',
 'huggingface:norwegian_ner',
 'huggingface:nq_open',
 'huggingface:nsmc',
 'huggingface:numer_sense',
 'huggingface:numeric_fused_head',
 'huggingface:oclar',
 'huggingface:offcombr',
 'huggingface:offenseval2020_tr',
 'huggingface:offenseval_dravidian',
 'huggingface:ofis_publik',
 'huggingface:ohsumed',
 'huggingface:ollie',
 'huggingface:omp',
 'huggingface:onestop_english',
 'huggingface:onestop_qa',
 'huggingface:open_subtitles',
 'huggingface:openai_humaneval',
 'huggingface:openbookqa',
 'huggingface:openslr',
 'huggingface:openwebtext',
 'huggingface:opinosis',
 'huggingface:opus100',
 'huggingface:opus_books',
 'huggingface:opus_dgt',
 'huggingface:opus_dogc',
 'huggingface:opus_elhuyar',
 'huggingface:opus_euconst',
 'huggingface:opus_finlex',
 'huggingface:opus_fiskmo',
 'huggingface:opus_gnome',
 'huggingface:opus_infopankki',
 'huggingface:opus_memat',
 'huggingface:opus_montenegrinsubs',
 'huggingface:opus_openoffice',
 'huggingface:opus_paracrawl',
 'huggingface:opus_rf',
 'huggingface:opus_tedtalks',
 'huggingface:opus_ubuntu',
 'huggingface:opus_wikipedia',
 'huggingface:opus_xhosanavy',
 'huggingface:orange_sum',
 'huggingface:oscar',
 'huggingface:para_crawl',
 'huggingface:para_pat',
 'huggingface:parsinlu_reading_comprehension',
 'huggingface:paws',
 'huggingface:paws-x',
 'huggingface:pec',
 'huggingface:peer_read',
 'huggingface:peoples_daily_ner',
 'huggingface:per_sent',
 'huggingface:persian_ner',
 'huggingface:pg19',
 'huggingface:php',
 'huggingface:piaf',
 'huggingface:pib',
 'huggingface:piqa',
 'huggingface:pn_summary',
 'huggingface:poem_sentiment',
 'huggingface:polemo2',
 'huggingface:poleval2019_cyberbullying',
 'huggingface:poleval2019_mt',
 'huggingface:polsum',
 'huggingface:polyglot_ner',
 'huggingface:prachathai67k',
 'huggingface:pragmeval',
 'huggingface:proto_qa',
 'huggingface:psc',
 'huggingface:ptb_text_only',
 'huggingface:pubmed',
 'huggingface:pubmed_qa',
 'huggingface:py_ast',
 'huggingface:qa4mre',
 'huggingface:qa_srl',
 'huggingface:qa_zre',
 'huggingface:qangaroo',
 'huggingface:qanta',
 'huggingface:qasc',
 'huggingface:qasper',
 'huggingface:qed',
 'huggingface:qed_amara',
 'huggingface:quac',
 'huggingface:quail',
 'huggingface:quarel',
 'huggingface:quartz',
 'huggingface:quora',
 'huggingface:quoref',
 'huggingface:race',
 'huggingface:re_dial',
 'huggingface:reasoning_bg',
 'huggingface:recipe_nlg',
 'huggingface:reclor',
 'huggingface:reddit',
 'huggingface:reddit_tifu',
 'huggingface:refresd',
 'huggingface:reuters21578',
 'huggingface:riddle_sense',
 'huggingface:ro_sent',
 'huggingface:ro_sts',
 'huggingface:ro_sts_parallel',
 'huggingface:roman_urdu',
 'huggingface:ronec',
 'huggingface:ropes',
 'huggingface:rotten_tomatoes',
 'huggingface:russian_super_glue',
 'huggingface:s2orc',
 'huggingface:samsum',
 'huggingface:sanskrit_classic',
 'huggingface:saudinewsnet',
 'huggingface:sberquad',
 'huggingface:scan',
 'huggingface:scb_mt_enth_2020',
 'huggingface:schema_guided_dstc8',
 'huggingface:scicite',
 'huggingface:scielo',
 'huggingface:scientific_papers',
 'huggingface:scifact',
 'huggingface:sciq',
 'huggingface:scitail',
 'huggingface:scitldr',
 'huggingface:search_qa',
 'huggingface:sede',
 'huggingface:selqa',
 'huggingface:sem_eval_2010_task_8',
 'huggingface:sem_eval_2014_task_1',
 'huggingface:sem_eval_2018_task_1',
 'huggingface:sem_eval_2020_task_11',
 'huggingface:sent_comp',
 'huggingface:senti_lex',
 'huggingface:senti_ws',
 'huggingface:sentiment140',
 'huggingface:sepedi_ner',
 'huggingface:sesotho_ner_corpus',
 'huggingface:setimes',
 'huggingface:setswana_ner_corpus',
 'huggingface:sharc',
 'huggingface:sharc_modified',
 'huggingface:sick',
 'huggingface:silicone',
 'huggingface:simple_questions_v2',
 'huggingface:siswati_ner_corpus',
 'huggingface:smartdata',
 'huggingface:sms_spam',
 'huggingface:snips_built_in_intents',
 'huggingface:snli',
 'huggingface:snow_simplified_japanese_corpus',
 'huggingface:so_stacksample',
 'huggingface:social_bias_frames',
 'huggingface:social_i_qa',
 'huggingface:sofc_materials_articles',
 'huggingface:sogou_news',
 'huggingface:spanish_billion_words',
 'huggingface:spc',
 'huggingface:species_800',
 'huggingface:speech_commands',
 'huggingface:spider',
 'huggingface:squad',
 'huggingface:squad_adversarial',
 'huggingface:squad_es',
 'huggingface:squad_it',
 'huggingface:squad_kor_v1',
 'huggingface:squad_kor_v2',
 'huggingface:squad_v1_pt',
 'huggingface:squad_v2',
 'huggingface:squadshifts',
 'huggingface:srwac',
 'huggingface:sst',
 'huggingface:stereoset',
 'huggingface:story_cloze',
 'huggingface:stsb_mt_sv',
 'huggingface:stsb_multi_mt',
 'huggingface:style_change_detection',
 'huggingface:subjqa',
 'huggingface:super_glue',
 'huggingface:superb',
 'huggingface:swag',
 'huggingface:swahili',
 'huggingface:swahili_news',
 'huggingface:swda',
 'huggingface:swedish_medical_ner',
 'huggingface:swedish_ner_corpus',
 'huggingface:swedish_reviews',
 'huggingface:swiss_judgment_prediction',
 'huggingface:tab_fact',
 'huggingface:tamilmixsentiment',
 'huggingface:tanzil',
 'huggingface:tapaco',
 'huggingface:tashkeela',
 'huggingface:taskmaster1',
 'huggingface:taskmaster2',
 'huggingface:taskmaster3',
 'huggingface:tatoeba',
 'huggingface:ted_hrlr',
 'huggingface:ted_iwlst2013',
 'huggingface:ted_multi',
 'huggingface:ted_talks_iwslt',
 'huggingface:telugu_books',
 'huggingface:telugu_news',
 'huggingface:tep_en_fa_para',
 'huggingface:thai_toxicity_tweet',
 'huggingface:thainer',
 'huggingface:thaiqa_squad',
 'huggingface:thaisum',
 'huggingface:the_pile',
 'huggingface:the_pile_books3',
 'huggingface:the_pile_openwebtext2',
 'huggingface:the_pile_stack_exchange',
 'huggingface:tilde_model',
 'huggingface:time_dial',
 'huggingface:times_of_india_news_headlines',
 'huggingface:timit_asr',
 'huggingface:tiny_shakespeare',
 'huggingface:tlc',
 'huggingface:tmu_gfm_dataset',
 'huggingface:totto',
 'huggingface:trec',
 'huggingface:trivia_qa',
 'huggingface:tsac',
 'huggingface:ttc4900',
 'huggingface:tunizi',
 'huggingface:tuple_ie',
 'huggingface:turk',
 'huggingface:turkish_movie_sentiment',
 'huggingface:turkish_ner',
 'huggingface:turkish_product_reviews',
 'huggingface:turkish_shrinked_ner',
 'huggingface:turku_ner_corpus',
 'huggingface:tweet_eval',
 'huggingface:tweet_qa',
 'huggingface:tweets_ar_en_parallel',
 'huggingface:tweets_hate_speech_detection',
 'huggingface:twi_text_c3',
 'huggingface:twi_wordsim353',
 'huggingface:tydiqa',
 'huggingface:ubuntu_dialogs_corpus',
 'huggingface:udhr',
 'huggingface:um005',
 'huggingface:un_ga',
 'huggingface:un_multi',
 'huggingface:un_pc',
 'huggingface:universal_dependencies',
 'huggingface:universal_morphologies',
 'huggingface:urdu_fake_news',
 'huggingface:urdu_sentiment_corpus',
 'huggingface:vctk',
 'huggingface:vivos',
 'huggingface:web_nlg',
 'huggingface:web_of_science',
 'huggingface:web_questions',
 'huggingface:weibo_ner',
 'huggingface:wi_locness',
 'huggingface:wiki40b',
 'huggingface:wiki_asp',
 'huggingface:wiki_atomic_edits',
 'huggingface:wiki_auto',
 'huggingface:wiki_bio',
 'huggingface:wiki_dpr',
 'huggingface:wiki_hop',
 'huggingface:wiki_lingua',
 'huggingface:wiki_movies',
 'huggingface:wiki_qa',
 'huggingface:wiki_qa_ar',
 'huggingface:wiki_snippets',
 'huggingface:wiki_source',
 'huggingface:wiki_split',
 'huggingface:wiki_summary',
 'huggingface:wikiann',
 'huggingface:wikicorpus',
 'huggingface:wikihow',
 'huggingface:wikipedia',
 'huggingface:wikisql',
 'huggingface:wikitext',
 'huggingface:wikitext_tl39',
 'huggingface:wili_2018',
 'huggingface:wino_bias',
 'huggingface:winograd_wsc',
 'huggingface:winogrande',
 'huggingface:wiqa',
 'huggingface:wisesight1000',
 'huggingface:wisesight_sentiment',
 ...]

Bir veri kümesi yükleyin

tfds.load

Bir veri kümesini yüklemenin en kolay yolu tfds.load . O olacak:

  1. Verileri indirin ve tfrecord dosyaları olarak kaydedin.
  2. tf.data.Dataset yükleyin ve tfrecord oluşturun.
ds = tfds.load('mnist', split='train', shuffle_files=True)
assert isinstance(ds, tf.data.Dataset)
print(ds)
tutucu5 l10n-yer
<_OptionsDataset element_spec={'image': TensorSpec(shape=(28, 28, 1), dtype=tf.uint8, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None)}>
2022-02-07 04:07:40.542243: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected

Bazı yaygın argümanlar:

  • split= : Hangi bölme okunacak (örneğin 'train' , ['train', 'test'] , 'train[80%:]' ,...). Bölünmüş API kılavuzumuza bakın.
  • shuffle_files= : Dosyaların her dönem arasında karıştırılıp karıştırılmayacağını kontrol edin (TFDS, büyük veri kümelerini birden çok küçük dosyada depolar).
  • data_dir= : Veri kümesinin kaydedildiği konum (varsayılan olarak ~/tensorflow_datasets/ )
  • with_info=True : Veri kümesi meta verilerini içeren tfds.core.DatasetInfo döndürür
  • download=False : İndirmeyi devre dışı bırak

tfds.builder

tfds.load , tfds.core.DatasetBuilder çevresinde ince bir sarmalayıcıdır. Aynı çıktıyı tfds.core.DatasetBuilder API'sini kullanarak da alabilirsiniz:

builder = tfds.builder('mnist')
# 1. Create the tfrecord files (no-op if already exists)
builder.download_and_prepare()
# 2. Load the `tf.data.Dataset`
ds = builder.as_dataset(split='train', shuffle_files=True)
print(ds)
tutucu7 l10n-yer
<_OptionsDataset element_spec={'image': TensorSpec(shape=(28, 28, 1), dtype=tf.uint8, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None)}>

tfds build CLI'yi oluşturur

Belirli bir veri kümesi oluşturmak istiyorsanız, tfds komut satırını kullanabilirsiniz. Örneğin:

tfds build mnist

Kullanılabilir bayraklar için dokümana bakın.

Bir veri kümesi üzerinde yineleme

dikte olarak

Varsayılan olarak, tf.data.Dataset nesnesi, tf.Tensor s'nin bir dict içerir:

ds = tfds.load('mnist', split='train')
ds = ds.take(1)  # Only take a single example

for example in ds:  # example is `{'image': tf.Tensor, 'label': tf.Tensor}`
  print(list(example.keys()))
  image = example["image"]
  label = example["label"]
  print(image.shape, label)
tutucu10 l10n-yer
['image', 'label']
(28, 28, 1) tf.Tensor(4, shape=(), dtype=int64)
2022-02-07 04:07:41.932638: W tensorflow/core/kernels/data/cache_dataset_ops.cc:768] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.

dict tuşu adlarını ve yapısını öğrenmek için kataloğumuzdaki veri kümesi belgelerine bakın. Örneğin: mnist belgeleri .

Tuple olarak ( as_supervised=True )

as_supervised=True kullanarak, denetlenen veri kümeleri yerine bir demet (features, label) alabilirsiniz.

ds = tfds.load('mnist', split='train', as_supervised=True)
ds = ds.take(1)

for image, label in ds:  # example is (image, label)
  print(image.shape, label)
tutucu12 l10n-yer
(28, 28, 1) tf.Tensor(4, shape=(), dtype=int64)
2022-02-07 04:07:42.593594: W tensorflow/core/kernels/data/cache_dataset_ops.cc:768] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.

Numpy olarak ( tfds.as_numpy )

Aşağıdakileri dönüştürmek için tfds.as_numpy kullanır:

  • tf.Tensor -> np.array
  • tf.data.Dataset -> Iterator[Tree[np.array]] ( Tree isteğe bağlı olarak iç içe Dict , Tuple olabilir)
ds = tfds.load('mnist', split='train', as_supervised=True)
ds = ds.take(1)

for image, label in tfds.as_numpy(ds):
  print(type(image), type(label), label)
tutucu14 l10n-yer
<class 'numpy.ndarray'> <class 'numpy.int64'> 4
2022-02-07 04:07:43.220027: W tensorflow/core/kernels/data/cache_dataset_ops.cc:768] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.

Toplu olarak tf.Tensor ( batch_size=-1 )

batch_size=-1 kullanarak, tüm veri kümesini tek bir toplu iş halinde yükleyebilirsiniz.

Bu, verileri (np.array, np.array) olarak almak için as_supervised=True ve tfds.as_numpy ile birleştirilebilir:

image, label = tfds.as_numpy(tfds.load(
    'mnist',
    split='test',
    batch_size=-1,
    as_supervised=True,
))

print(type(image), image.shape)
tutucu16 l10n-yer
<class 'numpy.ndarray'> (10000, 28, 28, 1)

Veri kümenizin belleğe sığabileceğine ve tüm örneklerin aynı şekle sahip olmasına dikkat edin.

Veri kümelerinizi kıyaslayın

Bir veri kümesini kıyaslama, herhangi bir yinelenebilir (örneğin tf.data.Dataset , tfds.as_numpy ,...) üzerinde basit bir tfds.benchmark çağrısıdır.

ds = tfds.load('mnist', split='train')
ds = ds.batch(32).prefetch(1)

tfds.benchmark(ds, batch_size=32)
tfds.benchmark(ds, batch_size=32)  # Second epoch much faster due to auto-caching
tutucu18 l10n-yer
************ Summary ************

Examples/sec (First included) 42295.82 ex/sec (total: 60000 ex, 1.42 sec)
Examples/sec (First only) 131.50 ex/sec (total: 32 ex, 0.24 sec)
Examples/sec (First excluded) 51026.08 ex/sec (total: 59968 ex, 1.18 sec)

************ Summary ************

Examples/sec (First included) 204278.25 ex/sec (total: 60000 ex, 0.29 sec)
Examples/sec (First only) 1444.72 ex/sec (total: 32 ex, 0.02 sec)
Examples/sec (First excluded) 220821.83 ex/sec (total: 59968 ex, 0.27 sec)
  • batch_size= kwarg ile toplu iş boyutu başına sonuçları normalleştirmeyi unutmayın.
  • Özetle, ilk ısınma grubu, tf.data.Dataset ekstra kurulum süresini (örn. arabellek başlatma,...) yakalamak için diğerlerinden ayrılır.
  • TFDS otomatik önbelleğe alma nedeniyle ikinci yinelemenin nasıl daha hızlı olduğuna dikkat edin.
  • tfds.benchmark , daha fazla analiz için incelenebilecek bir tfds.core.BenchmarkResult döndürür.

Uçtan uca ardışık düzen oluşturun

Daha ileri gitmek için şunlara bakabilirsiniz:

görselleştirme

tfds.as_dataframe

tf.data.Dataset nesneleri, Colab üzerinde görselleştirilmek üzere pandas.DataFrame ile tfds.as_dataframe dönüştürülebilir.

  • Görüntüleri, sesleri, metinleri, videoları görselleştirmek için tfds.as_dataframe tfds.core.DatasetInfo ikinci argümanı olarak ekleyin...
  • Yalnızca ilk x örneğini görüntülemek için ds.take(x) kullanın. pandas.DataFrame , tüm veri kümesini belleğe yükler ve görüntülenmesi çok pahalı olabilir.
ds, info = tfds.load('mnist', split='train', with_info=True)

tfds.as_dataframe(ds.take(4), info)
tutucu20 l10n-yer
2022-02-07 04:07:47.001241: W tensorflow/core/kernels/data/cache_dataset_ops.cc:768] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.

tfds.show_examples

tfds.show_examples bir matplotlib.figure.Figure . Figure döndürür (şu anda yalnızca görüntü veri kümeleri desteklenir):

ds, info = tfds.load('mnist', split='train', with_info=True)

fig = tfds.show_examples(ds, info)
tutucu22 l10n-yer
2022-02-07 04:07:48.083706: W tensorflow/core/kernels/data/cache_dataset_ops.cc:768] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.

png

Veri kümesi meta verilerine erişin

Tüm oluşturucular, veri kümesi meta verilerini içeren bir tfds.core.DatasetInfo nesnesi içerir.

Şunlardan erişilebilir:

ds, info = tfds.load('mnist', with_info=True)
builder = tfds.builder('mnist')
info = builder.info

Veri kümesi bilgisi, veri kümesi hakkında ek bilgiler içerir (sürüm, alıntı, ana sayfa, açıklama,...).

print(info)
tutucu26 l10n-yer
tfds.core.DatasetInfo(
    name='mnist',
    full_name='mnist/3.0.1',
    description="""
    The MNIST database of handwritten digits.
    """,
    homepage='http://yann.lecun.com/exdb/mnist/',
    data_path='gs://tensorflow-datasets/datasets/mnist/3.0.1',
    download_size=11.06 MiB,
    dataset_size=21.00 MiB,
    features=FeaturesDict({
        'image': Image(shape=(28, 28, 1), dtype=tf.uint8),
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
    }),
    supervised_keys=('image', 'label'),
    disable_shuffling=False,
    splits={
        'test': <SplitInfo num_examples=10000, num_shards=1>,
        'train': <SplitInfo num_examples=60000, num_shards=1>,
    },
    citation="""@article{lecun2010mnist,
      title={MNIST handwritten digit database},
      author={LeCun, Yann and Cortes, Corinna and Burges, CJ},
      journal={ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist},
      volume={2},
      year={2010}
    }""",
)

Özellikler meta verileri (etiket adları, görüntü şekli,...)

tfds.features.FeatureDict :

info.features
tutucu28 l10n-yer
FeaturesDict({
    'image': Image(shape=(28, 28, 1), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})

Sınıf sayısı, etiket adları:

print(info.features["label"].num_classes)
print(info.features["label"].names)
print(info.features["label"].int2str(7))  # Human readable version (8 -> 'cat')
print(info.features["label"].str2int('7'))
tutucu30 l10n-yer
10
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
7
7

Şekiller, tipler:

print(info.features.shape)
print(info.features.dtype)
print(info.features['image'].shape)
print(info.features['image'].dtype)
tutucu32 l10n-yer
{'image': (28, 28, 1), 'label': ()}
{'image': tf.uint8, 'label': tf.int64}
(28, 28, 1)
<dtype: 'uint8'>

Bölünmüş meta veriler (ör. bölünmüş adlar, örnek sayısı,...)

tfds.core.SplitDict :

print(info.splits)
tutucu34 l10n-yer
{'test': <SplitInfo num_examples=10000, num_shards=1>, 'train': <SplitInfo num_examples=60000, num_shards=1>}

Mevcut bölmeler:

print(list(info.splits.keys()))
tutucu36 l10n-yer
['test', 'train']

Bireysel bölünme hakkında bilgi alın:

print(info.splits['train'].num_examples)
print(info.splits['train'].filenames)
print(info.splits['train'].num_shards)
tutucu38 l10n-yer
60000
['gs://tensorflow-datasets/datasets/mnist/3.0.1/mnist-train.tfrecord-00000-of-00001']
1

Ayrıca alt bölünmüş API ile de çalışır:

print(info.splits['train[15%:75%]'].num_examples)
print(info.splits['train[15%:75%]'].file_instructions)
tutucu40 l10n-yer
36000
[FileInstruction(filename='gs://tensorflow-datasets/datasets/mnist/3.0.1/mnist-train.tfrecord-00000-of-00001', skip=9000, take=36000, num_examples=36000)]

Sorun giderme

Manuel indirme (indirme başarısız olursa)

İndirme herhangi bir nedenle başarısız olursa (örn. çevrimdışı,...). Verileri her zaman kendiniz manuel olarak indirebilir ve manual_dir (varsayılanı ~/tensorflow_datasets/download/manual/ .

Hangi URL'lerin indirileceğini öğrenmek için şunlara bakın:

NonMatchingChecksumError

TFDS, indirilen url'lerin sağlama toplamlarını doğrulayarak determinizmi sağlar. NonMatchingChecksumError yükseltilirse, şunları gösterebilir:

  • Web sitesi kapalı olabilir (örn. 503 status code ). Lütfen url'yi kontrol edin.
  • Google Drive URL'leri için, aynı URL'ye çok fazla kişi eriştiğinde Drive bazen indirmeleri reddettiği için daha sonra tekrar deneyin. Hataya bakın
  • Orijinal veri kümesi dosyaları güncellenmiş olabilir. Bu durumda TFDS veri seti oluşturucu güncellenmelidir. Lütfen yeni bir Github sorunu veya PR açın:
    • Yeni sağlama toplamlarını tfds build --register_checksums ile kaydedin
    • Sonunda veri kümesi oluşturma kodunu güncelleyin.
    • Veri kümesini güncelleyin VERSION
    • Veri kümesini güncelleyin RELEASE_NOTES : Sağlama toplamlarının değişmesine ne sebep oldu? Bazı örnekler değişti mi?
    • Veri kümesinin hala oluşturulabildiğinden emin olun.
    • Bize bir PR gönderin

Alıntı

Bir makale için tensorflow-datasets kullanıyorsanız, kullanılan veri kümelerine özgü herhangi bir alıntıya ek olarak ( veri kümesi kataloğunda bulunabilir) lütfen aşağıdaki alıntıyı ekleyin.

@misc{TFDS,
  title = { {TensorFlow Datasets}, A collection of ready-to-use datasets},
  howpublished = {\url{https://www.tensorflow.org/datasets} },
}