natural_questions

Description:

The NQ corpus contains questions from real users, and it requires question-answering (QA) systems to read and comprehend an entire Wikipedia article that may or may not contain the answer to the question. The inclusion of real user questions, together with the requirement that solutions read an entire page to find the answer, makes NQ a more realistic and more challenging task than earlier QA datasets.

Additional documentation: Explore on Papers With Code
Homepage: https://ai.google.com/research/NaturalQuestions/dataset
Source code: tfds.datasets.natural_questions.Builder
Versions:
- 0.0.2: No release notes.
- 0.1.0 (default): No release notes.
Download size: 41.97 GiB
Auto-cached (documentation): No
Splits:

Split	Examples
`'train'`	307,373
`'validation'`	7,830

Supervised keys (see as_supervised doc): None
Figure (tfds.show_examples): Not supported.
Citation:

@article{47761,
title = {Natural Questions: a Benchmark for Question Answering Research},
author = {Tom Kwiatkowski and Jennimaria Palomaki and Olivia Redfield and Michael Collins and Ankur Parikh and Chris Alberti and Danielle Epstein and Illia Polosukhin and Matthew Kelcey and Jacob Devlin and Kenton Lee and Kristina N. Toutanova and Llion Jones and Ming-Wei Chang and Andrew Dai and Jakob Uszkoreit and Quoc Le and Slav Petrov},
year = {2019},
journal = {Transactions of the Association of Computational Linguistics}
}
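
As a rough illustration of how the dataset name, configs, and splits listed on this page fit together, the sketch below builds the name string passed to `tfds.load`. The helper names `dataset_name` and `load_split` are hypothetical and not part of TFDS; the split sizes are copied from the table above. Note that the first call to `load_split` would trigger the 41.97 GiB download.

```python
from typing import Any

# Values copied from this page.
DATASET = "natural_questions"
CONFIGS = ("default", "longt5")
SPLIT_SIZES = {"train": 307_373, "validation": 7_830}


def dataset_name(config: str = "default") -> str:
    """Build the 'dataset/config' name string (hypothetical helper)."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    return f"{DATASET}/{config}"


def load_split(config: str = "default", split: str = "validation") -> Any:
    """Load one split of one config (hypothetical helper).

    TFDS is imported lazily so that the helpers above can be used
    without the tensorflow-datasets package installed.
    """
    if split not in SPLIT_SIZES:
        raise ValueError(f"unknown split {split!r}; expected one of {list(SPLIT_SIZES)}")
    import tensorflow_datasets as tfds  # requires tensorflow-datasets

    return tfds.load(dataset_name(config), split=split)
```

Because the page reports no supervised keys, the sketch does not pass `as_supervised=True`; examples would come back as feature dictionaries.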

natural_questions/default (default config)

Config description: Default natural_questions config
Dataset size: 90.26 GiB
Feature structure:

FeaturesDict({
    'annotations': Sequence({
        'id': string,
        'long_answer': FeaturesDict({
            'end_byte': int64,
            'end_token': int64,
            'start_byte': int64,
            'start_token': int64,
        }),
        'short_answers': Sequence({
            'end_byte': int64,
            'end_token': int64,
            'start_byte': int64,
            'start_token': int64,
            'text': Text(shape=(), dtype=string),
        }),
        'yes_no_answer': ClassLabel(shape=(), dtype=int64, num_classes=2),
    }),
    'document': FeaturesDict({
        'html': Text(shape=(), dtype=string),
        'title': Text(shape=(), dtype=string),
        'tokens': Sequence({
            'is_html': bool,
            'token': Text(shape=(), dtype=string),
        }),
        'url': Text(shape=(), dtype=string),
    }),
    'id': string,
    'question': FeaturesDict({
        'text': Text(shape=(), dtype=string),
        'tokens': Sequence(string),
    }),
})
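
To make the token-span fields above more concrete, here is a minimal sketch that recovers answer text from `document/tokens` using a `start_token`/`end_token` pair. It rests on several assumptions that this page does not state: that `end_token` is exclusive, that a negative index means "no answer", and that a `Sequence` of dictionaries is exposed as a dictionary of parallel lists. The helper `span_text` and the toy example are hypothetical, hand-written data, not real dataset content.

```python
from typing import List


def span_text(tokens: List[str], is_html: List[bool], start: int, end: int) -> str:
    """Join the non-HTML tokens in [start, end).

    Assumes end is exclusive and that a negative index marks a missing answer.
    """
    if start < 0 or end < 0:
        return ""
    kept = [tok for tok, html in zip(tokens[start:end], is_html[start:end]) if not html]
    return " ".join(kept)


# Hand-made toy example that mirrors only the fields used here (not real data).
example = {
    "question": {"text": "what is the capital of france"},
    "document": {
        "tokens": {
            "token": ["<P>", "Paris", "is", "the", "capital", "of", "France", ".", "</P>"],
            "is_html": [True, False, False, False, False, False, False, False, True],
        },
    },
    # Assumed layout: one entry per annotation, stored as parallel lists.
    "annotations": {
        "long_answer": {"start_token": [0], "end_token": [9]},
    },
}

tokens = example["document"]["tokens"]["token"]
is_html = example["document"]["tokens"]["is_html"]
long_answer = example["annotations"]["long_answer"]

long_text = span_text(tokens, is_html, long_answer["start_token"][0], long_answer["end_token"][0])
short_text = span_text(tokens, is_html, 1, 2)  # a hypothetical short-answer span
```

The `is_html` flag is used here to drop markup tokens such as `<P>` so that only the readable text of the span remains.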

Feature documentation:

Feature	Class	Shape	Dtype
	FeaturesDict
annotations	Sequence
annotations/id	Tensor		string
annotations/long_answer	FeaturesDict
annotations/long_answer/end_byte	Tensor		int64
annotations/long_answer/end_token	Tensor		int64
annotations/long_answer/start_byte	Tensor		int64
annotations/long_answer/start_token	Tensor		int64
annotations/short_answers	Sequence
annotations/short_answers/end_byte	Tensor		int64
annotations/short_answers/end_token	Tensor		int64
annotations/short_answers/start_byte	Tensor		int64
annotations/short_answers/start_token	Tensor		int64
annotations/short_answers/text	Text		string
annotations/yes_no_answer	ClassLabel		int64
document	FeaturesDict
document/html	Text		string
document/title	Text		string
document/tokens	Sequence
document/tokens/is_html	Tensor		bool
document/tokens/token	Text		string
document/url	Text		string
id	Tensor		string
question	FeaturesDict
question/text	Text		string
question/tokens	Sequence(Tensor)	(None,)	string

Examples (tfds.as_dataframe):

natural_questions/longt5

Config description: natural_questions preprocessed as in the longT5 benchmark
Dataset size: 8.91 GiB
Feature structure:

FeaturesDict({
    'all_answers': Sequence(Text(shape=(), dtype=string)),
    'answer': Text(shape=(), dtype=string),
    'context': Text(shape=(), dtype=string),
    'id': Text(shape=(), dtype=string),
    'question': Text(shape=(), dtype=string),
    'title': Text(shape=(), dtype=string),
})
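
One plausible use of the flat fields above is checking a predicted answer against the references in `all_answers`. The sketch below assumes that `all_answers` lists the acceptable reference answers for a question, which this page does not spell out. The normalization (lowercasing, stripping ASCII punctuation, collapsing whitespace) is an illustrative choice, not the scoring used by the longT5 benchmark, and the record is hand-written toy data.

```python
import string
from typing import List


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    lowered = text.lower()
    no_punct = "".join(ch for ch in lowered if ch not in string.punctuation)
    return " ".join(no_punct.split())


def matches_any(prediction: str, all_answers: List[str]) -> bool:
    """Return True if the normalized prediction equals any normalized reference."""
    target = normalize(prediction)
    return any(target == normalize(answer) for answer in all_answers)


# Hand-made toy record with the same keys as the feature structure (not real data).
record = {
    "id": "0",
    "title": "France",
    "question": "what is the capital of france",
    "context": "Paris is the capital of France.",
    "answer": "Paris",
    "all_answers": ["Paris", "Paris, France"],
}

is_correct = matches_any("paris.", record["all_answers"])
is_wrong = matches_any("Lyon", record["all_answers"])
```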

Feature documentation:

Feature	Class	Shape	Dtype
	FeaturesDict
all_answers	Sequence(Text)	(None,)	string
answer	Text		string
context	Text		string
id	Text		string
question	Text		string
title	Text		string

Examples (tfds.as_dataframe):