natural_questions

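Description:

The NQ corpus contains questions from real users, and it requires QA systems to read and comprehend an entire Wikipedia article that may or may not contain the answer to the question. The inclusion of real user questions, and the requirement that solutions read an entire page to find the answer, make NQ a more realistic and challenging task than prior QA datasets.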

Additional documentation: Explore on Papers With Code
Homepage: https://ai.google.com/research/NaturalQuestions/dataset
Source code: tfds.datasets.natural_questions.Builder
Versions:
- 0.0.2: No release notes.
- 0.1.0 (default): No release notes.
Download size: 41.97 GiB
Auto-cached (documentation): No
Splits:

Split	Examples
`'train'`	307,373
`'validation'`	7,830

Supervised keys (see as_supervised doc): None
Figure (tfds.show_examples): Not supported.
Citation:

@article{47761,
title = {Natural Questions: a Benchmark for Question Answering Research},
author = {Tom Kwiatkowski and Jennimaria Palomaki and Olivia Redfield and Michael Collins and Ankur Parikh and Chris Alberti and Danielle Epstein and Illia Polosukhin and Matthew Kelcey and Jacob Devlin and Kenton Lee and Kristina N. Toutanova and Llion Jones and Ming-Wei Chang and Andrew Dai and Jakob Uszkoreit and Quoc Le and Slav Petrov},
year = {2019},
journal = {Transactions of the Association of Computational Linguistics}
}
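
As a rough illustration of how the configs, versions, and splits listed above map onto a load call, the sketch below builds a dataset name and passes it to `tfds.load`. The helper names `nq_name` and `load_nq` are hypothetical and not part of TFDS. The `tensorflow_datasets` import sits inside the function so that defining the helper triggers neither the import nor the 41.97 GiB download.

```python
def nq_name(config="default", version=None):
    """Build a TFDS name such as 'natural_questions/default:0.1.0'."""
    name = f"natural_questions/{config}"
    return f"{name}:{version}" if version else name


def load_nq(split="validation", config="default", version=None, data_dir=None):
    """Hypothetical wrapper around tfds.load.

    The first call downloads and prepares the data, which needs the disk
    space listed on this page. Auto-caching is off for this dataset.
    """
    import tensorflow_datasets as tfds  # imported lazily on purpose

    return tfds.load(nq_name(config, version), split=split, data_dir=data_dir)


print(nq_name())                    # natural_questions/default
print(nq_name("default", "0.1.0"))  # natural_questions/default:0.1.0
```

Calling `load_nq("validation")` would then return a `tf.data.Dataset` over the 7,830 validation examples, assuming the data has been prepared under `data_dir`.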

natural_questions/default (default config)

Config description: Default natural_questions config
Dataset size: 90.26 GiB
Feature structure:

FeaturesDict({
    'annotations': Sequence({
        'id': string,
        'long_answer': FeaturesDict({
            'end_byte': int64,
            'end_token': int64,
            'start_byte': int64,
            'start_token': int64,
        }),
        'short_answers': Sequence({
            'end_byte': int64,
            'end_token': int64,
            'start_byte': int64,
            'start_token': int64,
            'text': Text(shape=(), dtype=string),
        }),
        'yes_no_answer': ClassLabel(shape=(), dtype=int64, num_classes=2),
    }),
    'document': FeaturesDict({
        'html': Text(shape=(), dtype=string),
        'title': Text(shape=(), dtype=string),
        'tokens': Sequence({
            'is_html': bool,
            'token': Text(shape=(), dtype=string),
        }),
        'url': Text(shape=(), dtype=string),
    }),
    'id': string,
    'question': FeaturesDict({
        'text': Text(shape=(), dtype=string),
        'tokens': Sequence(string),
    }),
})

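Feature documentation:

Feature	Class	Shape	Dtype
	FeaturesDict
annotations	Sequence
annotations/id	Tensor		string
annotations/long_answer	FeaturesDict
annotations/long_answer/end_byte	Tensor		int64
annotations/long_answer/end_token	Tensor		int64
annotations/long_answer/start_byte	Tensor		int64
annotations/long_answer/start_token	Tensor		int64
annotations/short_answers	Sequence
annotations/short_answers/end_byte	Tensor		int64
annotations/short_answers/end_token	Tensor		int64
annotations/short_answers/start_byte	Tensor		int64
annotations/short_answers/start_token	Tensor		int64
annotations/short_answers/text	Text		string
annotations/yes_no_answer	ClassLabel		int64
document	FeaturesDict
document/html	Text		string
document/title	Text		string
document/tokens	Sequence
document/tokens/is_html	Tensor		bool
document/tokens/token	Text		string
document/url	Text		string
id	Tensor		string
question	FeaturesDict
question/text	Text		string
question/tokens	Sequence(Tensor)	(None,)	string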

Examples (tfds.as_dataframe):

natural_questions/longt5

Config description: natural_questions preprocessed as in the longT5 benchmark
Dataset size: 8.91 GiB
Feature structure:

FeaturesDict({
    'all_answers': Sequence(Text(shape=(), dtype=string)),
    'answer': Text(shape=(), dtype=string),
    'context': Text(shape=(), dtype=string),
    'id': Text(shape=(), dtype=string),
    'question': Text(shape=(), dtype=string),
    'title': Text(shape=(), dtype=string),
})

Feature documentation:

Feature	Class	Shape	Dtype
	FeaturesDict
all_answers	Sequence(Text)	(None,)	string
answer	Text		string
context	Text		string
id	Text		string
question	Text		string
title	Text		string
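
One plausible use of the `all_answers` field is to score a prediction against every listed reference. The sketch below does an exact-match check under that assumption. The normalization (lowercasing, stripping ASCII punctuation, collapsing whitespace) is a hypothetical choice, not the metric used by the longT5 benchmark, and the toy record is invented. Records read through `tfds.as_numpy` hold bytes and would need decoding first.

```python
import string


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match(prediction, record):
    """True if the prediction equals any reference after normalization.

    Falls back to the single 'answer' field when 'all_answers' is empty.
    """
    references = list(record["all_answers"]) or [record["answer"]]
    return any(normalize(prediction) == normalize(ref) for ref in references)


# Toy record invented for illustration; it is not taken from the dataset.
toy = {
    "id": "0",
    "title": "France",
    "question": "what is the capital of france",
    "context": "Paris is the capital of France.",
    "answer": "Paris",
    "all_answers": ["Paris", "Paris, France"],
}
print(exact_match("paris.", toy))  # True
print(exact_match("Lyon", toy))    # False
```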

Examples (tfds.as_dataframe):