참고자료:
squad_kor_v2
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:squad_kor_v2/squad_kor_v2')
- 설명 :
KorQuAD 2.0 is a Korean question and answering dataset consisting of a total of 100,000+ pairs. There are three major differences from KorQuAD 1.0, which is the standard Korean Q & A data. The first is that a given document is a whole Wikipedia page, not just one or two paragraphs. Second, because the document also contains tables and lists, it is necessary to understand the document structured with HTML tags. Finally, the answer can be a long text covering not only word or phrase units, but paragraphs, tables, and lists. As a baseline model, BERT Multilingual is used, released by Google as an open source. It shows 46.0% F1 score, a very low score compared to 85.7% of the human F1 score. It indicates that this data is a challenging task. Additionally, we increased the performance by no-answer data augmentation. Through the distribution of this data, we intend to extend the limit of MRC that was limited to plain text to real world tasks of various lengths and formats.
- 라이센스 : CC BY-ND 2.0 KR
- 버전 : 2.1.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 83486 |
'validation' | 10165 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"title": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"context": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"question": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"answer": {
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"answer_start": {
"dtype": "int32",
"id": null,
"_type": "Value"
},
"html_answer_start": {
"dtype": "int32",
"id": null,
"_type": "Value"
}
},
"url": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"raw_html": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}