References:
best2009
Use the following command to load this dataset in TFDS:
import tensorflow_datasets as tfds
ds = tfds.load('huggingface:best2009/best2009')
- Description:
`best2009` is a Thai word-tokenization dataset drawn from encyclopedia entries, novels, news, and articles, compiled by
[NECTEC](https://www.nectec.or.th/) (148,995 train lines and 2,252 test lines). It was created for
[BEST 2010: Word Tokenization Competition](https://thailang.nectec.or.th/archive/indexa290.html?q=node/10).
The test set answers are not provided publicly.
- License: CC-BY-NC-SA 3.0
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'test' | 2252 |
'train' | 148995 |
- Features:
{
    "fname": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "char": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "char_type": {
        "feature": {
            "num_classes": 12,
            "names": [
                "b_e",
                "c",
                "d",
                "n",
                "o",
                "p",
                "q",
                "s",
                "s_e",
                "t",
                "v",
                "w"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "is_beginning": {
        "feature": {
            "num_classes": 2,
            "names": [
                "neg",
                "pos"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}
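
The feature spec stores each line as parallel per-character sequences, so recovering word tokens takes a small grouping step. The following is a minimal sketch, not part of the dataset's tooling. It assumes that the `pos` label of `is_beginning` marks the first character of a word (inferred from the feature name) and that class-label ids are indices into the `names` lists shown in the spec. The helper names `chars_to_words` and `decode_char_types` are hypothetical, and the toy input is illustrative rather than taken from the data.

```python
from typing import List, Sequence

# Label names copied from the feature spec; a ClassLabel id is an index into its list.
CHAR_TYPE_NAMES = ["b_e", "c", "d", "n", "o", "p", "q", "s", "s_e", "t", "v", "w"]
IS_BEGINNING_NAMES = ["neg", "pos"]


def chars_to_words(chars: Sequence[str], is_beginning: Sequence[int]) -> List[str]:
    """Group characters into words, opening a new word at each "pos" label.

    Assumption: "pos" marks the first character of a word.
    """
    if len(chars) != len(is_beginning):
        raise ValueError("chars and is_beginning must have the same length")
    pos_id = IS_BEGINNING_NAMES.index("pos")
    words: List[str] = []
    for ch, flag in zip(chars, is_beginning):
        # Also open a word if the sequence does not start with a "pos" label.
        if flag == pos_id or not words:
            words.append(ch)
        else:
            words[-1] += ch
    return words


def decode_char_types(ids: Sequence[int]) -> List[str]:
    """Map integer char_type ids back to their label names."""
    return [CHAR_TYPE_NAMES[i] for i in ids]


# Hypothetical toy input, shaped like the "char" and "is_beginning" features.
toy_words = chars_to_words(list("abcde"), [1, 0, 1, 0, 0])
toy_types = decode_char_types([0, 1, 11])
print(toy_words)
print(toy_types)
```

Because the grouping uses only the two parallel sequences, the same function can be applied to each example after converting its `char` and `is_beginning` fields to Python lists.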