References:
best2009
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:best2009/best2009')
- Description:
`best2009` is a Thai word-tokenization dataset from encyclopedia, novels, news and articles by
[NECTEC](https://www.nectec.or.th/) (148,995/2,252 lines of train/test). It was created for
[BEST 2010: Word Tokenization Competition](https://thailang.nectec.or.th/archive/indexa290.html?q=node/10).
The test set answers are not provided publicly.
- License: CC-BY-NC-SA 3.0
- Version: 1.0.0
- Splits:

| Split | Examples |
|---|---|
| 'test' | 2252 |
| 'train' | 148995 |

- Features:
{
  "fname": {
    "dtype": "string",
    "id": null,
    "_type": "Value"
  },
  "char": {
    "feature": {
      "dtype": "string",
      "id": null,
      "_type": "Value"
    },
    "length": -1,
    "id": null,
    "_type": "Sequence"
  },
  "char_type": {
    "feature": {
      "num_classes": 12,
      "names": [
        "b_e",
        "c",
        "d",
        "n",
        "o",
        "p",
        "q",
        "s",
        "s_e",
        "t",
        "v",
        "w"
      ],
      "names_file": null,
      "id": null,
      "_type": "ClassLabel"
    },
    "length": -1,
    "id": null,
    "_type": "Sequence"
  },
  "is_beginning": {
    "feature": {
      "num_classes": 2,
      "names": [
        "neg",
        "pos"
      ],
      "names_file": null,
      "id": null,
      "_type": "ClassLabel"
    },
    "length": -1,
    "id": null,
    "_type": "Sequence"
  }
}
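
The schema stores each line as parallel per-character sequences, so word boundaries have to be rebuilt from `char` and `is_beginning`. The following is a minimal sketch of that regrouping in plain Python. It assumes that class index 1 (`"pos"`) marks a character that starts a new word, which the feature name suggests but this page does not state. The helper `chars_to_words` and the sample record are hypothetical and are not taken from the dataset.

```python
from typing import List, Sequence


def chars_to_words(chars: Sequence[str], is_beginning: Sequence[int]) -> List[str]:
    """Group characters into words.

    Assumption: label 1 ("pos") means the character begins a new word,
    and label 0 ("neg") means it continues the current word.
    """
    if len(chars) != len(is_beginning):
        raise ValueError("char and is_beginning must have the same length")

    words: List[str] = []
    for ch, flag in zip(chars, is_beginning):
        # Start a new word on a "pos" label, or if no word has been opened yet.
        if flag == 1 or not words:
            words.append(ch)
        else:
            words[-1] += ch
    return words


# Hypothetical record shaped like the schema above (not real dataset content).
example = {
    "char": ["ก", "ิ", "น", "ข", "้", "า", "ว"],
    "is_beginning": [1, 0, 0, 1, 0, 0, 0],
}

words = chars_to_words(example["char"], example["is_beginning"])
print("|".join(words))
```

The same function can be applied to a record yielded by the loaded dataset once its `char` and `is_beginning` sequences have been converted to Python lists. The `char_type` sequence is left out here because the meaning of its 12 class names is not documented on this page.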