참고자료:
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:brwac')
- 설명 :
The BrWaC (Brazilian Portuguese Web as Corpus) is a large corpus constructed following the Wacky framework,
which was made public for research purposes. The current corpus version, released in January 2017, is composed by
3.53 million documents, 2.68 billion tokens and 5.79 million types. Please note that this resource is available
solely for academic research purposes, and you agreed not to use it for any commercial applications.
Manually download at https://www.inf.ufrgs.br/pln/wiki/index.php?title=BrWaC
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 3530796 |
- 특징 :
{
"doc_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"title": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"uri": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"text": {
"feature": {
"paragraphs": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
}
},
"length": -1,
"id": null,
"_type": "Sequence"
}
}