참고자료:
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:harem')
- 설명 :
The HAREM is a Portuguese language corpus commonly used for Named Entity Recognition tasks. It includes about 93k words, from 129 different texts,
from several genres, and language varieties. The split of this dataset version follows the division made by [1], where 7% HAREM
documents are the validation set and the miniHAREM corpus (with about 65k words) is the test set. There are two versions of the dataset set,
a version that has a total of 10 different named entity classes (Person, Organization, Location, Value, Date, Title, Thing, Event,
Abstraction, and Other) and a "selective" version with only 5 classes (Person, Organization, Location, Value, and Date).
It's important to note that the original version of the HAREM dataset has 2 levels of NER details, namely "Category" and "Sub-type".
The dataset version processed here ONLY USE the "Category" level of the original dataset.
[1] Souza, Fábio, Rodrigo Nogueira, and Roberto Lotufo. "BERTimbau: Pretrained BERT Models for Brazilian Portuguese." Brazilian Conference on Intelligent Systems. Springer, Cham, 2020.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'test' | 128 |
'train' | 121 |
'validation' | 8 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"tokens": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"ner_tags": {
"feature": {
"num_classes": 21,
"names": [
"O",
"B-PESSOA",
"I-PESSOA",
"B-ORGANIZACAO",
"I-ORGANIZACAO",
"B-LOCAL",
"I-LOCAL",
"B-TEMPO",
"I-TEMPO",
"B-VALOR",
"I-VALOR",
"B-ABSTRACCAO",
"I-ABSTRACCAO",
"B-ACONTECIMENTO",
"I-ACONTECIMENTO",
"B-COISA",
"I-COISA",
"B-OBRA",
"I-OBRA",
"B-OUTRO",
"I-OUTRO"
],
"names_file": null,
"id": null,
"_type": "ClassLabel"
},
"length": -1,
"id": null,
"_type": "Sequence"
}
}
선택적
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:harem/selective')
- 설명 :
The HAREM is a Portuguese language corpus commonly used for Named Entity Recognition tasks. It includes about 93k words, from 129 different texts,
from several genres, and language varieties. The split of this dataset version follows the division made by [1], where 7% HAREM
documents are the validation set and the miniHAREM corpus (with about 65k words) is the test set. There are two versions of the dataset set,
a version that has a total of 10 different named entity classes (Person, Organization, Location, Value, Date, Title, Thing, Event,
Abstraction, and Other) and a "selective" version with only 5 classes (Person, Organization, Location, Value, and Date).
It's important to note that the original version of the HAREM dataset has 2 levels of NER details, namely "Category" and "Sub-type".
The dataset version processed here ONLY USE the "Category" level of the original dataset.
[1] Souza, Fábio, Rodrigo Nogueira, and Roberto Lotufo. "BERTimbau: Pretrained BERT Models for Brazilian Portuguese." Brazilian Conference on Intelligent Systems. Springer, Cham, 2020.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'test' | 128 |
'train' | 121 |
'validation' | 8 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"tokens": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"ner_tags": {
"feature": {
"num_classes": 11,
"names": [
"O",
"B-PESSOA",
"I-PESSOA",
"B-ORGANIZACAO",
"I-ORGANIZACAO",
"B-LOCAL",
"I-LOCAL",
"B-TEMPO",
"I-TEMPO",
"B-VALOR",
"I-VALOR"
],
"names_file": null,
"id": null,
"_type": "ClassLabel"
},
"length": -1,
"id": null,
"_type": "Sequence"
}
}