Tài liệu tham khảo:
Sử dụng lệnh sau để tải tập dữ liệu này trong TFDS:
ds = tfds.load('huggingface:harem')
- Sự miêu tả :
The HAREM is a Portuguese language corpus commonly used for Named Entity Recognition tasks. It includes about 93k words, from 129 different texts,
from several genres, and language varieties. The split of this dataset version follows the division made by [1], where 7% HAREM
documents are the validation set and the miniHAREM corpus (with about 65k words) is the test set. There are two versions of the dataset set,
a version that has a total of 10 different named entity classes (Person, Organization, Location, Value, Date, Title, Thing, Event,
Abstraction, and Other) and a "selective" version with only 5 classes (Person, Organization, Location, Value, and Date).
It's important to note that the original version of the HAREM dataset has 2 levels of NER details, namely "Category" and "Sub-type".
The dataset version processed here ONLY USE the "Category" level of the original dataset.
[1] Souza, Fábio, Rodrigo Nogueira, and Roberto Lotufo. "BERTimbau: Pretrained BERT Models for Brazilian Portuguese." Brazilian Conference on Intelligent Systems. Springer, Cham, 2020.
- Giấy phép : Không có giấy phép được biết đến
- Phiên bản : 1.0.0
- Chia tách :
Tách ra | Ví dụ |
---|---|
'test' | 128 |
'train' | 121 |
'validation' | 8 |
- Đặc trưng :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"tokens": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"ner_tags": {
"feature": {
"num_classes": 21,
"names": [
"O",
"B-PESSOA",
"I-PESSOA",
"B-ORGANIZACAO",
"I-ORGANIZACAO",
"B-LOCAL",
"I-LOCAL",
"B-TEMPO",
"I-TEMPO",
"B-VALOR",
"I-VALOR",
"B-ABSTRACCAO",
"I-ABSTRACCAO",
"B-ACONTECIMENTO",
"I-ACONTECIMENTO",
"B-COISA",
"I-COISA",
"B-OBRA",
"I-OBRA",
"B-OUTRO",
"I-OUTRO"
],
"names_file": null,
"id": null,
"_type": "ClassLabel"
},
"length": -1,
"id": null,
"_type": "Sequence"
}
}
chọn lọc
Sử dụng lệnh sau để tải tập dữ liệu này trong TFDS:
ds = tfds.load('huggingface:harem/selective')
- Sự miêu tả :
The HAREM is a Portuguese language corpus commonly used for Named Entity Recognition tasks. It includes about 93k words, from 129 different texts,
from several genres, and language varieties. The split of this dataset version follows the division made by [1], where 7% HAREM
documents are the validation set and the miniHAREM corpus (with about 65k words) is the test set. There are two versions of the dataset set,
a version that has a total of 10 different named entity classes (Person, Organization, Location, Value, Date, Title, Thing, Event,
Abstraction, and Other) and a "selective" version with only 5 classes (Person, Organization, Location, Value, and Date).
It's important to note that the original version of the HAREM dataset has 2 levels of NER details, namely "Category" and "Sub-type".
The dataset version processed here ONLY USE the "Category" level of the original dataset.
[1] Souza, Fábio, Rodrigo Nogueira, and Roberto Lotufo. "BERTimbau: Pretrained BERT Models for Brazilian Portuguese." Brazilian Conference on Intelligent Systems. Springer, Cham, 2020.
- Giấy phép : Không có giấy phép được biết đến
- Phiên bản : 1.0.0
- Chia tách :
Tách ra | Ví dụ |
---|---|
'test' | 128 |
'train' | 121 |
'validation' | 8 |
- Đặc trưng :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"tokens": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"ner_tags": {
"feature": {
"num_classes": 11,
"names": [
"O",
"B-PESSOA",
"I-PESSOA",
"B-ORGANIZACAO",
"I-ORGANIZACAO",
"B-LOCAL",
"I-LOCAL",
"B-TEMPO",
"I-TEMPO",
"B-VALOR",
"I-VALOR"
],
"names_file": null,
"id": null,
"_type": "ClassLabel"
},
"length": -1,
"id": null,
"_type": "Sequence"
}
}