Tài liệu tham khảo:
Sử dụng lệnh sau để tải tập dữ liệu này trong TFDS:
ds = tfds.load('huggingface:counter')
- Sự miêu tả :
The COrpus of Urdu News TExt Reuse (COUNTER) corpus contains 1200 documents with real examples of text reuse from the field of journalism. It has been manually annotated at document level with three levels of reuse: wholly derived, partially derived and non derived.
- Giấy phép : Kho tài liệu được cấp phép theo Giấy phép Quốc tế Creative Commons Ghi công-Phi thương mại-Chia sẻ tương tự 4.0.
- Phiên bản : 1.0.0
- Chia tách :
Tách ra | Ví dụ |
---|---|
'train' | 600 |
- Đặc trưng :
{
"source": {
"filename": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"headline": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"body": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"total_number_of_words": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"total_number_of_sentences": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"number_of_words_with_swr": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"newspaper": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"newsdate": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"domain": {
"num_classes": 5,
"names": [
"business",
"sports",
"national",
"foreign",
"showbiz"
],
"names_file": null,
"id": null,
"_type": "ClassLabel"
},
"classification": {
"num_classes": 3,
"names": [
"wholly_derived",
"partially_derived",
"not_derived"
],
"names_file": null,
"id": null,
"_type": "ClassLabel"
}
},
"derived": {
"filename": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"headline": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"body": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"total_number_of_words": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"total_number_of_sentences": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"number_of_words_with_swr": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"newspaper": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"newsdate": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"domain": {
"num_classes": 5,
"names": [
"business",
"sports",
"national",
"foreign",
"showbiz"
],
"names_file": null,
"id": null,
"_type": "ClassLabel"
},
"classification": {
"num_classes": 3,
"names": [
"wholly_derived",
"partially_derived",
"not_derived"
],
"names_file": null,
"id": null,
"_type": "ClassLabel"
}
}
}