References:
id_newspapers_2018
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:id_newspapers_2018/id_newspapers_2018')
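The command above can be expanded into a slightly fuller sketch that selects the `train` split and prints a few records. This assumes a TFDS release that still resolves the `huggingface:` community namespace, plus network access for the download; the helper names `tfds_name` and `peek_train` are illustrative and not part of TFDS.

```python
def tfds_name(dataset: str, config: str) -> str:
    """Build the 'huggingface:<dataset>/<config>' name used in the command above."""
    return f"huggingface:{dataset}/{config}"


def peek_train(n: int = 3) -> None:
    """Print the title and URL of the first n training examples.

    Assumption: tensorflow_datasets is installed and the dataset can be downloaded.
    """
    # Imported lazily so tfds_name() stays usable without TFDS installed.
    import tensorflow_datasets as tfds

    name = tfds_name("id_newspapers_2018", "id_newspapers_2018")
    ds = tfds.load(name, split="train")
    for example in tfds.as_numpy(ds.take(n)):
        # String features arrive as bytes and need decoding.
        print(example["title"].decode("utf-8"), example["url"].decode("utf-8"))
```

Calling `peek_train()` triggers the download, so it is best run once in an environment with enough disk space for the data.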
- Description:
The dataset contains around 500K articles (136M words) from seven Indonesian newspapers: Detik, Kompas, Tempo,
CNN Indonesia, Sindo, Republika, and Poskota. The articles are dated between 1 January 2018 and 20 August 2018
(with a few exceptions dated earlier). The 500K uncompressed JSON files (newspapers-json.tgz) total around 2.2 GB,
and the cleaned, uncompressed text in a single large file (newspapers.txt.gz) is about 1 GB. The original source on
Google Drive also contains a dataset in HTML format that includes raw data (pictures, CSS, JavaScript, ...)
from the online news websites.
- License: Creative Commons Attribution-ShareAlike 4.0 International Public License
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 499164 |
- Features:
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"url": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"date": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"title": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"content": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
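
As a rough illustration of the feature spec above, the sketch below checks that a record carries exactly the five string fields. The helper name `schema_problems` is hypothetical, and the sample record in the usage note contains made-up values rather than real data from the dataset.

```python
from typing import Any, List, Mapping

# Field names taken from the feature spec above; every field is a string.
FIELDS = ("id", "url", "date", "title", "content")


def schema_problems(record: Mapping[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the record matches the spec."""
    problems = []
    for name in FIELDS:
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], str):
            problems.append(f"{name} is {type(record[name]).__name__}, expected str")
    for name in record:
        if name not in FIELDS:
            problems.append(f"unexpected field: {name}")
    return problems
```

For example, `schema_problems({"id": 1, "url": "u", "date": "d", "title": "t"})` reports that `id` is not a string and that `content` is missing. Records loaded through TFDS hold bytes rather than `str`, so they would need decoding before a check like this.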