Referensi:
tidak diacak_deduplikasi_af
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_af')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis berdasarkan skema lisensi ini. Kami tidak memiliki teks apa pun yang menjadi sumber pengambilan data ini. Kami melisensikan pengemasan sebenarnya dari data ini di bawah lisensi Creative Commons CC0 ("tidak ada hak yang dilindungi undang-undang") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh memungkinkan berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau hak tetangga atas OSCAR Karya ini diterbitkan dari: Perancis.
Jika Anda menganggap bahwa data kami berisi materi milik Anda dan oleh karena itu tidak boleh direproduksi di sini, mohon:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon, atau alamat email yang dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim telah dilanggar.
- Identifikasi dengan jelas materi yang diklaim melanggar dan informasi cukup memadai untuk memungkinkan kami menemukan materi tersebut.
Kami akan memenuhi permintaan yang sah dengan menghapus sumber yang terpengaruh dari rilis korpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 130640 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplikasi_als
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_als')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis berdasarkan skema lisensi ini. Kami tidak memiliki teks apa pun yang menjadi sumber pengambilan data ini. Kami melisensikan pengemasan sebenarnya dari data ini di bawah lisensi Creative Commons CC0 ("tidak ada hak yang dilindungi undang-undang") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh memungkinkan berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau hak tetangga atas OSCAR Karya ini diterbitkan dari: Perancis.
Jika Anda menganggap bahwa data kami berisi materi milik Anda dan oleh karena itu tidak boleh direproduksi di sini, mohon:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon, atau alamat email yang dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim telah dilanggar.
- Identifikasi dengan jelas materi yang diklaim melanggar dan informasi cukup memadai untuk memungkinkan kami menemukan materi tersebut.
Kami akan memenuhi permintaan yang sah dengan menghapus sumber yang terpengaruh dari rilis korpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 4518 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
tidak diacak_deduplikasi_arz
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_arz')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis berdasarkan skema lisensi ini. Kami tidak memiliki teks apa pun yang menjadi sumber pengambilan data ini. Kami melisensikan pengemasan sebenarnya dari data ini di bawah lisensi Creative Commons CC0 ("tidak ada hak yang dilindungi undang-undang") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh memungkinkan berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau hak tetangga atas OSCAR Karya ini diterbitkan dari: Perancis.
Jika Anda menganggap bahwa data kami berisi materi milik Anda dan oleh karena itu tidak boleh direproduksi di sini, mohon:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon, atau alamat email yang dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim telah dilanggar.
- Identifikasi dengan jelas materi yang diklaim melanggar dan informasi cukup memadai untuk memungkinkan kami menemukan materi tersebut.
Kami akan memenuhi permintaan yang sah dengan menghapus sumber yang terpengaruh dari rilis korpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 79928 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
tidak dikocok_deduplikasi_an
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_an')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis berdasarkan skema lisensi ini. Kami tidak memiliki teks apa pun yang menjadi sumber pengambilan data ini. Kami melisensikan pengemasan sebenarnya dari data ini di bawah lisensi Creative Commons CC0 ("tidak ada hak yang dilindungi undang-undang") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh memungkinkan berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau hak tetangga atas OSCAR Karya ini diterbitkan dari: Perancis.
Jika Anda menganggap bahwa data kami berisi materi milik Anda dan oleh karena itu tidak boleh direproduksi di sini, mohon:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon, atau alamat email yang dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim telah dilanggar.
- Identifikasi dengan jelas materi yang diklaim melanggar dan informasi cukup memadai untuk memungkinkan kami menemukan materi tersebut.
Kami akan memenuhi permintaan yang sah dengan menghapus sumber yang terpengaruh dari rilis korpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 2025 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
tidak diacak_deduplikasi_ast
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ast')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis berdasarkan skema lisensi ini. Kami tidak memiliki teks apa pun yang menjadi sumber pengambilan data ini. Kami melisensikan pengemasan sebenarnya dari data ini di bawah lisensi Creative Commons CC0 ("tidak ada hak yang dilindungi undang-undang") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh memungkinkan berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau hak tetangga atas OSCAR Karya ini diterbitkan dari: Perancis.
Jika Anda menganggap bahwa data kami berisi materi milik Anda dan oleh karena itu tidak boleh direproduksi di sini, mohon:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon, atau alamat email yang dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim telah dilanggar.
- Identifikasi dengan jelas materi yang diklaim melanggar dan informasi cukup memadai untuk memungkinkan kami menemukan materi tersebut.
Kami akan memenuhi permintaan yang sah dengan menghapus sumber yang terpengaruh dari rilis korpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 5343 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
tidak dikocok_deduplikasi_ba
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ba')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis berdasarkan skema lisensi ini. Kami tidak memiliki teks apa pun yang menjadi sumber pengambilan data ini. Kami melisensikan pengemasan sebenarnya dari data ini di bawah lisensi Creative Commons CC0 ("tidak ada hak yang dilindungi undang-undang") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh memungkinkan berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau hak tetangga atas OSCAR Karya ini diterbitkan dari: Perancis.
Jika Anda menganggap bahwa data kami berisi materi milik Anda dan oleh karena itu tidak boleh direproduksi di sini, mohon:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon, atau alamat email yang dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim telah dilanggar.
- Identifikasi dengan jelas materi yang diklaim melanggar dan informasi cukup memadai untuk memungkinkan kami menemukan materi tersebut.
Kami akan memenuhi permintaan yang sah dengan menghapus sumber yang terpengaruh dari rilis korpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 27050 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
tidak dikocok_deduplikasi_am
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_am')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis berdasarkan skema lisensi ini. Kami tidak memiliki teks apa pun yang menjadi sumber pengambilan data ini. Kami melisensikan pengemasan sebenarnya dari data ini di bawah lisensi Creative Commons CC0 ("tidak ada hak yang dilindungi undang-undang") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh memungkinkan berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau hak tetangga atas OSCAR Karya ini diterbitkan dari: Perancis.
Jika Anda menganggap bahwa data kami berisi materi milik Anda dan oleh karena itu tidak boleh direproduksi di sini, mohon:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon, atau alamat email yang dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim telah dilanggar.
- Identifikasi dengan jelas materi yang diklaim melanggar dan informasi cukup memadai untuk memungkinkan kami menemukan materi tersebut.
Kami akan memenuhi permintaan yang sah dengan menghapus sumber yang terpengaruh dari rilis korpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 43102 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
tidak dikocok_deduplikasi_as
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_as')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis berdasarkan skema lisensi ini. Kami tidak memiliki teks apa pun yang menjadi sumber pengambilan data ini. Kami melisensikan pengemasan sebenarnya dari data ini di bawah lisensi Creative Commons CC0 ("tidak ada hak yang dilindungi undang-undang") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh memungkinkan berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau hak tetangga atas OSCAR Karya ini diterbitkan dari: Perancis.
Jika Anda menganggap bahwa data kami berisi materi milik Anda dan oleh karena itu tidak boleh direproduksi di sini, mohon:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon, atau alamat email yang dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim telah dilanggar.
- Identifikasi dengan jelas materi yang diklaim melanggar dan informasi cukup memadai untuk memungkinkan kami menemukan materi tersebut.
Kami akan memenuhi permintaan yang sah dengan menghapus sumber yang terpengaruh dari rilis korpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 9212 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
tidak dikocok_deduplikasi_azb
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_azb')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis berdasarkan skema lisensi ini. Kami tidak memiliki teks apa pun yang menjadi sumber pengambilan data ini. Kami melisensikan pengemasan sebenarnya dari data ini di bawah lisensi Creative Commons CC0 ("tidak ada hak yang dilindungi undang-undang") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh memungkinkan berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau hak tetangga atas OSCAR Karya ini diterbitkan dari: Perancis.
Jika Anda menganggap bahwa data kami berisi materi milik Anda dan oleh karena itu tidak boleh direproduksi di sini, mohon:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon, atau alamat email yang dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim telah dilanggar.
- Identifikasi dengan jelas materi yang diklaim melanggar dan informasi cukup memadai untuk memungkinkan kami menemukan materi tersebut.
Kami akan memenuhi permintaan yang sah dengan menghapus sumber yang terpengaruh dari rilis korpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 9985 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
tidak dikocok_diduplikasi_menjadi
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_be')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis berdasarkan skema lisensi ini. Kami tidak memiliki teks apa pun yang menjadi sumber pengambilan data ini. Kami melisensikan pengemasan sebenarnya dari data ini di bawah lisensi Creative Commons CC0 ("tidak ada hak yang dilindungi undang-undang") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh memungkinkan berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau hak tetangga atas OSCAR Karya ini diterbitkan dari: Perancis.
Jika Anda menganggap bahwa data kami berisi materi milik Anda dan oleh karena itu tidak boleh direproduksi di sini, mohon:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon, atau alamat email yang dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim telah dilanggar.
- Identifikasi dengan jelas materi yang diklaim melanggar dan informasi cukup memadai untuk memungkinkan kami menemukan materi tersebut.
Kami akan memenuhi permintaan yang sah dengan menghapus sumber yang terpengaruh dari rilis korpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 307405 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
tidak dikocok_deduplikasi_bo
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bo')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis berdasarkan skema lisensi ini. Kami tidak memiliki teks apa pun yang menjadi sumber pengambilan data ini. Kami melisensikan pengemasan sebenarnya dari data ini di bawah lisensi Creative Commons CC0 ("tidak ada hak yang dilindungi undang-undang") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh memungkinkan berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau hak tetangga atas OSCAR Karya ini diterbitkan dari: Perancis.
Jika Anda menganggap bahwa data kami berisi materi milik Anda dan oleh karena itu tidak boleh direproduksi di sini, mohon:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon, atau alamat email yang dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim telah dilanggar.
- Identifikasi dengan jelas materi yang diklaim melanggar dan informasi cukup memadai untuk memungkinkan kami menemukan materi tersebut.
Kami akan memenuhi permintaan yang sah dengan menghapus sumber yang terpengaruh dari rilis korpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 15762 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplikasi_bxr
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bxr')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis berdasarkan skema lisensi ini. Kami tidak memiliki teks apa pun yang menjadi sumber pengambilan data ini. Kami melisensikan pengemasan sebenarnya dari data ini di bawah lisensi Creative Commons CC0 ("tidak ada hak yang dilindungi undang-undang") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh memungkinkan berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau hak tetangga atas OSCAR Karya ini diterbitkan dari: Perancis.
Jika Anda menganggap bahwa data kami berisi materi milik Anda dan oleh karena itu tidak boleh direproduksi di sini, mohon:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon, atau alamat email yang dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim telah dilanggar.
- Identifikasi dengan jelas materi yang diklaim melanggar dan informasi cukup memadai untuk memungkinkan kami menemukan materi tersebut.
Kami akan memenuhi permintaan yang sah dengan menghapus sumber yang terpengaruh dari rilis korpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 36 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
tidak dikocok_deduplikasi_ceb
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ceb')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis berdasarkan skema lisensi ini. Kami tidak memiliki teks apa pun yang menjadi sumber pengambilan data ini. Kami melisensikan pengemasan sebenarnya dari data ini di bawah lisensi Creative Commons CC0 ("tidak ada hak yang dilindungi undang-undang") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh memungkinkan berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau hak tetangga atas OSCAR Karya ini diterbitkan dari: Perancis.
Jika Anda menganggap bahwa data kami berisi materi milik Anda dan oleh karena itu tidak boleh direproduksi di sini, mohon:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon, atau alamat email yang dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim telah dilanggar.
- Identifikasi dengan jelas materi yang diklaim melanggar dan informasi cukup memadai untuk memungkinkan kami menemukan materi tersebut.
Kami akan memenuhi permintaan yang sah dengan menghapus sumber yang terpengaruh dari rilis korpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 26145 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
tidak dikocok_deduplikasi_az
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_az')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis berdasarkan skema lisensi ini. Kami tidak memiliki teks apa pun yang menjadi sumber pengambilan data ini. Kami melisensikan pengemasan sebenarnya dari data ini di bawah lisensi Creative Commons CC0 ("tidak ada hak yang dilindungi undang-undang") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh memungkinkan berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau hak tetangga atas OSCAR Karya ini diterbitkan dari: Perancis.
Jika Anda menganggap bahwa data kami berisi materi milik Anda dan oleh karena itu tidak boleh direproduksi di sini, mohon:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon, atau alamat email yang dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim telah dilanggar.
- Identifikasi dengan jelas materi yang diklaim melanggar dan informasi cukup memadai untuk memungkinkan kami menemukan materi tersebut.
Kami akan memenuhi permintaan yang sah dengan menghapus sumber yang terpengaruh dari rilis korpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 626796 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
tidak dikocok_deduplikasi_bcl
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bcl')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis berdasarkan skema lisensi ini. Kami tidak memiliki teks apa pun yang menjadi sumber pengambilan data ini. Kami melisensikan pengemasan sebenarnya dari data ini di bawah lisensi Creative Commons CC0 ("tidak ada hak yang dilindungi undang-undang") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh memungkinkan berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau hak tetangga atas OSCAR Karya ini diterbitkan dari: Perancis.
Jika Anda menganggap bahwa data kami berisi materi milik Anda dan oleh karena itu tidak boleh direproduksi di sini, mohon:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon, atau alamat email yang dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim telah dilanggar.
- Identifikasi dengan jelas materi yang diklaim melanggar dan informasi cukup memadai untuk memungkinkan kami menemukan materi tersebut.
Kami akan memenuhi permintaan yang sah dengan menghapus sumber yang terpengaruh dari rilis korpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 1 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
tidak dikocok_deduplikasi_cy
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cy')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis berdasarkan skema lisensi ini. Kami tidak memiliki teks apa pun yang menjadi sumber pengambilan data ini. Kami melisensikan pengemasan sebenarnya dari data ini di bawah lisensi Creative Commons CC0 ("tidak ada hak yang dilindungi undang-undang") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh memungkinkan berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau hak tetangga atas OSCAR Karya ini diterbitkan dari: Perancis.
Jika Anda menganggap bahwa data kami berisi materi milik Anda dan oleh karena itu tidak boleh direproduksi di sini, mohon:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon, atau alamat email yang dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim telah dilanggar.
- Identifikasi dengan jelas materi yang diklaim melanggar dan informasi cukup memadai untuk memungkinkan kami menemukan materi tersebut.
Kami akan memenuhi permintaan yang sah dengan menghapus sumber yang terpengaruh dari rilis korpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 98225 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
tidak dikocok_deduplikasi_dsb
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dsb')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis berdasarkan skema lisensi ini. Kami tidak memiliki teks apa pun yang menjadi sumber pengambilan data ini. Kami melisensikan pengemasan sebenarnya dari data ini di bawah lisensi Creative Commons CC0 ("tidak ada hak yang dilindungi undang-undang") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh memungkinkan berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau hak tetangga atas OSCAR Karya ini diterbitkan dari: Perancis.
Jika Anda menganggap bahwa data kami berisi materi milik Anda dan oleh karena itu tidak boleh direproduksi di sini, mohon:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon, atau alamat email yang dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim telah dilanggar.
- Identifikasi dengan jelas materi yang diklaim melanggar dan informasi cukup memadai untuk memungkinkan kami menemukan materi tersebut.
Kami akan memenuhi permintaan yang sah dengan menghapus sumber yang terpengaruh dari rilis korpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 37 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
tidak dikocok_deduplikasi_bn
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bn')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis berdasarkan skema lisensi ini. Kami tidak memiliki teks apa pun yang menjadi sumber pengambilan data ini. Kami melisensikan pengemasan sebenarnya dari data ini di bawah lisensi Creative Commons CC0 ("tidak ada hak yang dilindungi undang-undang") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh memungkinkan berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau hak tetangga atas OSCAR Karya ini diterbitkan dari: Perancis.
Jika Anda menganggap bahwa data kami berisi materi milik Anda dan oleh karena itu tidak boleh direproduksi di sini, mohon:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon, atau alamat email yang dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim telah dilanggar.
- Identifikasi dengan jelas materi yang diklaim melanggar dan informasi cukup memadai untuk memungkinkan kami menemukan materi tersebut.
Kami akan memenuhi permintaan yang sah dengan menghapus sumber yang terpengaruh dari rilis korpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 1114481 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplikasi_bs
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bs')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis berdasarkan skema lisensi ini. Kami tidak memiliki teks apa pun yang menjadi sumber pengambilan data ini. Kami melisensikan pengemasan sebenarnya dari data ini di bawah lisensi Creative Commons CC0 ("tidak ada hak yang dilindungi undang-undang") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh memungkinkan berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau hak tetangga atas OSCAR Karya ini diterbitkan dari: Perancis.
Jika Anda menganggap bahwa data kami berisi materi milik Anda dan oleh karena itu tidak boleh direproduksi di sini, mohon:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon, atau alamat email yang dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim telah dilanggar.
- Identifikasi dengan jelas materi yang diklaim melanggar dan informasi cukup memadai untuk memungkinkan kami menemukan materi tersebut.
Kami akan memenuhi permintaan yang sah dengan menghapus sumber yang terpengaruh dari rilis korpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 702 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
tidak dikocok_deduplikasi_ce
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ce')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis berdasarkan skema lisensi ini. Kami tidak memiliki teks apa pun yang menjadi sumber pengambilan data ini. Kami melisensikan pengemasan sebenarnya dari data ini di bawah lisensi Creative Commons CC0 ("tidak ada hak yang dilindungi undang-undang") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh memungkinkan berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau hak tetangga atas OSCAR Karya ini diterbitkan dari: Perancis.
Jika Anda menganggap bahwa data kami berisi materi milik Anda dan oleh karena itu tidak boleh direproduksi di sini, mohon:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon, atau alamat email yang dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim telah dilanggar.
- Identifikasi dengan jelas materi yang diklaim melanggar dan informasi cukup memadai untuk memungkinkan kami menemukan materi tersebut.
Kami akan memenuhi permintaan yang sah dengan menghapus sumber yang terpengaruh dari rilis korpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 2984 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
tidak diacak_deduplikasi_cv
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cv')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis berdasarkan skema lisensi ini. Kami tidak memiliki teks apa pun yang menjadi sumber pengambilan data ini. Kami melisensikan pengemasan sebenarnya dari data ini di bawah lisensi Creative Commons CC0 ("tidak ada hak yang dilindungi undang-undang") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh memungkinkan berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau hak tetangga atas OSCAR Karya ini diterbitkan dari: Perancis.
Jika Anda menganggap bahwa data kami berisi materi milik Anda dan oleh karena itu tidak boleh direproduksi di sini, mohon:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon, atau alamat email yang dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim telah dilanggar.
- Identifikasi dengan jelas materi yang diklaim melanggar dan informasi cukup memadai untuk memungkinkan kami menemukan materi tersebut.
Kami akan memenuhi permintaan yang sah dengan menghapus sumber yang terpengaruh dari rilis korpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 10130 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
tidak dikocok_deduplikasi_diq
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_diq')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis berdasarkan skema lisensi ini. Kami tidak memiliki teks apa pun yang menjadi sumber pengambilan data ini. Kami melisensikan pengemasan sebenarnya dari data ini di bawah lisensi Creative Commons CC0 ("tidak ada hak yang dilindungi undang-undang") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh memungkinkan berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau hak tetangga atas OSCAR Karya ini diterbitkan dari: Perancis.
Jika Anda menganggap bahwa data kami berisi materi milik Anda dan oleh karena itu tidak boleh direproduksi di sini, mohon:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon, atau alamat email yang dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim telah dilanggar.
- Identifikasi dengan jelas materi yang diklaim melanggar dan informasi cukup memadai untuk memungkinkan kami menemukan materi tersebut.
Kami akan memenuhi permintaan yang sah dengan menghapus sumber yang terpengaruh dari rilis korpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 1 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplikasi_eml
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eml')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis berdasarkan skema lisensi ini. Kami tidak memiliki teks apa pun yang menjadi sumber pengambilan data ini. Kami melisensikan pengemasan sebenarnya dari data ini di bawah lisensi Creative Commons CC0 ("tidak ada hak yang dilindungi undang-undang") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh memungkinkan berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau hak tetangga atas OSCAR Karya ini diterbitkan dari: Perancis.
Jika Anda menganggap bahwa data kami berisi materi milik Anda dan oleh karena itu tidak boleh direproduksi di sini, mohon:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon, atau alamat email yang dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim telah dilanggar.
- Identifikasi dengan jelas materi yang diklaim melanggar dan informasi cukup memadai untuk memungkinkan kami menemukan materi tersebut.
Kami akan memenuhi permintaan yang sah dengan menghapus sumber yang terpengaruh dari rilis korpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 80 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
tidak dikocok_deduplikasi_et
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_et')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis berdasarkan skema lisensi ini. Kami tidak memiliki teks apa pun yang menjadi sumber pengambilan data ini. Kami melisensikan pengemasan sebenarnya dari data ini di bawah lisensi Creative Commons CC0 ("tidak ada hak yang dilindungi undang-undang") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh memungkinkan berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau hak tetangga atas OSCAR Karya ini diterbitkan dari: Perancis.
Jika Anda menganggap bahwa data kami berisi materi milik Anda dan oleh karena itu tidak boleh direproduksi di sini, mohon:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon, atau alamat email yang dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim telah dilanggar.
- Identifikasi dengan jelas materi yang diklaim melanggar dan informasi cukup memadai untuk memungkinkan kami menemukan materi tersebut.
Kami akan memenuhi permintaan yang sah dengan menghapus sumber yang terpengaruh dari rilis korpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 1172041 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
tidak dikocok_deduplikasi_bg
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bg')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis berdasarkan skema lisensi ini. Kami tidak memiliki teks apa pun yang menjadi sumber pengambilan data ini. Kami melisensikan pengemasan sebenarnya dari data ini di bawah lisensi Creative Commons CC0 ("tidak ada hak yang dilindungi undang-undang") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh memungkinkan berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau hak tetangga atas OSCAR Karya ini diterbitkan dari: Perancis.
Jika Anda menganggap bahwa data kami berisi materi milik Anda dan oleh karena itu tidak boleh direproduksi di sini, mohon:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon, atau alamat email yang dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim telah dilanggar.
- Identifikasi dengan jelas materi yang diklaim melanggar dan informasi cukup memadai untuk memungkinkan kami menemukan materi tersebut.
Kami akan memenuhi permintaan yang sah dengan menghapus sumber yang terpengaruh dari rilis korpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 3398679 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
tidak diacak_deduplikasi_bpy
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bpy')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis berdasarkan skema lisensi ini. Kami tidak memiliki teks apa pun yang menjadi sumber pengambilan data ini. Kami melisensikan pengemasan sebenarnya dari data ini di bawah lisensi Creative Commons CC0 ("tidak ada hak yang dilindungi undang-undang") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh memungkinkan berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau hak tetangga atas OSCAR Karya ini diterbitkan dari: Perancis.
Jika Anda menganggap bahwa data kami berisi materi milik Anda dan oleh karena itu tidak boleh direproduksi di sini, mohon:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon, atau alamat email yang dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim telah dilanggar.
- Identifikasi dengan jelas materi yang diklaim melanggar dan informasi cukup memadai untuk memungkinkan kami menemukan materi tersebut.
Kami akan memenuhi permintaan yang sah dengan menghapus sumber yang terpengaruh dari rilis korpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 1770 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplikasi_ca
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ca')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis berdasarkan skema lisensi ini. Kami tidak memiliki teks apa pun yang menjadi sumber pengambilan data ini. Kami melisensikan pengemasan sebenarnya dari data ini di bawah lisensi Creative Commons CC0 ("tidak ada hak yang dilindungi undang-undang") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh memungkinkan berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau hak tetangga atas OSCAR Karya ini diterbitkan dari: Perancis.
Jika Anda menganggap bahwa data kami berisi materi milik Anda dan oleh karena itu tidak boleh direproduksi di sini, mohon:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon, atau alamat email yang dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim telah dilanggar.
- Identifikasi dengan jelas materi yang diklaim melanggar dan informasi cukup memadai untuk memungkinkan kami menemukan materi tersebut.
Kami akan memenuhi permintaan yang sah dengan menghapus sumber yang terpengaruh dari rilis korpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 2458067 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
tidak dikocok_deduplikasi_ckb
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ckb')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis berdasarkan skema lisensi ini. Kami tidak memiliki teks apa pun yang menjadi sumber pengambilan data ini. Kami melisensikan pengemasan sebenarnya dari data ini di bawah lisensi Creative Commons CC0 ("tidak ada hak yang dilindungi undang-undang") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh memungkinkan berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau hak tetangga atas OSCAR Karya ini diterbitkan dari: Perancis.
Jika Anda menganggap bahwa data kami berisi materi milik Anda dan oleh karena itu tidak boleh direproduksi di sini, mohon:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon, atau alamat email yang dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim telah dilanggar.
- Identifikasi dengan jelas materi yang diklaim melanggar dan informasi cukup memadai untuk memungkinkan kami menemukan materi tersebut.
Kami akan memenuhi permintaan yang sah dengan menghapus sumber yang terpengaruh dari rilis korpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 68210 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
tidak dikocok_deduplikasi_ar
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ar')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis berdasarkan skema lisensi ini. Kami tidak memiliki teks apa pun yang menjadi sumber pengambilan data ini. Kami melisensikan pengemasan sebenarnya dari data ini di bawah lisensi Creative Commons CC0 ("tidak ada hak yang dilindungi undang-undang") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh memungkinkan berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau hak tetangga atas OSCAR Karya ini diterbitkan dari: Perancis.
Jika Anda menganggap bahwa data kami berisi materi milik Anda dan oleh karena itu tidak boleh direproduksi di sini, mohon:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon, atau alamat email yang dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim telah dilanggar.
- Identifikasi dengan jelas materi yang diklaim melanggar dan informasi cukup memadai untuk memungkinkan kami menemukan materi tersebut.
Kami akan memenuhi permintaan yang sah dengan menghapus sumber yang terpengaruh dari rilis korpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 9006977 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
tidak dikocok_deduplikasi_av
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_av')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis berdasarkan skema lisensi ini. Kami tidak memiliki teks apa pun yang menjadi sumber pengambilan data ini. Kami melisensikan pengemasan sebenarnya dari data ini di bawah lisensi Creative Commons CC0 ("tidak ada hak yang dilindungi undang-undang") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh memungkinkan berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau hak tetangga atas OSCAR Karya ini diterbitkan dari: Perancis.
Jika Anda menganggap bahwa data kami berisi materi milik Anda dan oleh karena itu tidak boleh direproduksi di sini, mohon:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon, atau alamat email yang dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim dilanggar.
- Identifikasi materi yang diklaim secara jelas melanggar dan informasi yang cukup memadai untuk memungkinkan kita menemukan materi.
Kami akan mematuhi permintaan yang sah dengan menghapus sumber yang terkena dampak dari rilis corpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 360 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bar
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bar')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis dalam skema lisensi ini, kami tidak memiliki teks apa pun dari mana data ini telah diekstraksi. Kami melisensikan kemasan aktual data ini di bawah lisensi Creative Commons CC0 ("No Rights Reserved") http://creativeCommons.org/publicdomain/zero/1.0/ Sejauh mungkin berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau terkait atau atau terkait atau terkait atau atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait dengan hukum semua Hak tetangga untuk Oscar karya ini diterbitkan dari: Prancis.
Jika Anda mempertimbangkan bahwa data kami berisi materi yang dimiliki oleh Anda dan karenanya tidak boleh direproduksi di sini, tolong:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon atau alamat email di mana Anda dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim dilanggar.
- Identifikasi materi yang diklaim secara jelas melanggar dan informasi yang cukup memadai untuk memungkinkan kita menemukan materi.
Kami akan mematuhi permintaan yang sah dengan menghapus sumber yang terkena dampak dari rilis corpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 4 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bh
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bh')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis dalam skema lisensi ini, kami tidak memiliki teks apa pun dari mana data ini telah diekstraksi. Kami melisensikan pengemasan aktual data ini di bawah lisensi Creative Commons CC0 ("No Rights Reserved") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh mungkin berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau terkait atau atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait dengan hukum semua Hak tetangga untuk Oscar karya ini diterbitkan dari: Prancis.
Jika Anda mempertimbangkan bahwa data kami berisi materi yang dimiliki oleh Anda dan karenanya tidak boleh direproduksi di sini, tolong:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon atau alamat email di mana Anda dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim dilanggar.
- Identifikasi materi yang diklaim secara jelas melanggar dan informasi yang cukup memadai untuk memungkinkan kita menemukan materi.
Kami akan mematuhi permintaan yang sah dengan menghapus sumber yang terkena dampak dari rilis corpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 82 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_br
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_br')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis dalam skema lisensi ini, kami tidak memiliki teks apa pun dari mana data ini telah diekstraksi. Kami melisensikan pengemasan aktual data ini di bawah lisensi Creative Commons CC0 ("No Rights Reserved") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh mungkin berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau terkait atau atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait dengan hukum semua Hak tetangga untuk Oscar karya ini diterbitkan dari: Prancis.
Jika Anda mempertimbangkan bahwa data kami berisi materi yang dimiliki oleh Anda dan karenanya tidak boleh direproduksi di sini, tolong:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon atau alamat email di mana Anda dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim dilanggar.
- Identifikasi materi yang diklaim secara jelas melanggar dan informasi yang cukup memadai untuk memungkinkan kita menemukan materi.
Kami akan mematuhi permintaan yang sah dengan menghapus sumber yang terkena dampak dari rilis corpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 14724 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_cbk
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cbk')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis dalam skema lisensi ini, kami tidak memiliki teks apa pun dari mana data ini telah diekstraksi. Kami melisensikan pengemasan aktual data ini di bawah lisensi Creative Commons CC0 ("No Rights Reserved") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh mungkin berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau terkait atau atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait dengan hukum semua Hak tetangga untuk Oscar karya ini diterbitkan dari: Prancis.
Jika Anda mempertimbangkan bahwa data kami berisi materi yang dimiliki oleh Anda dan karenanya tidak boleh direproduksi di sini, tolong:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon atau alamat email di mana Anda dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim dilanggar.
- Identifikasi materi yang diklaim secara jelas melanggar dan informasi yang cukup memadai untuk memungkinkan kita menemukan materi.
Kami akan mematuhi permintaan yang sah dengan menghapus sumber yang terkena dampak dari rilis corpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 1 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_da
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_da')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis dalam skema lisensi ini, kami tidak memiliki teks apa pun dari mana data ini telah diekstraksi. Kami melisensikan pengemasan aktual data ini di bawah lisensi Creative Commons CC0 ("No Rights Reserved") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh mungkin berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau terkait atau atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait dengan hukum semua Hak tetangga untuk Oscar karya ini diterbitkan dari: Prancis.
Jika Anda mempertimbangkan bahwa data kami berisi materi yang dimiliki oleh Anda dan karenanya tidak boleh direproduksi di sini, tolong:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon atau alamat email di mana Anda dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim dilanggar.
- Identifikasi materi yang diklaim secara jelas melanggar dan informasi yang cukup memadai untuk memungkinkan kita menemukan materi.
Kami akan mematuhi permintaan yang sah dengan menghapus sumber yang terkena dampak dari rilis corpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 4771098 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_dv
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dv')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis dalam skema lisensi ini, kami tidak memiliki teks apa pun dari mana data ini telah diekstraksi. Kami melisensikan pengemasan aktual data ini di bawah lisensi Creative Commons CC0 ("No Rights Reserved") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh mungkin berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau terkait atau atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait dengan hukum semua Hak tetangga untuk Oscar karya ini diterbitkan dari: Prancis.
Jika Anda mempertimbangkan bahwa data kami berisi materi yang dimiliki oleh Anda dan karenanya tidak boleh direproduksi di sini, tolong:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon atau alamat email di mana Anda dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim dilanggar.
- Identifikasi materi yang diklaim secara jelas melanggar dan informasi yang cukup memadai untuk memungkinkan kita menemukan materi.
Kami akan mematuhi permintaan yang sah dengan menghapus sumber yang terkena dampak dari rilis corpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 17024 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_eo
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eo')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis dalam skema lisensi ini, kami tidak memiliki teks apa pun dari mana data ini telah diekstraksi. Kami melisensikan pengemasan aktual data ini di bawah lisensi Creative Commons CC0 ("No Rights Reserved") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh mungkin berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau terkait atau atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait dengan hukum semua Hak tetangga untuk Oscar karya ini diterbitkan dari: Prancis.
Jika Anda mempertimbangkan bahwa data kami berisi materi yang dimiliki oleh Anda dan karenanya tidak boleh direproduksi di sini, tolong:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon atau alamat email di mana Anda dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim dilanggar.
- Identifikasi materi yang diklaim secara jelas melanggar dan informasi yang cukup memadai untuk memungkinkan kita menemukan materi.
Kami akan mematuhi permintaan yang sah dengan menghapus sumber yang terkena dampak dari rilis corpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 84752 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_fa
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fa')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis dalam skema lisensi ini, kami tidak memiliki teks apa pun dari mana data ini telah diekstraksi. Kami melisensikan pengemasan aktual data ini di bawah lisensi Creative Commons CC0 ("No Rights Reserved") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh mungkin berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau terkait atau atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait dengan hukum semua Hak tetangga untuk Oscar karya ini diterbitkan dari: Prancis.
Jika Anda mempertimbangkan bahwa data kami berisi materi yang dimiliki oleh Anda dan karenanya tidak boleh direproduksi di sini, tolong:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon atau alamat email di mana Anda dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim dilanggar.
- Identifikasi materi yang diklaim secara jelas melanggar dan informasi yang cukup memadai untuk memungkinkan kita menemukan materi.
Kami akan mematuhi permintaan yang sah dengan menghapus sumber yang terkena dampak dari rilis corpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 8203495 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_fy
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fy')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis dalam skema lisensi ini, kami tidak memiliki teks apa pun dari mana data ini telah diekstraksi. Kami melisensikan pengemasan aktual data ini di bawah lisensi Creative Commons CC0 ("No Rights Reserved") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh mungkin berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau terkait atau atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait dengan hukum semua Hak tetangga untuk Oscar karya ini diterbitkan dari: Prancis.
Jika Anda mempertimbangkan bahwa data kami berisi materi yang dimiliki oleh Anda dan karenanya tidak boleh direproduksi di sini, tolong:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon atau alamat email di mana Anda dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim dilanggar.
- Identifikasi materi yang diklaim secara jelas melanggar dan informasi yang cukup memadai untuk memungkinkan kita menemukan materi.
Kami akan mematuhi permintaan yang sah dengan menghapus sumber yang terkena dampak dari rilis corpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 20661 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_gn
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gn')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis dalam skema lisensi ini, kami tidak memiliki teks apa pun dari mana data ini telah diekstraksi. Kami melisensikan pengemasan aktual data ini di bawah lisensi Creative Commons CC0 ("No Rights Reserved") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh mungkin berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau terkait atau atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait dengan hukum semua Hak tetangga untuk Oscar karya ini diterbitkan dari: Prancis.
Jika Anda mempertimbangkan bahwa data kami berisi materi yang dimiliki oleh Anda dan karenanya tidak boleh direproduksi di sini, tolong:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon atau alamat email di mana Anda dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim dilanggar.
- Identifikasi materi yang diklaim secara jelas melanggar dan informasi yang cukup memadai untuk memungkinkan kita menemukan materi.
Kami akan mematuhi permintaan yang sah dengan menghapus sumber yang terkena dampak dari rilis corpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 68 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_cs
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cs')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis dalam skema lisensi ini, kami tidak memiliki teks apa pun dari mana data ini telah diekstraksi. Kami melisensikan pengemasan aktual data ini di bawah lisensi Creative Commons CC0 ("No Rights Reserved") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh mungkin berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau terkait atau atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait dengan hukum semua Hak tetangga untuk Oscar karya ini diterbitkan dari: Prancis.
Jika Anda mempertimbangkan bahwa data kami berisi materi yang dimiliki oleh Anda dan karenanya tidak boleh direproduksi di sini, tolong:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon atau alamat email di mana Anda dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim dilanggar.
- Identifikasi materi yang diklaim secara jelas melanggar dan informasi yang cukup memadai untuk memungkinkan kita menemukan materi.
Kami akan mematuhi permintaan yang sah dengan menghapus sumber yang terkena dampak dari rilis corpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 12308039 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_hi
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hi')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis dalam skema lisensi ini, kami tidak memiliki teks apa pun dari mana data ini telah diekstraksi. Kami melisensikan pengemasan aktual data ini di bawah lisensi Creative Commons CC0 ("No Rights Reserved") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh mungkin berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau terkait atau atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait dengan hukum semua Hak tetangga untuk Oscar karya ini diterbitkan dari: Prancis.
Jika Anda mempertimbangkan bahwa data kami berisi materi yang dimiliki oleh Anda dan karenanya tidak boleh direproduksi di sini, tolong:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon atau alamat email di mana Anda dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim dilanggar.
- Identifikasi materi yang diklaim secara jelas melanggar dan informasi yang cukup memadai untuk memungkinkan kita menemukan materi.
Kami akan mematuhi permintaan yang sah dengan menghapus sumber yang terkena dampak dari rilis corpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 1909387 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_hu
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hu')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis dalam skema lisensi ini, kami tidak memiliki teks apa pun dari mana data ini telah diekstraksi. Kami melisensikan pengemasan aktual data ini di bawah lisensi Creative Commons CC0 ("No Rights Reserved") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh mungkin berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau terkait atau atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait dengan hukum semua Hak tetangga untuk Oscar karya ini diterbitkan dari: Prancis.
Jika Anda mempertimbangkan bahwa data kami berisi materi yang dimiliki oleh Anda dan karenanya tidak boleh direproduksi di sini, tolong:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon atau alamat email di mana Anda dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim dilanggar.
- Identifikasi materi yang diklaim secara jelas melanggar dan informasi yang cukup memadai untuk memungkinkan kita menemukan materi.
Kami akan mematuhi permintaan yang sah dengan menghapus sumber yang terkena dampak dari rilis corpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 6582908 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ie
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ie')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis dalam skema lisensi ini, kami tidak memiliki teks apa pun dari mana data ini telah diekstraksi. Kami melisensikan pengemasan aktual data ini di bawah lisensi Creative Commons CC0 ("No Rights Reserved") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh mungkin berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau terkait atau atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait dengan hukum semua Hak tetangga untuk Oscar karya ini diterbitkan dari: Prancis.
Jika Anda mempertimbangkan bahwa data kami berisi materi yang dimiliki oleh Anda dan karenanya tidak boleh direproduksi di sini, tolong:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon atau alamat email di mana Anda dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim dilanggar.
- Identifikasi materi yang diklaim secara jelas melanggar dan informasi yang cukup memadai untuk memungkinkan kita menemukan materi.
Kami akan mematuhi permintaan yang sah dengan menghapus sumber yang terkena dampak dari rilis corpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 11 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_fr
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fr')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis dalam skema lisensi ini, kami tidak memiliki teks apa pun dari mana data ini telah diekstraksi. Kami melisensikan pengemasan aktual data ini di bawah lisensi Creative Commons CC0 ("No Rights Reserved") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh mungkin berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau terkait atau atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait dengan hukum semua Hak tetangga untuk Oscar karya ini diterbitkan dari: Prancis.
Jika Anda mempertimbangkan bahwa data kami berisi materi yang dimiliki oleh Anda dan karenanya tidak boleh direproduksi di sini, tolong:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon atau alamat email di mana Anda dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim dilanggar.
- Identifikasi materi yang diklaim secara jelas melanggar dan informasi yang cukup memadai untuk memungkinkan kita menemukan materi.
Kami akan mematuhi permintaan yang sah dengan menghapus sumber yang terkena dampak dari rilis corpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 59448891 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_gd
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gd')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis dalam skema lisensi ini, kami tidak memiliki teks apa pun dari mana data ini telah diekstraksi. Kami melisensikan pengemasan aktual data ini di bawah lisensi Creative Commons CC0 ("No Rights Reserved") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh mungkin berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau terkait atau atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait dengan hukum semua Hak tetangga untuk Oscar karya ini diterbitkan dari: Prancis.
Jika Anda mempertimbangkan bahwa data kami berisi materi yang dimiliki oleh Anda dan karenanya tidak boleh direproduksi di sini, tolong:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon atau alamat email di mana Anda dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim dilanggar.
- Identifikasi materi yang diklaim secara jelas melanggar dan informasi yang cukup memadai untuk memungkinkan kita menemukan materi.
Kami akan mematuhi permintaan yang sah dengan menghapus sumber yang terkena dampak dari rilis corpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 3883 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_gu
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gu')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis dalam skema lisensi ini, kami tidak memiliki teks apa pun dari mana data ini telah diekstraksi. Kami melisensikan pengemasan aktual data ini di bawah lisensi Creative Commons CC0 ("No Rights Reserved") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh mungkin berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau terkait atau atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait dengan hukum semua Hak tetangga untuk Oscar karya ini diterbitkan dari: Prancis.
Jika Anda mempertimbangkan bahwa data kami berisi materi yang dimiliki oleh Anda dan karenanya tidak boleh direproduksi di sini, tolong:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon atau alamat email di mana Anda dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim dilanggar.
- Identifikasi materi yang diklaim secara jelas melanggar dan informasi yang cukup memadai untuk memungkinkan kita menemukan materi.
Kami akan mematuhi permintaan yang sah dengan menghapus sumber yang terkena dampak dari rilis corpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 169834 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_hsb
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hsb')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis dalam skema lisensi ini, kami tidak memiliki teks apa pun dari mana data ini telah diekstraksi. Kami melisensikan pengemasan aktual data ini di bawah lisensi Creative Commons CC0 ("No Rights Reserved") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh mungkin berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau terkait atau atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait dengan hukum semua Hak tetangga untuk Oscar karya ini diterbitkan dari: Prancis.
Jika Anda mempertimbangkan bahwa data kami berisi materi yang dimiliki oleh Anda dan karenanya tidak boleh direproduksi di sini, tolong:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon atau alamat email di mana Anda dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim dilanggar.
- Identifikasi materi yang diklaim secara jelas melanggar dan informasi yang cukup memadai untuk memungkinkan kita menemukan materi.
Kami akan mematuhi permintaan yang sah dengan menghapus sumber yang terkena dampak dari rilis corpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 3084 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ia
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ia')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis dalam skema lisensi ini, kami tidak memiliki teks apa pun dari mana data ini telah diekstraksi. Kami melisensikan pengemasan aktual data ini di bawah lisensi Creative Commons CC0 ("No Rights Reserved") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh mungkin berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau terkait atau atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait dengan hukum semua Hak tetangga untuk Oscar karya ini diterbitkan dari: Prancis.
Jika Anda mempertimbangkan bahwa data kami berisi materi yang dimiliki oleh Anda dan karenanya tidak boleh direproduksi di sini, tolong:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon atau alamat email di mana Anda dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim dilanggar.
- Identifikasi materi yang diklaim secara jelas melanggar dan informasi yang cukup memadai untuk memungkinkan kita menemukan materi.
Kami akan mematuhi permintaan yang sah dengan menghapus sumber yang terkena dampak dari rilis corpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 529 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_io
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_io')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis dalam skema lisensi ini, kami tidak memiliki teks apa pun dari mana data ini telah diekstraksi. Kami melisensikan pengemasan aktual data ini di bawah lisensi Creative Commons CC0 ("No Rights Reserved") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh mungkin berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau terkait atau atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait dengan hukum semua Hak tetangga untuk Oscar karya ini diterbitkan dari: Prancis.
Jika Anda mempertimbangkan bahwa data kami berisi materi yang dimiliki oleh Anda dan karenanya tidak boleh direproduksi di sini, tolong:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon atau alamat email di mana Anda dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim dilanggar.
- Identifikasi materi yang diklaim secara jelas melanggar dan informasi yang cukup memadai untuk memungkinkan kita menemukan materi.
Kami akan mematuhi permintaan yang sah dengan menghapus sumber yang terkena dampak dari rilis corpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 617 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_jbo
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_jbo')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis dalam skema lisensi ini, kami tidak memiliki teks apa pun dari mana data ini telah diekstraksi. Kami melisensikan pengemasan aktual data ini di bawah lisensi Creative Commons CC0 ("No Rights Reserved") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh mungkin berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau terkait atau atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait dengan hukum semua Hak tetangga untuk Oscar karya ini diterbitkan dari: Prancis.
Jika Anda mempertimbangkan bahwa data kami berisi materi yang dimiliki oleh Anda dan karenanya tidak boleh direproduksi di sini, tolong:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon atau alamat email di mana Anda dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim dilanggar.
- Identifikasi materi yang diklaim secara jelas melanggar dan informasi yang cukup memadai untuk memungkinkan kita menemukan materi.
Kami akan mematuhi permintaan yang sah dengan menghapus sumber yang terkena dampak dari rilis corpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 617 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_km
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_km')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis dalam skema lisensi ini, kami tidak memiliki teks apa pun dari mana data ini telah diekstraksi. Kami melisensikan pengemasan aktual data ini di bawah lisensi Creative Commons CC0 ("No Rights Reserved") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh mungkin berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau terkait atau atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait dengan hukum semua Hak tetangga untuk Oscar karya ini diterbitkan dari: Prancis.
Jika Anda mempertimbangkan bahwa data kami berisi materi yang dimiliki oleh Anda dan karenanya tidak boleh direproduksi di sini, tolong:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon atau alamat email di mana Anda dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim dilanggar.
- Identifikasi materi yang diklaim secara jelas melanggar dan informasi yang cukup memadai untuk memungkinkan kita menemukan materi.
Kami akan mematuhi permintaan yang sah dengan menghapus sumber yang terkena dampak dari rilis corpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 108346 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ku
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ku')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis dalam skema lisensi ini, kami tidak memiliki teks apa pun dari mana data ini telah diekstraksi. Kami melisensikan pengemasan aktual data ini di bawah lisensi Creative Commons CC0 ("No Rights Reserved") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh mungkin berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau terkait atau atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait dengan hukum semua Hak tetangga untuk Oscar karya ini diterbitkan dari: Prancis.
Jika Anda mempertimbangkan bahwa data kami berisi materi yang dimiliki oleh Anda dan karenanya tidak boleh direproduksi di sini, tolong:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon atau alamat email di mana Anda dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim dilanggar.
- Identifikasi materi yang diklaim secara jelas melanggar dan informasi yang cukup memadai untuk memungkinkan kita menemukan materi.
Kami akan mematuhi permintaan yang sah dengan menghapus sumber yang terkena dampak dari rilis corpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 29054 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_la
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_la')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis dalam skema lisensi ini, kami tidak memiliki teks apa pun dari mana data ini telah diekstraksi. Kami melisensikan pengemasan aktual data ini di bawah lisensi Creative Commons CC0 ("No Rights Reserved") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh mungkin berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau terkait atau atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait dengan hukum semua Hak tetangga untuk Oscar karya ini diterbitkan dari: Prancis.
Jika Anda mempertimbangkan bahwa data kami berisi materi yang dimiliki oleh Anda dan karenanya tidak boleh direproduksi di sini, tolong:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon atau alamat email di mana Anda dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim dilanggar.
- Identifikasi materi yang diklaim secara jelas melanggar dan informasi yang cukup memadai untuk memungkinkan kita menemukan materi.
Kami akan mematuhi permintaan yang sah dengan menghapus sumber yang terkena dampak dari rilis corpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 18808 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lmo
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lmo')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis dalam skema lisensi ini, kami tidak memiliki teks apa pun dari mana data ini telah diekstraksi. Kami melisensikan pengemasan aktual data ini di bawah lisensi Creative Commons CC0 ("No Rights Reserved") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh mungkin berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau terkait atau atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait dengan hukum semua Hak tetangga untuk Oscar karya ini diterbitkan dari: Prancis.
Jika Anda mempertimbangkan bahwa data kami berisi materi yang dimiliki oleh Anda dan karenanya tidak boleh direproduksi di sini, tolong:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon atau alamat email di mana Anda dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim dilanggar.
- Identifikasi materi yang diklaim secara jelas melanggar dan informasi yang cukup memadai untuk memungkinkan kita menemukan materi.
Kami akan mematuhi permintaan yang sah dengan menghapus sumber yang terkena dampak dari rilis corpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 1374 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lv
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lv')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis dalam skema lisensi ini, kami tidak memiliki teks apa pun dari mana data ini telah diekstraksi. Kami melisensikan pengemasan aktual data ini di bawah lisensi Creative Commons CC0 ("No Rights Reserved") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh mungkin berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau terkait atau atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait dengan hukum semua Hak tetangga untuk Oscar karya ini diterbitkan dari: Prancis.
Jika Anda mempertimbangkan bahwa data kami berisi materi yang dimiliki oleh Anda dan karenanya tidak boleh direproduksi di sini, tolong:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon atau alamat email di mana Anda dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim dilanggar.
- Identifikasi materi yang diklaim secara jelas melanggar dan informasi yang cukup memadai untuk memungkinkan kita menemukan materi.
Kami akan mematuhi permintaan yang sah dengan menghapus sumber yang terkena dampak dari rilis corpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 843195 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_min
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_min')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis dalam skema lisensi ini, kami tidak memiliki teks apa pun dari mana data ini telah diekstraksi. Kami melisensikan pengemasan aktual data ini di bawah lisensi Creative Commons CC0 ("No Rights Reserved") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh mungkin berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau terkait atau atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait dengan hukum semua Hak tetangga untuk Oscar karya ini diterbitkan dari: Prancis.
Jika Anda mempertimbangkan bahwa data kami berisi materi yang dimiliki oleh Anda dan karenanya tidak boleh direproduksi di sini, tolong:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon atau alamat email di mana Anda dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim dilanggar.
- Identifikasi materi yang diklaim secara jelas melanggar dan informasi yang cukup memadai untuk memungkinkan kita menemukan materi.
Kami akan mematuhi permintaan yang sah dengan menghapus sumber yang terkena dampak dari rilis corpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 166 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mr
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mr')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis dalam skema lisensi ini, kami tidak memiliki teks apa pun dari mana data ini telah diekstraksi. Kami melisensikan pengemasan aktual data ini di bawah lisensi Creative Commons CC0 ("No Rights Reserved") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh mungkin berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau terkait atau atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait dengan hukum semua Hak tetangga untuk Oscar karya ini diterbitkan dari: Prancis.
Jika Anda mempertimbangkan bahwa data kami berisi materi yang dimiliki oleh Anda dan karenanya tidak boleh direproduksi di sini, tolong:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon atau alamat email di mana Anda dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim dilanggar.
- Identifikasi materi yang diklaim secara jelas melanggar dan informasi yang cukup memadai untuk memungkinkan kita menemukan materi.
Kami akan mematuhi permintaan yang sah dengan menghapus sumber yang terkena dampak dari rilis corpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 212556 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mwl
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mwl')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis dalam skema lisensi ini, kami tidak memiliki teks apa pun dari mana data ini telah diekstraksi. Kami melisensikan pengemasan aktual data ini di bawah lisensi Creative Commons CC0 ("No Rights Reserved") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh mungkin berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau terkait atau atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait dengan hukum semua Hak tetangga untuk Oscar karya ini diterbitkan dari: Prancis.
Jika Anda mempertimbangkan bahwa data kami berisi materi yang dimiliki oleh Anda dan karenanya tidak boleh direproduksi di sini, tolong:
- Identifikasi diri Anda dengan jelas, dengan data kontak terperinci seperti alamat, nomor telepon atau alamat email di mana Anda dapat dihubungi.
- Identifikasi dengan jelas karya berhak cipta yang diklaim dilanggar.
- Identifikasi materi yang diklaim secara jelas melanggar dan informasi yang cukup memadai untuk memungkinkan kita menemukan materi.
Kami akan mematuhi permintaan yang sah dengan menghapus sumber yang terkena dampak dari rilis corpus berikutnya.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 7 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_nah
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nah')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Lisensi : Data ini dirilis dalam skema lisensi ini, kami tidak memiliki teks apa pun dari mana data ini telah diekstraksi. Kami melisensikan pengemasan aktual data ini di bawah lisensi Creative Commons CC0 ("No Rights Reserved") http://creativecommons.org/publicdomain/zero/1.0/ Sejauh mungkin berdasarkan hukum, Inria telah melepaskan semua hak cipta dan terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau atau terkait atau terkait atau atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait atau terkait dengan hukum semua Hak tetangga untuk Oscar karya ini diterbitkan dari: Prancis.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 58 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_new
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_new')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 2126 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_oc
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_oc')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 6485 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pam
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pam')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 1 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ps
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ps')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 67921 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_it
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_it')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 28522082 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ka
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ka')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 372158 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ro
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ro')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 5044757 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_scn
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_scn')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 17 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ko
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ko')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 3675420 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_kw
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kw')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 68 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lez
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lez')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 1381 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lrc
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lrc')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 72 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mg
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mg')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 13343 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ml
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ml')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 453904 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ms
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ms')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 183443 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_myv
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_myv')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 5 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_nds
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nds')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 8714 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_nn
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nn')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 109118 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_os
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_os')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 2559 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pms
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pms')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 2859 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_qu
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_qu')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 411 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sa
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sa')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 7121 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sk
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sk')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 2820821 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sh
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sh')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 17610 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_so
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_so')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 42 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sr
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sr')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 645747 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ta
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ta')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 833101 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tk
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tk')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 4694 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tyv
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tyv')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 24 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_uz
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_uz')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 15074 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_wa
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_wa')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 677 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_xmf
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_xmf')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 2418 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sv
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sv')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 11014487 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tg
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tg')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 56259 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_de
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_de')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 62398034 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tr
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tr')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 11596446 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_el
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_el')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 6521169 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_uk
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_uk')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 7782375 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_vi
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vi')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 9897709 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_wuu
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_wuu')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 64 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_yo
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yo')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 49 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_als
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_als')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 7324 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_arz
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_arz')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 158113 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_az
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_az')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 912330 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bcl
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bcl')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 1 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bn
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bn')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 1675515 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bs
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bs')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 2143 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ce
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ce')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 4042 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_cv
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_cv')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 20281 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_diq
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_diq')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 1 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_eml
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_eml')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 84 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_et
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_et')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 2093621 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_zh
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_zh')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 41708901 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_an
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_an')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 2449 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ast
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ast')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 6999 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ba
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ba')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 42551 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bg
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bg')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 5869686 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bpy
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bpy')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 6046 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ca
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ca')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 4390754 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ckb
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ckb')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 103639 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_es
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_es')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 56326016 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_da
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_da')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 7664010 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_dv
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_dv')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 21018 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_eo
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_eo')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 121168 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_fi
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fi')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 5326443 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ga
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ga')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 46493 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_gom
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gom')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 484 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_hr
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hr')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 321484 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_hy
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hy')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 396093 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ilo
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ilo')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 1578 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_fa
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_fa')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 13704702 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_fy
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_fy')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 33053 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_gn
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_gn')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 106 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_hi
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_hi')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 3264660 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_hu
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_hu')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 11197780 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ie
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ie')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 101 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ja
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ja')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 39496439 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_kk
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kk')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 338073 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_krc
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_krc')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 1377 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ky
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ky')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 86561 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_li
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_li')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 118 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lt
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lt')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 1737411 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mhr
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mhr')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 2515 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mn
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mn')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 197878 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mt
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mt')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 16383 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mzn
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mzn')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 917 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ne
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ne')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 219334 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_no
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_no')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 3229940 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pa
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pa')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 87235 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pnb
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pnb')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 3463 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_rm
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_rm')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 34 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sah
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sah')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 8555 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_si
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_si')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 120684 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sq
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sq')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 461598 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sw
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sw')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 24803 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_th
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_th')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 3749826 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tt
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tt')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 82738 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ur
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ur')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 428674 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_vo
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vo')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 3317 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_xal
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_xal')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 36 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_yue
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yue')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 7 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_am
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_am')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 83663 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_as
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_as')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 14985 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_azb
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_azb')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 15446 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_be
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_be')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 586031 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bo
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bo')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 26795 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bxr
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bxr')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 42 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ceb
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ceb')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 56248 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_cy
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_cy')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 157698 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_dsb
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_dsb')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 65 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_fr
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_fr')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 96742378 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_gd
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_gd')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 5799 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_gu
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_gu')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 240691 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_hsb
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_hsb')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 7959 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ia
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ia')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 1040 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_io
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_io')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 694 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_jbo
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_jbo')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 832 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_km
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_km')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 159363 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ku
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ku')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 46535 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_la
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_la')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 94588 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lmo
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lmo')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 1401 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lv
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lv')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 1593820 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_min
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_min')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 220 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mr
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mr')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 326804 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mwl
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mwl')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 8 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_nah
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_nah')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 61 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_new
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_new')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 4696 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_oc
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_oc')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 10709 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pam
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_pam')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 3 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ps
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ps')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 98216 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ro
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ro')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 9387265 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_scn
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_scn')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 21 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sk
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sk')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 5492194 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sr
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sr')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 1013619 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ta
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ta')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 1263280 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tk
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_tk')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 6456 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tyv
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_tyv')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 34 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_uz
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_uz')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 27537 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_wa
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_wa')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 1001 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_xmf
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_xmf')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 3783 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_it
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_it')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 46981781 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ka
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ka')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 563916 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ko
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ko')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 7345075 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_kw
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_kw')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 203 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lez
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lez')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 1485 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lrc
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lrc')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 88 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mg
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mg')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 17957 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ml
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ml')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 603937 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ms
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ms')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 534016 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_myv
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_myv')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 6 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_nds
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_nds')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 18174 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_nn
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_nn')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 185884 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_os
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_os')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 5213 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pms
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_pms')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 3225 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_qu
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_qu')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 452 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sa
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sa')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 14291 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sh
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sh')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 36700 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_so
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_so')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 156 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sv
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sv')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 17395625 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tg
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_tg')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 89002 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tr
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_tr')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 18535253 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_uk
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_uk')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 12973467 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_vi
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_vi')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 14898250 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_wuu
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_wuu')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 214 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_yo
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_yo')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 214 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_zh
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_zh')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 60137667 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_en
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_en')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 304230423 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_eu
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eu')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 256513 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_frr
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_frr')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 7 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_gl
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gl')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 284320 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_he
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_he')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 2375030 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ht
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ht')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 9 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_id
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_id')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 9948521 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_is
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_is')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 389515 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_jv
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_jv')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 1163 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_kn
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kn')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 251064 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_kv
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kv')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 924 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lb
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lb')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 21735 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lo
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lo')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 32652 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mai
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mai')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 25 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mk
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mk')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 299457 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mrj
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mrj')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 669 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_my
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_my')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 136639 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_nap
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nap')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 55 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_nl
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nl')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 20812149 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_or
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_or')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 44230 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pl
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pl')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 20682611 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pt
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pt')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 26920397 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ru
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ru')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 115954598 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sd
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sd')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 33925 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sl
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sl')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 886223 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_su
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_su')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 511 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_te
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_te')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 312644 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tl
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tl')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 294132 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ug
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ug')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 15503 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_vec
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vec')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 64 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_war
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_war')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 9161 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_yi
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yi')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 32919 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_af
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_af')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 201117 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ar
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ar')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 16365602 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_av
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_av')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 456 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bar
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bar')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 4 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bh
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bh')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 336 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_br
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_br')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 37085 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_cbk
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_cbk')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 1 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_cs
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_cs')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 21001388 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_de
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_de')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 104913504 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_el
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_el')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 10425596 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_es
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_es')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 88199221 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_fi
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_fi')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 8557453 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ga
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ga')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 83223 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_gom
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_gom')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 640 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_hr
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_hr')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 582219 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_hy
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_hy')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 659430 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ilo
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ilo')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 2638 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ja
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ja')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 62721527 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_kk
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_kk')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 524591 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_krc
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_krc')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 1581 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ky
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ky')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 146993 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_li
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_li')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 137 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lt
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lt')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 2977757 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mhr
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mhr')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 3212 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mn
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mn')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 395605 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mt
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mt')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 26598 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mzn
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mzn')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 1055 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ne
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ne')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 299938 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_no
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_no')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 5546211 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pa
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_pa')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 127467 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pnb
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_pnb')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 4599 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_rm
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_rm')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 41 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sah
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sah')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 22301 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_si
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_si')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 203082 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sq
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sq')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 672077 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sw
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sw')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 41986 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_th
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_th')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 6064129 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tt
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_tt')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 135923 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ur
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ur')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 638596 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_vo
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_vo')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 3366 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_xal
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_xal')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 39 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_yue
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_yue')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 11 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_en
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_en')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 455994980 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_eu
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_eu')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 506883 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_frr
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_frr')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 7 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_gl
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_gl')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 544388 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_he
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_he')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 3808397 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ht
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ht')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 13 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_id
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_id')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 16236463 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_is
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_is')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 625673 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_jv
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_jv')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 1445 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_kn
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_kn')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 350363 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_kv
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_kv')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 1549 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lb
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lb')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 34807 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lo
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lo')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 52910 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mai
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mai')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 123 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mk
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mk')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 437871 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mrj
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mrj')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 757 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_my
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_my')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 232329 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_nap
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_nap')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 73 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_nl
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_nl')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 34682142 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_or
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_or')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 59463 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pl
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_pl')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 35440972 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pt
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_pt')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 42114520 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ru
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ru')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 161836003 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sd
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sd')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 44280 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sl
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sl')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 1746604 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_su
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_su')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 805 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_te
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_te')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 475703 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tl
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_tl')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 458206 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ug
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ug')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 22255 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_vec
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_vec')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 73 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_war
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_war')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 9760 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_yi
Gunakan perintah berikut untuk memuat kumpulan data ini di TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_yi')
- Keterangan :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Versi : 1.0.0
Perpecahan :
Membelah | Contoh |
---|---|
'train' | 59364 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}