مراجع:
unshuffled_deduplicated_af
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_af')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح مجوز منتشر می شوند. ما مالک هیچ یک از متن هایی نیستیم که این داده ها از آن استخراج شده است. ما بستهبندی واقعی این دادهها را تحت مجوز Creative Commons CC0 مجوز میدهیم ("بدون حقوق محفوظ است") http://creativecommons.org/publicdomain/zero/1.0/ تا آنجایی که طبق قانون ممکن است، اینریا تمام حق نسخهبرداری و یا مربوط به آن را لغو کرده است. حقوق همسایگی OSCAR این اثر از: فرانسه منتشر شده است.
اگر فکر می کنید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا بازتولید شود، لطفاً:
- با اطلاعات تماس دقیق مانند آدرس، شماره تلفن یا آدرس ایمیلی که می توان با شما تماس گرفت، به وضوح خود را شناسایی کنید.
- اثر دارای حق نسخه برداری که ادعا می شود نقض شده است را به وضوح شناسایی کنید.
- مطالبی را که ادعا میشود نقضکننده حقوق است و اطلاعاتی که بهطور معقولی برای یافتن مطالب به ما امکان میدهد، به وضوح شناسایی کنید.
ما با حذف منابع آسیبدیده از نسخه بعدی مجموعه، درخواستهای قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 130640 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_als
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_als')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح مجوز منتشر می شوند. ما مالک هیچ یک از متن هایی نیستیم که این داده ها از آن استخراج شده است. ما بستهبندی واقعی این دادهها را تحت مجوز Creative Commons CC0 مجوز میدهیم ("بدون حقوق محفوظ است") http://creativecommons.org/publicdomain/zero/1.0/ تا آنجایی که طبق قانون ممکن است، اینریا تمام حق نسخهبرداری و یا مربوط به آن را لغو کرده است. حقوق همسایگی OSCAR این اثر از: فرانسه منتشر شده است.
اگر فکر می کنید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا بازتولید شود، لطفاً:
- با اطلاعات تماس دقیق مانند آدرس، شماره تلفن یا آدرس ایمیلی که می توان با شما تماس گرفت، به وضوح خود را شناسایی کنید.
- اثر دارای حق نسخه برداری که ادعا می شود نقض شده است را به وضوح شناسایی کنید.
- مطالبی را که ادعا میشود نقضکننده حقوق است و اطلاعاتی که بهطور معقولی برای یافتن مطالب به ما امکان میدهد، به وضوح شناسایی کنید.
ما با حذف منابع آسیبدیده از نسخه بعدی مجموعه، درخواستهای قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 4518 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_arz
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_arz')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح مجوز منتشر می شوند. ما مالک هیچ یک از متن هایی نیستیم که این داده ها از آن استخراج شده است. ما بستهبندی واقعی این دادهها را تحت مجوز Creative Commons CC0 مجوز میدهیم ("بدون حقوق محفوظ است") http://creativecommons.org/publicdomain/zero/1.0/ تا آنجایی که طبق قانون ممکن است، اینریا تمام حق نسخهبرداری و یا مربوط به آن را لغو کرده است. حقوق همسایگی OSCAR این اثر از: فرانسه منتشر شده است.
اگر فکر می کنید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا بازتولید شود، لطفاً:
- با اطلاعات تماس دقیق مانند آدرس، شماره تلفن یا آدرس ایمیلی که می توان با شما تماس گرفت، به وضوح خود را شناسایی کنید.
- اثر دارای حق نسخه برداری که ادعا می شود نقض شده است را به وضوح شناسایی کنید.
- مطالبی را که ادعا میشود نقضکننده حقوق است و اطلاعاتی که بهطور معقولی برای یافتن مطالب به ما امکان میدهد، به وضوح شناسایی کنید.
ما با حذف منابع آسیبدیده از نسخه بعدی مجموعه، درخواستهای قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 79928 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_an
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_an')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح مجوز منتشر می شوند. ما مالک هیچ یک از متن هایی نیستیم که این داده ها از آن استخراج شده است. ما بستهبندی واقعی این دادهها را تحت مجوز Creative Commons CC0 مجوز میدهیم ("بدون حقوق محفوظ است") http://creativecommons.org/publicdomain/zero/1.0/ تا آنجایی که طبق قانون ممکن است، اینریا تمام حق نسخهبرداری و یا مربوط به آن را لغو کرده است. حقوق همسایگی OSCAR این اثر از: فرانسه منتشر شده است.
اگر فکر می کنید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا بازتولید شود، لطفاً:
- با اطلاعات تماس دقیق مانند آدرس، شماره تلفن یا آدرس ایمیلی که می توان با شما تماس گرفت، به وضوح خود را شناسایی کنید.
- اثر دارای حق نسخه برداری که ادعا می شود نقض شده است را به وضوح شناسایی کنید.
- مطالبی را که ادعا میشود نقضکننده حقوق است و اطلاعاتی که بهطور معقولی برای یافتن مطالب به ما امکان میدهد، به وضوح شناسایی کنید.
ما با حذف منابع آسیبدیده از نسخه بعدی مجموعه، درخواستهای قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 2025 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ast
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ast')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح مجوز منتشر می شوند. ما مالک هیچ یک از متن هایی نیستیم که این داده ها از آن استخراج شده است. ما بستهبندی واقعی این دادهها را تحت مجوز Creative Commons CC0 مجوز میدهیم ("بدون حقوق محفوظ است") http://creativecommons.org/publicdomain/zero/1.0/ تا آنجایی که طبق قانون ممکن است، اینریا تمام حق نسخهبرداری و یا مربوط به آن را لغو کرده است. حقوق همسایگی OSCAR این اثر از: فرانسه منتشر شده است.
اگر فکر می کنید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا بازتولید شود، لطفاً:
- با اطلاعات تماس دقیق مانند آدرس، شماره تلفن یا آدرس ایمیلی که می توان با شما تماس گرفت، به وضوح خود را شناسایی کنید.
- اثر دارای حق نسخه برداری که ادعا می شود نقض شده است را به وضوح شناسایی کنید.
- مطالبی را که ادعا میشود نقضکننده حقوق است و اطلاعاتی که بهطور معقولی برای یافتن مطالب به ما امکان میدهد، به وضوح شناسایی کنید.
ما با حذف منابع آسیبدیده از نسخه بعدی مجموعه، درخواستهای قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 5343 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ba
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ba')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح مجوز منتشر می شوند. ما مالک هیچ یک از متن هایی نیستیم که این داده ها از آن استخراج شده است. ما بستهبندی واقعی این دادهها را تحت مجوز Creative Commons CC0 مجوز میدهیم ("بدون حقوق محفوظ است") http://creativecommons.org/publicdomain/zero/1.0/ تا آنجایی که طبق قانون ممکن است، اینریا تمام حق نسخهبرداری و یا مربوط به آن را لغو کرده است. حقوق همسایگی OSCAR این اثر از: فرانسه منتشر شده است.
اگر فکر می کنید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا بازتولید شود، لطفاً:
- با اطلاعات تماس دقیق مانند آدرس، شماره تلفن یا آدرس ایمیلی که می توان با شما تماس گرفت، به وضوح خود را شناسایی کنید.
- اثر دارای حق نسخه برداری که ادعا می شود نقض شده است را به وضوح شناسایی کنید.
- مطالبی را که ادعا میشود نقضکننده حقوق است و اطلاعاتی که بهطور معقولی برای یافتن مطالب به ما امکان میدهد، به وضوح شناسایی کنید.
ما با حذف منابع آسیبدیده از نسخه بعدی مجموعه، درخواستهای قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 27050 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_am
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_am')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح مجوز منتشر می شوند. ما مالک هیچ یک از متن هایی نیستیم که این داده ها از آن استخراج شده است. ما بستهبندی واقعی این دادهها را تحت مجوز Creative Commons CC0 مجوز میدهیم ("بدون حقوق محفوظ است") http://creativecommons.org/publicdomain/zero/1.0/ تا آنجایی که طبق قانون ممکن است، اینریا تمام حق نسخهبرداری و یا مربوط به آن را لغو کرده است. حقوق همسایگی OSCAR این اثر از: فرانسه منتشر شده است.
اگر فکر می کنید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا بازتولید شود، لطفاً:
- با اطلاعات تماس دقیق مانند آدرس، شماره تلفن یا آدرس ایمیلی که می توان با شما تماس گرفت، به وضوح خود را شناسایی کنید.
- اثر دارای حق نسخه برداری که ادعا می شود نقض شده است را به وضوح شناسایی کنید.
- مطالبی را که ادعا میشود نقضکننده حقوق است و اطلاعاتی که بهطور معقولی برای یافتن مطالب به ما امکان میدهد، به وضوح شناسایی کنید.
ما با حذف منابع آسیبدیده از نسخه بعدی مجموعه، درخواستهای قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 43102 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_as
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_as')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح مجوز منتشر می شوند. ما مالک هیچ یک از متن هایی نیستیم که این داده ها از آن استخراج شده است. ما بستهبندی واقعی این دادهها را تحت مجوز Creative Commons CC0 مجوز میدهیم ("بدون حقوق محفوظ است") http://creativecommons.org/publicdomain/zero/1.0/ تا آنجایی که طبق قانون ممکن است، اینریا تمام حق نسخهبرداری و یا مربوط به آن را لغو کرده است. حقوق همسایگی OSCAR این اثر از: فرانسه منتشر شده است.
اگر فکر می کنید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا بازتولید شود، لطفاً:
- با اطلاعات تماس دقیق مانند آدرس، شماره تلفن یا آدرس ایمیلی که می توان با شما تماس گرفت، به وضوح خود را شناسایی کنید.
- اثر دارای حق نسخه برداری که ادعا می شود نقض شده است را به وضوح شناسایی کنید.
- مطالبی را که ادعا میشود نقضکننده حقوق است و اطلاعاتی که بهطور معقولی برای یافتن مطالب به ما امکان میدهد، به وضوح شناسایی کنید.
ما با حذف منابع آسیبدیده از نسخه بعدی مجموعه، درخواستهای قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 9212 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_azb
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_azb')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح مجوز منتشر می شوند. ما مالک هیچ یک از متن هایی نیستیم که این داده ها از آن استخراج شده است. ما بستهبندی واقعی این دادهها را تحت مجوز Creative Commons CC0 مجوز میدهیم ("بدون حقوق محفوظ است") http://creativecommons.org/publicdomain/zero/1.0/ تا آنجایی که طبق قانون ممکن است، اینریا تمام حق نسخهبرداری و یا مربوط به آن را لغو کرده است. حقوق همسایگی OSCAR این اثر از: فرانسه منتشر شده است.
اگر فکر می کنید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا بازتولید شود، لطفاً:
- با اطلاعات تماس دقیق مانند آدرس، شماره تلفن یا آدرس ایمیلی که می توان با شما تماس گرفت، به وضوح خود را شناسایی کنید.
- اثر دارای حق نسخه برداری که ادعا می شود نقض شده است را به وضوح شناسایی کنید.
- مطالبی را که ادعا میشود نقضکننده حقوق است و اطلاعاتی که بهطور معقولی برای یافتن مطالب به ما امکان میدهد، به وضوح شناسایی کنید.
ما با حذف منابع آسیبدیده از نسخه بعدی مجموعه، درخواستهای قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 9985 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_be
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_be')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح مجوز منتشر می شوند. ما مالک هیچ یک از متن هایی نیستیم که این داده ها از آن استخراج شده است. ما بستهبندی واقعی این دادهها را تحت مجوز Creative Commons CC0 مجوز میدهیم ("بدون حقوق محفوظ است") http://creativecommons.org/publicdomain/zero/1.0/ تا آنجایی که طبق قانون ممکن است، اینریا تمام حق نسخهبرداری و یا مربوط به آن را لغو کرده است. حقوق همسایگی OSCAR این اثر از: فرانسه منتشر شده است.
اگر فکر می کنید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا بازتولید شود، لطفاً:
- با اطلاعات تماس دقیق مانند آدرس، شماره تلفن یا آدرس ایمیلی که می توان با شما تماس گرفت، به وضوح خود را شناسایی کنید.
- اثر دارای حق نسخه برداری که ادعا می شود نقض شده است را به وضوح شناسایی کنید.
- مطالبی را که ادعا میشود نقضکننده حقوق است و اطلاعاتی که بهطور معقولی برای یافتن مطالب به ما امکان میدهد، به وضوح شناسایی کنید.
ما با حذف منابع آسیبدیده از نسخه بعدی مجموعه، درخواستهای قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 307405 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bo
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bo')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح مجوز منتشر می شوند. ما مالک هیچ یک از متن هایی نیستیم که این داده ها از آن استخراج شده است. ما بستهبندی واقعی این دادهها را تحت مجوز Creative Commons CC0 مجوز میدهیم ("بدون حقوق محفوظ است") http://creativecommons.org/publicdomain/zero/1.0/ تا آنجایی که طبق قانون ممکن است، اینریا تمام حق نسخهبرداری و یا مربوط به آن را لغو کرده است. حقوق همسایگی OSCAR این اثر از: فرانسه منتشر شده است.
اگر فکر می کنید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا بازتولید شود، لطفاً:
- با اطلاعات تماس دقیق مانند آدرس، شماره تلفن یا آدرس ایمیلی که می توان با شما تماس گرفت، به وضوح خود را شناسایی کنید.
- اثر دارای حق نسخه برداری که ادعا می شود نقض شده است را به وضوح شناسایی کنید.
- مطالبی را که ادعا میشود نقضکننده حقوق است و اطلاعاتی که بهطور معقولی برای یافتن مطالب به ما امکان میدهد، به وضوح شناسایی کنید.
ما با حذف منابع آسیبدیده از نسخه بعدی مجموعه، درخواستهای قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 15762 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bxr
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bxr')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح مجوز منتشر می شوند. ما مالک هیچ یک از متن هایی نیستیم که این داده ها از آن استخراج شده است. ما بستهبندی واقعی این دادهها را تحت مجوز Creative Commons CC0 مجوز میدهیم ("بدون حقوق محفوظ است") http://creativecommons.org/publicdomain/zero/1.0/ تا آنجایی که طبق قانون ممکن است، اینریا تمام حق نسخهبرداری و یا مربوط به آن را لغو کرده است. حقوق همسایگی OSCAR این اثر از: فرانسه منتشر شده است.
اگر فکر می کنید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا بازتولید شود، لطفاً:
- با اطلاعات تماس دقیق مانند آدرس، شماره تلفن یا آدرس ایمیلی که می توان با شما تماس گرفت، به وضوح خود را شناسایی کنید.
- اثر دارای حق نسخه برداری که ادعا می شود نقض شده است را به وضوح شناسایی کنید.
- مطالبی را که ادعا میشود نقضکننده حقوق است و اطلاعاتی که بهطور معقولی برای یافتن مطالب به ما امکان میدهد، به وضوح شناسایی کنید.
ما با حذف منابع آسیبدیده از نسخه بعدی مجموعه، درخواستهای قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 36 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ceb
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ceb')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح مجوز منتشر می شوند. ما مالک هیچ یک از متن هایی نیستیم که این داده ها از آن استخراج شده است. ما بستهبندی واقعی این دادهها را تحت مجوز Creative Commons CC0 مجوز میدهیم ("بدون حقوق محفوظ است") http://creativecommons.org/publicdomain/zero/1.0/ تا آنجایی که طبق قانون ممکن است، اینریا تمام حق نسخهبرداری و یا مربوط به آن را لغو کرده است. حقوق همسایگی OSCAR این اثر از: فرانسه منتشر شده است.
اگر فکر می کنید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا بازتولید شود، لطفاً:
- با اطلاعات تماس دقیق مانند آدرس، شماره تلفن یا آدرس ایمیلی که می توان با شما تماس گرفت، به وضوح خود را شناسایی کنید.
- اثر دارای حق نسخه برداری که ادعا می شود نقض شده است را به وضوح شناسایی کنید.
- مطالبی را که ادعا میشود نقضکننده حقوق است و اطلاعاتی که بهطور معقولی برای یافتن مطالب به ما امکان میدهد، به وضوح شناسایی کنید.
ما با حذف منابع آسیبدیده از نسخه بعدی مجموعه، درخواستهای قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 26145 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_az
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_az')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح مجوز منتشر می شوند. ما مالک هیچ یک از متن هایی نیستیم که این داده ها از آن استخراج شده است. ما بستهبندی واقعی این دادهها را تحت مجوز Creative Commons CC0 مجوز میدهیم ("بدون حقوق محفوظ است") http://creativecommons.org/publicdomain/zero/1.0/ تا آنجایی که طبق قانون ممکن است، اینریا تمام حق نسخهبرداری و یا مربوط به آن را لغو کرده است. حقوق همسایگی OSCAR این اثر از: فرانسه منتشر شده است.
اگر فکر می کنید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا بازتولید شود، لطفاً:
- با اطلاعات تماس دقیق مانند آدرس، شماره تلفن یا آدرس ایمیلی که می توان با شما تماس گرفت، به وضوح خود را شناسایی کنید.
- اثر دارای حق نسخه برداری که ادعا می شود نقض شده است را به وضوح شناسایی کنید.
- مطالبی را که ادعا میشود نقضکننده حقوق است و اطلاعاتی که بهطور معقولی برای یافتن مطالب به ما امکان میدهد، به وضوح شناسایی کنید.
ما با حذف منابع آسیبدیده از نسخه بعدی مجموعه، درخواستهای قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 626796 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bcl
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bcl')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح مجوز منتشر می شوند. ما مالک هیچ یک از متن هایی نیستیم که این داده ها از آن استخراج شده است. ما بستهبندی واقعی این دادهها را تحت مجوز Creative Commons CC0 مجوز میدهیم ("بدون حقوق محفوظ است") http://creativecommons.org/publicdomain/zero/1.0/ تا آنجایی که طبق قانون ممکن است، اینریا تمام حق نسخهبرداری و یا مربوط به آن را لغو کرده است. حقوق همسایگی OSCAR این اثر از: فرانسه منتشر شده است.
اگر فکر می کنید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا بازتولید شود، لطفاً:
- با اطلاعات تماس دقیق مانند آدرس، شماره تلفن یا آدرس ایمیلی که می توان با شما تماس گرفت، به وضوح خود را شناسایی کنید.
- اثر دارای حق نسخه برداری که ادعا می شود نقض شده است را به وضوح شناسایی کنید.
- مطالبی را که ادعا میشود نقضکننده حقوق است و اطلاعاتی که بهطور معقولی برای یافتن مطالب به ما امکان میدهد، به وضوح شناسایی کنید.
ما با حذف منابع آسیبدیده از نسخه بعدی مجموعه، درخواستهای قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 1 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_cy
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cy')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح مجوز منتشر می شوند. ما مالک هیچ یک از متن هایی نیستیم که این داده ها از آن استخراج شده است. ما بستهبندی واقعی این دادهها را تحت مجوز Creative Commons CC0 مجوز میدهیم ("بدون حقوق محفوظ است") http://creativecommons.org/publicdomain/zero/1.0/ تا آنجایی که طبق قانون ممکن است، اینریا تمام حق نسخهبرداری و یا مربوط به آن را لغو کرده است. حقوق همسایگی OSCAR این اثر از: فرانسه منتشر شده است.
اگر فکر می کنید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا بازتولید شود، لطفاً:
- با اطلاعات تماس دقیق مانند آدرس، شماره تلفن یا آدرس ایمیلی که می توان با شما تماس گرفت، به وضوح خود را شناسایی کنید.
- اثر دارای حق نسخه برداری که ادعا می شود نقض شده است را به وضوح شناسایی کنید.
- مطالبی را که ادعا میشود نقضکننده حقوق است و اطلاعاتی که بهطور معقولی برای یافتن مطالب به ما امکان میدهد، به وضوح شناسایی کنید.
ما با حذف منابع آسیبدیده از نسخه بعدی مجموعه، درخواستهای قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 98225 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_dsb
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dsb')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح مجوز منتشر می شوند. ما مالک هیچ یک از متن هایی نیستیم که این داده ها از آن استخراج شده است. ما بستهبندی واقعی این دادهها را تحت مجوز Creative Commons CC0 مجوز میدهیم ("بدون حقوق محفوظ است") http://creativecommons.org/publicdomain/zero/1.0/ تا آنجایی که طبق قانون ممکن است، اینریا تمام حق نسخهبرداری و یا مربوط به آن را لغو کرده است. حقوق همسایگی OSCAR این اثر از: فرانسه منتشر شده است.
اگر فکر می کنید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا بازتولید شود، لطفاً:
- با اطلاعات تماس دقیق مانند آدرس، شماره تلفن یا آدرس ایمیلی که می توان با شما تماس گرفت، به وضوح خود را شناسایی کنید.
- اثر دارای حق نسخه برداری که ادعا می شود نقض شده است را به وضوح شناسایی کنید.
- مطالبی را که ادعا میشود نقضکننده حقوق است و اطلاعاتی که بهطور معقولی برای یافتن مطالب به ما امکان میدهد، به وضوح شناسایی کنید.
ما با حذف منابع آسیبدیده از نسخه بعدی مجموعه، درخواستهای قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 37 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bn
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bn')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح مجوز منتشر می شوند. ما مالک هیچ یک از متن هایی نیستیم که این داده ها از آن استخراج شده است. ما بستهبندی واقعی این دادهها را تحت مجوز Creative Commons CC0 مجوز میدهیم ("بدون حقوق محفوظ است") http://creativecommons.org/publicdomain/zero/1.0/ تا آنجایی که طبق قانون ممکن است، اینریا تمام حق نسخهبرداری و یا مربوط به آن را لغو کرده است. حقوق همسایگی OSCAR این اثر از: فرانسه منتشر شده است.
اگر فکر می کنید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا بازتولید شود، لطفاً:
- با اطلاعات تماس دقیق مانند آدرس، شماره تلفن یا آدرس ایمیلی که می توان با شما تماس گرفت، به وضوح خود را شناسایی کنید.
- اثر دارای حق نسخه برداری که ادعا می شود نقض شده است را به وضوح شناسایی کنید.
- مطالبی را که ادعا میشود نقضکننده حقوق است و اطلاعاتی که بهطور معقولی برای یافتن مطالب به ما امکان میدهد، به وضوح شناسایی کنید.
ما با حذف منابع آسیبدیده از نسخه بعدی مجموعه، درخواستهای قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 1114481 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bs
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bs')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح مجوز منتشر می شوند. ما مالک هیچ یک از متن هایی نیستیم که این داده ها از آن استخراج شده است. ما بستهبندی واقعی این دادهها را تحت مجوز Creative Commons CC0 مجوز میدهیم ("بدون حقوق محفوظ است") http://creativecommons.org/publicdomain/zero/1.0/ تا آنجایی که طبق قانون ممکن است، اینریا تمام حق نسخهبرداری و یا مربوط به آن را لغو کرده است. حقوق همسایگی OSCAR این اثر از: فرانسه منتشر شده است.
اگر فکر می کنید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا بازتولید شود، لطفاً:
- با اطلاعات تماس دقیق مانند آدرس، شماره تلفن یا آدرس ایمیلی که می توان با شما تماس گرفت، به وضوح خود را شناسایی کنید.
- اثر دارای حق نسخه برداری که ادعا می شود نقض شده است را به وضوح شناسایی کنید.
- مطالبی را که ادعا میشود نقضکننده حقوق است و اطلاعاتی که بهطور معقولی برای یافتن مطالب به ما امکان میدهد، به وضوح شناسایی کنید.
ما با حذف منابع آسیبدیده از نسخه بعدی مجموعه، درخواستهای قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 702 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ce
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ce')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح مجوز منتشر می شوند. ما مالک هیچ یک از متن هایی نیستیم که این داده ها از آن استخراج شده است. ما بستهبندی واقعی این دادهها را تحت مجوز Creative Commons CC0 مجوز میدهیم ("بدون حقوق محفوظ است") http://creativecommons.org/publicdomain/zero/1.0/ تا آنجایی که طبق قانون ممکن است، اینریا تمام حق نسخهبرداری و یا مربوط به آن را لغو کرده است. حقوق همسایگی OSCAR این اثر از: فرانسه منتشر شده است.
اگر فکر می کنید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا بازتولید شود، لطفاً:
- با اطلاعات تماس دقیق مانند آدرس، شماره تلفن یا آدرس ایمیلی که می توان با شما تماس گرفت، به وضوح خود را شناسایی کنید.
- اثر دارای حق نسخه برداری که ادعا می شود نقض شده است را به وضوح شناسایی کنید.
- مطالبی را که ادعا میشود نقضکننده حقوق است و اطلاعاتی که بهطور معقولی برای یافتن مطالب به ما امکان میدهد، به وضوح شناسایی کنید.
ما با حذف منابع آسیبدیده از نسخه بعدی مجموعه، درخواستهای قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 2984 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_cv
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cv')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح مجوز منتشر می شوند. ما مالک هیچ یک از متن هایی نیستیم که این داده ها از آن استخراج شده است. ما بستهبندی واقعی این دادهها را تحت مجوز Creative Commons CC0 مجوز میدهیم ("بدون حقوق محفوظ است") http://creativecommons.org/publicdomain/zero/1.0/ تا آنجایی که طبق قانون ممکن است، اینریا تمام حق نسخهبرداری و یا مربوط به آن را لغو کرده است. حقوق همسایگی OSCAR این اثر از: فرانسه منتشر شده است.
اگر فکر می کنید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا بازتولید شود، لطفاً:
- با اطلاعات تماس دقیق مانند آدرس، شماره تلفن یا آدرس ایمیلی که می توان با شما تماس گرفت، به وضوح خود را شناسایی کنید.
- اثر دارای حق نسخه برداری که ادعا می شود نقض شده است را به وضوح شناسایی کنید.
- مطالبی را که ادعا میشود نقضکننده حقوق است و اطلاعاتی که بهطور معقولی برای یافتن مطالب به ما امکان میدهد، به وضوح شناسایی کنید.
با حذف منابع آسیبدیده از نسخه بعدی مجموعه، از درخواستهای قانونی پیروی میکنیم.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 10130 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_diq
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_diq')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این دادهها تحت این طرح مجوز منتشر میشوند. ما مالک هیچ یک از متنهایی نیستیم که این دادهها از آن استخراج شدهاند. ما بستهبندی واقعی این دادهها را تحت مجوز Creative Commons CC0 مجوز میدهیم ("بدون حقوق محفوظ است") http://creativecommons.org/publicdomain/zero/1.0/ تا آنجایی که طبق قانون ممکن است، اینریا تمام حق نسخهبرداری و یا مربوط به آن را لغو کرده است. حقوق همسایگی OSCAR این اثر از: فرانسه منتشر شده است.
اگر فکر می کنید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا بازتولید شود، لطفاً:
- با اطلاعات تماس دقیق مانند آدرس، شماره تلفن یا آدرس ایمیلی که می توان با شما تماس گرفت، به وضوح خود را شناسایی کنید.
- اثر دارای حق نسخه برداری که ادعا می شود نقض شده است را به وضوح شناسایی کنید.
- مطالبی را که ادعا میشود نقضکننده حقوق است و اطلاعاتی که بهطور معقولی برای یافتن مطالب به ما امکان میدهد، به وضوح شناسایی کنید.
ما با حذف منابع آسیبدیده از نسخه بعدی مجموعه، درخواستهای قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 1 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_eml
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eml')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح مجوز منتشر می شوند. ما مالک هیچ یک از متن هایی نیستیم که این داده ها از آن استخراج شده است. ما بستهبندی واقعی این دادهها را تحت مجوز Creative Commons CC0 مجوز میدهیم ("بدون حقوق محفوظ است") http://creativecommons.org/publicdomain/zero/1.0/ تا آنجایی که طبق قانون ممکن است، اینریا تمام حق نسخهبرداری و یا مربوط به آن را لغو کرده است. حقوق همسایگی OSCAR این اثر از: فرانسه منتشر شده است.
اگر فکر می کنید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا بازتولید شود، لطفاً:
- با اطلاعات تماس دقیق مانند آدرس، شماره تلفن یا آدرس ایمیلی که می توان با شما تماس گرفت، به وضوح خود را شناسایی کنید.
- اثر دارای حق نسخه برداری که ادعا می شود نقض شده است را به وضوح شناسایی کنید.
- مطالبی را که ادعا میشود نقضکننده حقوق است و اطلاعاتی که بهطور معقولی برای یافتن مطالب به ما امکان میدهد، به وضوح شناسایی کنید.
ما با حذف منابع آسیبدیده از نسخه بعدی مجموعه، درخواستهای قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 80 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_et
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_et')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح مجوز منتشر می شوند. ما مالک هیچ یک از متن هایی نیستیم که این داده ها از آن استخراج شده است. ما بستهبندی واقعی این دادهها را تحت مجوز Creative Commons CC0 مجوز میدهیم ("بدون حقوق محفوظ است") http://creativecommons.org/publicdomain/zero/1.0/ تا آنجایی که طبق قانون ممکن است، اینریا تمام حق نسخهبرداری و یا مربوط به آن را لغو کرده است. حقوق همسایگی OSCAR این اثر از: فرانسه منتشر شده است.
اگر فکر می کنید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا بازتولید شود، لطفاً:
- با اطلاعات تماس دقیق مانند آدرس، شماره تلفن یا آدرس ایمیلی که می توان با شما تماس گرفت، به وضوح خود را شناسایی کنید.
- اثر دارای حق نسخه برداری که ادعا می شود نقض شده است را به وضوح شناسایی کنید.
- مطالبی را که ادعا میشود نقضکننده حقوق است و اطلاعاتی که بهطور معقولی برای یافتن مطالب به ما امکان میدهد، به وضوح شناسایی کنید.
با حذف منابع آسیبدیده از نسخه بعدی مجموعه، از درخواستهای قانونی پیروی میکنیم.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 1172041 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bg
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bg')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح مجوز منتشر می شوند. ما مالک هیچ یک از متن هایی نیستیم که این داده ها از آن استخراج شده است. ما بستهبندی واقعی این دادهها را تحت مجوز Creative Commons CC0 مجوز میدهیم ("بدون حقوق محفوظ است") http://creativecommons.org/publicdomain/zero/1.0/ تا آنجایی که طبق قانون ممکن است، اینریا تمام حق نسخهبرداری و یا مربوط به آن را لغو کرده است. حقوق همسایگی OSCAR این اثر از: فرانسه منتشر شده است.
اگر فکر می کنید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا بازتولید شود، لطفاً:
- با اطلاعات تماس دقیق مانند آدرس، شماره تلفن یا آدرس ایمیلی که می توان با شما تماس گرفت، به وضوح خود را شناسایی کنید.
- اثر دارای حق نسخه برداری که ادعا می شود نقض شده است را به وضوح شناسایی کنید.
- مطالبی را که ادعا میشود نقضکننده حقوق است و اطلاعاتی که بهطور معقولی برای یافتن مطالب به ما امکان میدهد، به وضوح شناسایی کنید.
ما با حذف منابع آسیبدیده از نسخه بعدی مجموعه، درخواستهای قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 3398679 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bpy
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bpy')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح مجوز منتشر می شوند. ما مالک هیچ یک از متن هایی نیستیم که این داده ها از آن استخراج شده است. ما بستهبندی واقعی این دادهها را تحت مجوز Creative Commons CC0 مجوز میدهیم ("بدون حقوق محفوظ است") http://creativecommons.org/publicdomain/zero/1.0/ تا آنجایی که طبق قانون ممکن است، اینریا تمام حق نسخهبرداری و یا مربوط به آن را لغو کرده است. حقوق همسایگی OSCAR این اثر از: فرانسه منتشر شده است.
اگر فکر می کنید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا بازتولید شود، لطفاً:
- با اطلاعات تماس دقیق مانند آدرس، شماره تلفن یا آدرس ایمیلی که می توان با شما تماس گرفت، به وضوح خود را شناسایی کنید.
- اثر دارای حق نسخه برداری که ادعا می شود نقض شده است را به وضوح شناسایی کنید.
- مطالبی را که ادعا میشود نقضکننده حقوق است و اطلاعاتی که بهطور معقولی برای یافتن مطالب به ما امکان میدهد، به وضوح شناسایی کنید.
ما با حذف منابع آسیبدیده از نسخه بعدی مجموعه، درخواستهای قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 1770 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ca
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ca')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح مجوز منتشر می شوند. ما مالک هیچ یک از متن هایی نیستیم که این داده ها از آن استخراج شده است. ما بستهبندی واقعی این دادهها را تحت مجوز Creative Commons CC0 مجوز میدهیم ("بدون حقوق محفوظ است") http://creativecommons.org/publicdomain/zero/1.0/ تا آنجایی که طبق قانون ممکن است، اینریا تمام حق نسخهبرداری و یا مربوط به آن را لغو کرده است. حقوق همسایگی OSCAR این اثر از: فرانسه منتشر شده است.
اگر فکر می کنید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا بازتولید شود، لطفاً:
- با اطلاعات تماس دقیق مانند آدرس، شماره تلفن یا آدرس ایمیلی که می توان با شما تماس گرفت، به وضوح خود را شناسایی کنید.
- اثر دارای حق نسخه برداری که ادعا می شود نقض شده است را به وضوح شناسایی کنید.
- مطالبی را که ادعا میشود نقضکننده حقوق است و اطلاعاتی که بهطور معقولی برای یافتن مطالب به ما امکان میدهد، به وضوح شناسایی کنید.
ما با حذف منابع آسیبدیده از نسخه بعدی مجموعه، درخواستهای قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 2458067 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ckb
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ckb')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح مجوز منتشر می شوند. ما مالک هیچ یک از متن هایی نیستیم که این داده ها از آن استخراج شده است. ما بستهبندی واقعی این دادهها را تحت مجوز Creative Commons CC0 مجوز میدهیم ("بدون حقوق محفوظ است") http://creativecommons.org/publicdomain/zero/1.0/ تا آنجایی که طبق قانون ممکن است، اینریا تمام حق نسخهبرداری و یا مربوط به آن را لغو کرده است. حقوق همسایگی OSCAR این اثر از: فرانسه منتشر شده است.
اگر فکر می کنید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا بازتولید شود، لطفاً:
- با اطلاعات تماس دقیق مانند آدرس، شماره تلفن یا آدرس ایمیلی که می توان با شما تماس گرفت، به وضوح خود را شناسایی کنید.
- اثر دارای حق نسخه برداری که ادعا می شود نقض شده است را به وضوح شناسایی کنید.
- مطالبی را که ادعا میشود نقضکننده حقوق است و اطلاعاتی که بهطور معقولی برای یافتن مطالب به ما امکان میدهد، به وضوح شناسایی کنید.
ما با حذف منابع آسیبدیده از نسخه بعدی مجموعه، درخواستهای قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 68210 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ar
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ar')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح مجوز منتشر می شوند. ما مالک هیچ یک از متن هایی نیستیم که این داده ها از آن استخراج شده است. ما بستهبندی واقعی این دادهها را تحت مجوز Creative Commons CC0 مجوز میدهیم ("بدون حقوق محفوظ است") http://creativecommons.org/publicdomain/zero/1.0/ تا آنجایی که طبق قانون ممکن است، اینریا تمام حق نسخهبرداری و یا مربوط به آن را لغو کرده است. حقوق همسایگی OSCAR این اثر از: فرانسه منتشر شده است.
اگر فکر می کنید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا بازتولید شود، لطفاً:
- با اطلاعات تماس دقیق مانند آدرس، شماره تلفن یا آدرس ایمیلی که می توان با شما تماس گرفت، به وضوح خود را شناسایی کنید.
- اثر دارای حق نسخه برداری که ادعا می شود نقض شده است را به وضوح شناسایی کنید.
- مطالبی را که ادعا میشود نقضکننده حقوق است و اطلاعاتی که بهطور معقولی برای یافتن مطالب به ما امکان میدهد، به وضوح شناسایی کنید.
ما با حذف منابع آسیبدیده از نسخه بعدی مجموعه، درخواستهای قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 9006977 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_av
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_av')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح مجوز منتشر می شوند. ما مالک هیچ یک از متن هایی نیستیم که این داده ها از آن استخراج شده است. ما بستهبندی واقعی این دادهها را تحت مجوز Creative Commons CC0 مجوز میدهیم ("بدون حقوق محفوظ است") http://creativecommons.org/publicdomain/zero/1.0/ تا آنجایی که طبق قانون ممکن است، اینریا تمام حق نسخهبرداری و یا مربوط به آن را لغو کرده است. حقوق همسایگی OSCAR این اثر از: فرانسه منتشر شده است.
اگر فکر می کنید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا بازتولید شود، لطفاً:
- با اطلاعات تماس دقیق مانند آدرس، شماره تلفن یا آدرس ایمیلی که می توان با شما تماس گرفت، به وضوح خود را شناسایی کنید.
- اثر دارای حق نسخه برداری که ادعا می شود نقض شده است را به وضوح شناسایی کنید.
- به روشنی مطالبی را که ادعا می شود نقض و اطلاعات کافی است ، مشخص کنید تا به ما اجازه دهد تا مواد را پیدا کنیم.
ما با از بین بردن منابع آسیب دیده از انتشار بعدی Corpus ، درخواست های قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 360 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bar
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bar')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح صدور مجوز منتشر می شوند ، ما هیچ یک از متنی را که این داده ها از آن استخراج شده است ، نداریم. ما بسته بندی واقعی این داده ها را تحت مجوز Creative Commons CC0 ("هیچ حقوقی محفوظ") مجوز می دهیم . حقوق همسایه اسکار این اثر از: فرانسه منتشر شده است.
اگر در نظر بگیرید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا تکثیر شود ، لطفا:
- به طور واضح خود را با داده های تماس دقیق مانند آدرس ، شماره تلفن یا آدرس ایمیل که در آن می توانید با شما تماس بگیرید ، شناسایی کنید.
- به وضوح کار دارای حق چاپ را که ادعا می شود نقض شده است ، مشخص کنید.
- به روشنی مطالبی را که ادعا می شود نقض و اطلاعات کافی است ، مشخص کنید تا به ما اجازه دهد تا مواد را پیدا کنیم.
ما با از بین بردن منابع آسیب دیده از انتشار بعدی Corpus ، درخواست های قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 4 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bh
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bh')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح صدور مجوز منتشر می شوند ، ما هیچ یک از متنی را که این داده ها از آن استخراج شده است ، نداریم. ما بسته بندی واقعی این داده ها را تحت مجوز Creative Commons CC0 ("هیچ حقوقی محفوظ") مجوز می دهیم . حقوق همسایه اسکار این اثر از: فرانسه منتشر شده است.
اگر در نظر بگیرید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا تکثیر شود ، لطفا:
- به طور واضح خود را با داده های تماس دقیق مانند آدرس ، شماره تلفن یا آدرس ایمیل که در آن می توانید با شما تماس بگیرید ، شناسایی کنید.
- به وضوح کار دارای حق چاپ را که ادعا می شود نقض شده است ، مشخص کنید.
- به روشنی مطالبی را که ادعا می شود نقض و اطلاعات کافی است ، مشخص کنید تا به ما اجازه دهد تا مواد را پیدا کنیم.
ما با از بین بردن منابع آسیب دیده از انتشار بعدی Corpus ، درخواست های قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 82 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_br
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_br')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح صدور مجوز منتشر می شوند ، ما هیچ یک از متنی را که این داده ها از آن استخراج شده است ، نداریم. ما بسته بندی واقعی این داده ها را تحت مجوز Creative Commons CC0 ("هیچ حقوقی محفوظ") مجوز می دهیم . حقوق همسایه اسکار این اثر از: فرانسه منتشر شده است.
اگر در نظر بگیرید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا تکثیر شود ، لطفا:
- به طور واضح خود را با داده های تماس دقیق مانند آدرس ، شماره تلفن یا آدرس ایمیل که در آن می توانید با شما تماس بگیرید ، شناسایی کنید.
- به وضوح کار دارای حق چاپ را که ادعا می شود نقض شده است ، مشخص کنید.
- به روشنی مطالبی را که ادعا می شود نقض و اطلاعات کافی است ، مشخص کنید تا به ما اجازه دهد تا مواد را پیدا کنیم.
ما با از بین بردن منابع آسیب دیده از انتشار بعدی Corpus ، درخواست های قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 14724 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplated_cbk
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cbk')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح صدور مجوز منتشر می شوند ، ما هیچ یک از متنی را که این داده ها از آن استخراج شده است ، نداریم. ما بسته بندی واقعی این داده ها را تحت مجوز Creative Commons CC0 ("هیچ حقوقی محفوظ") مجوز می دهیم . حقوق همسایه اسکار این اثر از: فرانسه منتشر شده است.
اگر در نظر بگیرید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا تکثیر شود ، لطفا:
- به طور واضح خود را با داده های تماس دقیق مانند آدرس ، شماره تلفن یا آدرس ایمیل که در آن می توانید با شما تماس بگیرید ، شناسایی کنید.
- به وضوح کار دارای حق چاپ را که ادعا می شود نقض شده است ، مشخص کنید.
- به روشنی مطالبی را که ادعا می شود نقض و اطلاعات کافی است ، مشخص کنید تا به ما اجازه دهد تا مواد را پیدا کنیم.
ما با از بین بردن منابع آسیب دیده از انتشار بعدی Corpus ، درخواست های قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 1 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_da
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_da')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح صدور مجوز منتشر می شوند ، ما هیچ یک از متنی را که این داده ها از آن استخراج شده است ، نداریم. ما بسته بندی واقعی این داده ها را تحت مجوز Creative Commons CC0 ("هیچ حقوقی محفوظ") مجوز می دهیم . حقوق همسایه اسکار این اثر از: فرانسه منتشر شده است.
اگر در نظر بگیرید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا تکثیر شود ، لطفا:
- به طور واضح خود را با داده های تماس دقیق مانند آدرس ، شماره تلفن یا آدرس ایمیل که در آن می توانید با شما تماس بگیرید ، شناسایی کنید.
- به وضوح کار دارای حق چاپ را که ادعا می شود نقض شده است ، مشخص کنید.
- به روشنی مطالبی را که ادعا می شود نقض و اطلاعات کافی است ، مشخص کنید تا به ما اجازه دهد تا مواد را پیدا کنیم.
ما با از بین بردن منابع آسیب دیده از انتشار بعدی Corpus ، درخواست های قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 4771098 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplated_dv
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dv')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح صدور مجوز منتشر می شوند ، ما هیچ یک از متنی را که این داده ها از آن استخراج شده است ، نداریم. ما بسته بندی واقعی این داده ها را تحت مجوز Creative Commons CC0 ("هیچ حقوقی محفوظ") مجوز می دهیم . حقوق همسایه اسکار این اثر از: فرانسه منتشر شده است.
اگر در نظر بگیرید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا تکثیر شود ، لطفا:
- به طور واضح خود را با داده های تماس دقیق مانند آدرس ، شماره تلفن یا آدرس ایمیل که در آن می توانید با شما تماس بگیرید ، شناسایی کنید.
- به وضوح کار دارای حق چاپ را که ادعا می شود نقض شده است ، مشخص کنید.
- به روشنی مطالبی را که ادعا می شود نقض و اطلاعات کافی است ، مشخص کنید تا به ما اجازه دهد تا مواد را پیدا کنیم.
ما با از بین بردن منابع آسیب دیده از انتشار بعدی Corpus ، درخواست های قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 17024 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_eo
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eo')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح صدور مجوز منتشر می شوند ، ما هیچ یک از متنی را که این داده ها از آن استخراج شده است ، نداریم. ما بسته بندی واقعی این داده ها را تحت مجوز Creative Commons CC0 ("هیچ حقوقی محفوظ") مجوز می دهیم . حقوق همسایه اسکار این اثر از: فرانسه منتشر شده است.
اگر در نظر بگیرید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا تکثیر شود ، لطفا:
- به طور واضح خود را با داده های تماس دقیق مانند آدرس ، شماره تلفن یا آدرس ایمیل که در آن می توانید با شما تماس بگیرید ، شناسایی کنید.
- به وضوح کار دارای حق چاپ را که ادعا می شود نقض شده است ، مشخص کنید.
- به روشنی مطالبی را که ادعا می شود نقض و اطلاعات کافی است ، مشخص کنید تا به ما اجازه دهد تا مواد را پیدا کنیم.
ما با از بین بردن منابع آسیب دیده از انتشار بعدی Corpus ، درخواست های قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 84752 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplated_fa
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fa')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح صدور مجوز منتشر می شوند ، ما هیچ یک از متنی را که این داده ها از آن استخراج شده است ، نداریم. ما بسته بندی واقعی این داده ها را تحت مجوز Creative Commons CC0 ("هیچ حقوقی محفوظ") مجوز می دهیم . حقوق همسایه اسکار این اثر از: فرانسه منتشر شده است.
اگر در نظر بگیرید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا تکثیر شود ، لطفا:
- به طور واضح خود را با داده های تماس دقیق مانند آدرس ، شماره تلفن یا آدرس ایمیل که در آن می توانید با شما تماس بگیرید ، شناسایی کنید.
- به وضوح کار دارای حق چاپ را که ادعا می شود نقض شده است ، مشخص کنید.
- به روشنی مطالبی را که ادعا می شود نقض و اطلاعات کافی است ، مشخص کنید تا به ما اجازه دهد تا مواد را پیدا کنیم.
ما با از بین بردن منابع آسیب دیده از انتشار بعدی Corpus ، درخواست های قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 8203495 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_fy
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fy')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح صدور مجوز منتشر می شوند ، ما هیچ یک از متنی را که این داده ها از آن استخراج شده است ، نداریم. ما بسته بندی واقعی این داده ها را تحت مجوز Creative Commons CC0 ("هیچ حقوقی محفوظ") مجوز می دهیم . حقوق همسایه اسکار این اثر از: فرانسه منتشر شده است.
اگر در نظر بگیرید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا تکثیر شود ، لطفا:
- به طور واضح خود را با داده های تماس دقیق مانند آدرس ، شماره تلفن یا آدرس ایمیل که در آن می توانید با شما تماس بگیرید ، شناسایی کنید.
- به وضوح کار دارای حق چاپ را که ادعا می شود نقض شده است ، مشخص کنید.
- به روشنی مطالبی را که ادعا می شود نقض و اطلاعات کافی است ، مشخص کنید تا به ما اجازه دهد تا مواد را پیدا کنیم.
ما با از بین بردن منابع آسیب دیده از انتشار بعدی Corpus ، درخواست های قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 20661 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_gn
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gn')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح صدور مجوز منتشر می شوند ، ما هیچ یک از متنی را که این داده ها از آن استخراج شده است ، نداریم. ما بسته بندی واقعی این داده ها را تحت مجوز Creative Commons CC0 ("هیچ حقوقی محفوظ") مجوز می دهیم . حقوق همسایه اسکار این اثر از: فرانسه منتشر شده است.
اگر در نظر بگیرید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا تکثیر شود ، لطفا:
- به طور واضح خود را با داده های تماس دقیق مانند آدرس ، شماره تلفن یا آدرس ایمیل که در آن می توانید با شما تماس بگیرید ، شناسایی کنید.
- به وضوح کار دارای حق چاپ را که ادعا می شود نقض شده است ، مشخص کنید.
- به روشنی مطالبی را که ادعا می شود نقض و اطلاعات کافی است ، مشخص کنید تا به ما اجازه دهد تا مواد را پیدا کنیم.
ما با از بین بردن منابع آسیب دیده از انتشار بعدی Corpus ، درخواست های قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 68 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_cs
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cs')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح صدور مجوز منتشر می شوند ، ما هیچ یک از متنی را که این داده ها از آن استخراج شده است ، نداریم. ما بسته بندی واقعی این داده ها را تحت مجوز Creative Commons CC0 ("هیچ حقوقی محفوظ") مجوز می دهیم . حقوق همسایه اسکار این اثر از: فرانسه منتشر شده است.
اگر در نظر بگیرید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا تکثیر شود ، لطفا:
- به طور واضح خود را با داده های تماس دقیق مانند آدرس ، شماره تلفن یا آدرس ایمیل که در آن می توانید با شما تماس بگیرید ، شناسایی کنید.
- به وضوح کار دارای حق چاپ را که ادعا می شود نقض شده است ، مشخص کنید.
- به روشنی مطالبی را که ادعا می شود نقض و اطلاعات کافی است ، مشخص کنید تا به ما اجازه دهد تا مواد را پیدا کنیم.
ما با از بین بردن منابع آسیب دیده از انتشار بعدی Corpus ، درخواست های قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 12308039 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_hi
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hi')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح صدور مجوز منتشر می شوند ، ما هیچ یک از متنی را که این داده ها از آن استخراج شده است ، نداریم. ما بسته بندی واقعی این داده ها را تحت مجوز Creative Commons CC0 ("هیچ حقوقی محفوظ") مجوز می دهیم . حقوق همسایه اسکار این اثر از: فرانسه منتشر شده است.
اگر در نظر بگیرید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا تکثیر شود ، لطفا:
- به طور واضح خود را با داده های تماس دقیق مانند آدرس ، شماره تلفن یا آدرس ایمیل که در آن می توانید با شما تماس بگیرید ، شناسایی کنید.
- به وضوح کار دارای حق چاپ را که ادعا می شود نقض شده است ، مشخص کنید.
- به روشنی مطالبی را که ادعا می شود نقض و اطلاعات کافی است ، مشخص کنید تا به ما اجازه دهد تا مواد را پیدا کنیم.
ما با از بین بردن منابع آسیب دیده از انتشار بعدی Corpus ، درخواست های قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 1909387 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplated_hu
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hu')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح صدور مجوز منتشر می شوند ، ما هیچ یک از متنی را که این داده ها از آن استخراج شده است ، نداریم. ما بسته بندی واقعی این داده ها را تحت مجوز Creative Commons CC0 ("هیچ حقوقی محفوظ") مجوز می دهیم . حقوق همسایه اسکار این اثر از: فرانسه منتشر شده است.
اگر در نظر بگیرید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا تکثیر شود ، لطفا:
- به طور واضح خود را با داده های تماس دقیق مانند آدرس ، شماره تلفن یا آدرس ایمیل که در آن می توانید با شما تماس بگیرید ، شناسایی کنید.
- به وضوح کار دارای حق چاپ را که ادعا می شود نقض شده است ، مشخص کنید.
- به روشنی مطالبی را که ادعا می شود نقض و اطلاعات کافی است ، مشخص کنید تا به ما اجازه دهد تا مواد را پیدا کنیم.
ما با از بین بردن منابع آسیب دیده از انتشار بعدی Corpus ، درخواست های قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 6582908 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ie
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ie')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح صدور مجوز منتشر می شوند ، ما هیچ یک از متنی را که این داده ها از آن استخراج شده است ، نداریم. ما بسته بندی واقعی این داده ها را تحت مجوز Creative Commons CC0 ("هیچ حقوقی محفوظ") مجوز می دهیم . حقوق همسایه اسکار این اثر از: فرانسه منتشر شده است.
اگر در نظر بگیرید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا تکثیر شود ، لطفا:
- به طور واضح خود را با داده های تماس دقیق مانند آدرس ، شماره تلفن یا آدرس ایمیل که در آن می توانید با شما تماس بگیرید ، شناسایی کنید.
- به وضوح کار دارای حق چاپ را که ادعا می شود نقض شده است ، مشخص کنید.
- به روشنی مطالبی را که ادعا می شود نقض و اطلاعات کافی است ، مشخص کنید تا به ما اجازه دهد تا مواد را پیدا کنیم.
ما با از بین بردن منابع آسیب دیده از انتشار بعدی Corpus ، درخواست های قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 11 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_fr
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fr')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح صدور مجوز منتشر می شوند ، ما هیچ یک از متنی را که این داده ها از آن استخراج شده است ، نداریم. ما بسته بندی واقعی این داده ها را تحت مجوز Creative Commons CC0 ("هیچ حقوقی محفوظ") مجوز می دهیم . حقوق همسایه اسکار این اثر از: فرانسه منتشر شده است.
اگر در نظر بگیرید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا تکثیر شود ، لطفا:
- به طور واضح خود را با داده های تماس دقیق مانند آدرس ، شماره تلفن یا آدرس ایمیل که در آن می توانید با شما تماس بگیرید ، شناسایی کنید.
- به وضوح کار دارای حق چاپ را که ادعا می شود نقض شده است ، مشخص کنید.
- به روشنی مطالبی را که ادعا می شود نقض و اطلاعات کافی است ، مشخص کنید تا به ما اجازه دهد تا مواد را پیدا کنیم.
ما با از بین بردن منابع آسیب دیده از انتشار بعدی Corpus ، درخواست های قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 59448891 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_gd
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gd')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح صدور مجوز منتشر می شوند ، ما هیچ یک از متنی را که این داده ها از آن استخراج شده است ، نداریم. ما بسته بندی واقعی این داده ها را تحت مجوز Creative Commons CC0 ("هیچ حقوقی محفوظ") مجوز می دهیم . حقوق همسایه اسکار این اثر از: فرانسه منتشر شده است.
اگر در نظر بگیرید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا تکثیر شود ، لطفا:
- به طور واضح خود را با داده های تماس دقیق مانند آدرس ، شماره تلفن یا آدرس ایمیل که در آن می توانید با شما تماس بگیرید ، شناسایی کنید.
- به وضوح کار دارای حق چاپ را که ادعا می شود نقض شده است ، مشخص کنید.
- به روشنی مطالبی را که ادعا می شود نقض و اطلاعات کافی است ، مشخص کنید تا به ما اجازه دهد تا مواد را پیدا کنیم.
ما با از بین بردن منابع آسیب دیده از انتشار بعدی Corpus ، درخواست های قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 3883 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_gu
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gu')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح صدور مجوز منتشر می شوند ، ما هیچ یک از متنی را که این داده ها از آن استخراج شده است ، نداریم. ما بسته بندی واقعی این داده ها را تحت مجوز Creative Commons CC0 ("هیچ حقوقی محفوظ") مجوز می دهیم . حقوق همسایه اسکار این اثر از: فرانسه منتشر شده است.
اگر در نظر بگیرید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا تکثیر شود ، لطفا:
- به طور واضح خود را با داده های تماس دقیق مانند آدرس ، شماره تلفن یا آدرس ایمیل که در آن می توانید با شما تماس بگیرید ، شناسایی کنید.
- به وضوح کار دارای حق چاپ را که ادعا می شود نقض شده است ، مشخص کنید.
- به روشنی مطالبی را که ادعا می شود نقض و اطلاعات کافی است ، مشخص کنید تا به ما اجازه دهد تا مواد را پیدا کنیم.
ما با از بین بردن منابع آسیب دیده از انتشار بعدی Corpus ، درخواست های قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 169834 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplated_hsb
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hsb')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح صدور مجوز منتشر می شوند ، ما هیچ یک از متنی را که این داده ها از آن استخراج شده است ، نداریم. ما بسته بندی واقعی این داده ها را تحت مجوز Creative Commons CC0 ("هیچ حقوقی محفوظ") مجوز می دهیم . حقوق همسایه اسکار این اثر از: فرانسه منتشر شده است.
اگر در نظر بگیرید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا تکثیر شود ، لطفا:
- به طور واضح خود را با داده های تماس دقیق مانند آدرس ، شماره تلفن یا آدرس ایمیل که در آن می توانید با شما تماس بگیرید ، شناسایی کنید.
- به وضوح کار دارای حق چاپ را که ادعا می شود نقض شده است ، مشخص کنید.
- به روشنی مطالبی را که ادعا می شود نقض و اطلاعات کافی است ، مشخص کنید تا به ما اجازه دهد تا مواد را پیدا کنیم.
ما با از بین بردن منابع آسیب دیده از انتشار بعدی Corpus ، درخواست های قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 3084 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ia
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ia')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح صدور مجوز منتشر می شوند ، ما هیچ یک از متنی را که این داده ها از آن استخراج شده است ، نداریم. ما بسته بندی واقعی این داده ها را تحت مجوز Creative Commons CC0 ("هیچ حقوقی محفوظ") مجوز می دهیم . حقوق همسایه اسکار این اثر از: فرانسه منتشر شده است.
اگر در نظر بگیرید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا تکثیر شود ، لطفا:
- به طور واضح خود را با داده های تماس دقیق مانند آدرس ، شماره تلفن یا آدرس ایمیل که در آن می توانید با شما تماس بگیرید ، شناسایی کنید.
- به وضوح کار دارای حق چاپ را که ادعا می شود نقض شده است ، مشخص کنید.
- به روشنی مطالبی را که ادعا می شود نقض و اطلاعات کافی است ، مشخص کنید تا به ما اجازه دهد تا مواد را پیدا کنیم.
ما با از بین بردن منابع آسیب دیده از انتشار بعدی Corpus ، درخواست های قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 529 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_io
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_io')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح صدور مجوز منتشر می شوند ، ما هیچ یک از متنی را که این داده ها از آن استخراج شده است ، نداریم. ما بسته بندی واقعی این داده ها را تحت مجوز Creative Commons CC0 ("هیچ حقوقی محفوظ") مجوز می دهیم . حقوق همسایه اسکار این اثر از: فرانسه منتشر شده است.
اگر در نظر بگیرید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا تکثیر شود ، لطفا:
- به طور واضح خود را با داده های تماس دقیق مانند آدرس ، شماره تلفن یا آدرس ایمیل که در آن می توانید با شما تماس بگیرید ، شناسایی کنید.
- به وضوح کار دارای حق چاپ را که ادعا می شود نقض شده است ، مشخص کنید.
- به روشنی مطالبی را که ادعا می شود نقض و اطلاعات کافی است ، مشخص کنید تا به ما اجازه دهد تا مواد را پیدا کنیم.
ما با از بین بردن منابع آسیب دیده از انتشار بعدی Corpus ، درخواست های قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 617 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplated_jbo
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_jbo')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح صدور مجوز منتشر می شوند ، ما هیچ یک از متنی را که این داده ها از آن استخراج شده است ، نداریم. ما بسته بندی واقعی این داده ها را تحت مجوز Creative Commons CC0 ("هیچ حقوقی محفوظ") مجوز می دهیم . حقوق همسایه اسکار این اثر از: فرانسه منتشر شده است.
اگر در نظر بگیرید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا تکثیر شود ، لطفا:
- به طور واضح خود را با داده های تماس دقیق مانند آدرس ، شماره تلفن یا آدرس ایمیل که در آن می توانید با شما تماس بگیرید ، شناسایی کنید.
- به وضوح کار دارای حق چاپ را که ادعا می شود نقض شده است ، مشخص کنید.
- به روشنی مطالبی را که ادعا می شود نقض و اطلاعات کافی است ، مشخص کنید تا به ما اجازه دهد تا مواد را پیدا کنیم.
ما با از بین بردن منابع آسیب دیده از انتشار بعدی Corpus ، درخواست های قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 617 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_km
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_km')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح صدور مجوز منتشر می شوند ، ما هیچ یک از متنی را که این داده ها از آن استخراج شده است ، نداریم. ما بسته بندی واقعی این داده ها را تحت مجوز Creative Commons CC0 ("هیچ حقوقی محفوظ") مجوز می دهیم . حقوق همسایه اسکار این اثر از: فرانسه منتشر شده است.
اگر در نظر بگیرید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا تکثیر شود ، لطفا:
- به طور واضح خود را با داده های تماس دقیق مانند آدرس ، شماره تلفن یا آدرس ایمیل که در آن می توانید با شما تماس بگیرید ، شناسایی کنید.
- به وضوح کار دارای حق چاپ را که ادعا می شود نقض شده است ، مشخص کنید.
- به روشنی مطالبی را که ادعا می شود نقض و اطلاعات کافی است ، مشخص کنید تا به ما اجازه دهد تا مواد را پیدا کنیم.
ما با از بین بردن منابع آسیب دیده از انتشار بعدی Corpus ، درخواست های قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 108346 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ku
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ku')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح صدور مجوز منتشر می شوند ، ما هیچ یک از متنی را که این داده ها از آن استخراج شده است ، نداریم. ما بسته بندی واقعی این داده ها را تحت مجوز Creative Commons CC0 ("هیچ حقوقی محفوظ") مجوز می دهیم . حقوق همسایه اسکار این اثر از: فرانسه منتشر شده است.
اگر در نظر بگیرید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا تکثیر شود ، لطفا:
- به طور واضح خود را با داده های تماس دقیق مانند آدرس ، شماره تلفن یا آدرس ایمیل که در آن می توانید با شما تماس بگیرید ، شناسایی کنید.
- به وضوح کار دارای حق چاپ را که ادعا می شود نقض شده است ، مشخص کنید.
- به روشنی مطالبی را که ادعا می شود نقض و اطلاعات کافی است ، مشخص کنید تا به ما اجازه دهد تا مواد را پیدا کنیم.
ما با از بین بردن منابع آسیب دیده از انتشار بعدی Corpus ، درخواست های قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 29054 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplated_la
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_la')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح صدور مجوز منتشر می شوند ، ما هیچ یک از متنی را که این داده ها از آن استخراج شده است ، نداریم. ما بسته بندی واقعی این داده ها را تحت مجوز Creative Commons CC0 ("هیچ حقوقی محفوظ") مجوز می دهیم . حقوق همسایه اسکار این اثر از: فرانسه منتشر شده است.
اگر در نظر بگیرید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا تکثیر شود ، لطفا:
- به طور واضح خود را با داده های تماس دقیق مانند آدرس ، شماره تلفن یا آدرس ایمیل که در آن می توانید با شما تماس بگیرید ، شناسایی کنید.
- به وضوح کار دارای حق چاپ را که ادعا می شود نقض شده است ، مشخص کنید.
- به روشنی مطالبی را که ادعا می شود نقض و اطلاعات کافی است ، مشخص کنید تا به ما اجازه دهد تا مواد را پیدا کنیم.
ما با از بین بردن منابع آسیب دیده از انتشار بعدی Corpus ، درخواست های قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 18808 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplated_lmo
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lmo')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح صدور مجوز منتشر می شوند ، ما هیچ یک از متنی را که این داده ها از آن استخراج شده است ، نداریم. ما بسته بندی واقعی این داده ها را تحت مجوز Creative Commons CC0 ("هیچ حقوقی محفوظ") مجوز می دهیم . حقوق همسایه اسکار این اثر از: فرانسه منتشر شده است.
اگر در نظر بگیرید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا تکثیر شود ، لطفا:
- به طور واضح خود را با داده های تماس دقیق مانند آدرس ، شماره تلفن یا آدرس ایمیل که در آن می توانید با شما تماس بگیرید ، شناسایی کنید.
- به وضوح کار دارای حق چاپ را که ادعا می شود نقض شده است ، مشخص کنید.
- به روشنی مطالبی را که ادعا می شود نقض و اطلاعات کافی است ، مشخص کنید تا به ما اجازه دهد تا مواد را پیدا کنیم.
ما با از بین بردن منابع آسیب دیده از انتشار بعدی Corpus ، درخواست های قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 1374 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lv
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lv')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح صدور مجوز منتشر می شوند ، ما هیچ یک از متنی را که این داده ها از آن استخراج شده است ، نداریم. ما بسته بندی واقعی این داده ها را تحت مجوز Creative Commons CC0 ("هیچ حقوقی محفوظ") مجوز می دهیم . حقوق همسایه اسکار این اثر از: فرانسه منتشر شده است.
اگر در نظر بگیرید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا تکثیر شود ، لطفا:
- به طور واضح خود را با داده های تماس دقیق مانند آدرس ، شماره تلفن یا آدرس ایمیل که در آن می توانید با شما تماس بگیرید ، شناسایی کنید.
- به وضوح کار دارای حق چاپ را که ادعا می شود نقض شده است ، مشخص کنید.
- به روشنی مطالبی را که ادعا می شود نقض و اطلاعات کافی است ، مشخص کنید تا به ما اجازه دهد تا مواد را پیدا کنیم.
ما با از بین بردن منابع آسیب دیده از انتشار بعدی Corpus ، درخواست های قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 843195 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_min
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_min')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح صدور مجوز منتشر می شوند ، ما هیچ یک از متنی را که این داده ها از آن استخراج شده است ، نداریم. ما بسته بندی واقعی این داده ها را تحت مجوز Creative Commons CC0 ("هیچ حقوقی محفوظ") مجوز می دهیم . حقوق همسایه اسکار این اثر از: فرانسه منتشر شده است.
اگر در نظر بگیرید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا تکثیر شود ، لطفا:
- به طور واضح خود را با داده های تماس دقیق مانند آدرس ، شماره تلفن یا آدرس ایمیل که در آن می توانید با شما تماس بگیرید ، شناسایی کنید.
- به وضوح کار دارای حق چاپ را که ادعا می شود نقض شده است ، مشخص کنید.
- به روشنی مطالبی را که ادعا می شود نقض و اطلاعات کافی است ، مشخص کنید تا به ما اجازه دهد تا مواد را پیدا کنیم.
ما با از بین بردن منابع آسیب دیده از انتشار بعدی Corpus ، درخواست های قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 166 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mr
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mr')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح صدور مجوز منتشر می شوند ، ما هیچ یک از متنی را که این داده ها از آن استخراج شده است ، نداریم. ما بسته بندی واقعی این داده ها را تحت مجوز Creative Commons CC0 ("هیچ حقوقی محفوظ") مجوز می دهیم . حقوق همسایه اسکار این اثر از: فرانسه منتشر شده است.
اگر در نظر بگیرید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا تکثیر شود ، لطفا:
- به طور واضح خود را با داده های تماس دقیق مانند آدرس ، شماره تلفن یا آدرس ایمیل که در آن می توانید با شما تماس بگیرید ، شناسایی کنید.
- به وضوح کار دارای حق چاپ را که ادعا می شود نقض شده است ، مشخص کنید.
- به روشنی مطالبی را که ادعا می شود نقض و اطلاعات کافی است ، مشخص کنید تا به ما اجازه دهد تا مواد را پیدا کنیم.
ما با از بین بردن منابع آسیب دیده از انتشار بعدی Corpus ، درخواست های قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 212556 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplated_mwl
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mwl')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح صدور مجوز منتشر می شوند ، ما هیچ یک از متنی را که این داده ها از آن استخراج شده است ، نداریم. ما بسته بندی واقعی این داده ها را تحت مجوز Creative Commons CC0 ("هیچ حقوقی محفوظ") مجوز می دهیم . حقوق همسایه اسکار این اثر از: فرانسه منتشر شده است.
اگر در نظر بگیرید که داده های ما حاوی مطالبی است که متعلق به شما است و بنابراین نباید در اینجا تکثیر شود ، لطفا:
- به طور واضح خود را با داده های تماس دقیق مانند آدرس ، شماره تلفن یا آدرس ایمیل که در آن می توانید با شما تماس بگیرید ، شناسایی کنید.
- به وضوح کار دارای حق چاپ را که ادعا می شود نقض شده است ، مشخص کنید.
- به روشنی مطالبی را که ادعا می شود نقض و اطلاعات کافی است ، مشخص کنید تا به ما اجازه دهد تا مواد را پیدا کنیم.
ما با از بین بردن منابع آسیب دیده از انتشار بعدی Corpus ، درخواست های قانونی را رعایت خواهیم کرد.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 7 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplated_nah
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nah')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
مجوز : این داده ها تحت این طرح صدور مجوز منتشر می شوند ، ما هیچ یک از متنی را که این داده ها از آن استخراج شده است ، نداریم. ما بسته بندی واقعی این داده ها را تحت مجوز Creative Commons CC0 ("هیچ حقوقی محفوظ") مجوز می دهیم . حقوق همسایه اسکار این اثر از: فرانسه منتشر شده است.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 58 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_new
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_new')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 2126 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_oc
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_oc')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 6485 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pam
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pam')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 1 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ps
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ps')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 67921 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_it
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_it')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 28522082 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ka
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ka')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 372158 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ro
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ro')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 5044757 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_scn
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_scn')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 17 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ko
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ko')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 3675420 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_kw
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kw')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 68 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lez
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lez')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 1381 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lrc
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lrc')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 72 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mg
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mg')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 13343 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ml
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ml')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 453904 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ms
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ms')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 183443 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_myv
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_myv')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 5 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_nds
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nds')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 8714 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_nn
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nn')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 109118 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_os
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_os')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 2559 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pms
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pms')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 2859 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_qu
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_qu')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 411 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sa
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sa')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 7121 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sk
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sk')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 2820821 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sh
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sh')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 17610 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_so
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_so')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 42 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sr
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sr')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 645747 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ta
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ta')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 833101 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tk
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tk')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 4694 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tyv
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tyv')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 24 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_uz
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_uz')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 15074 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_wa
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_wa')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 677 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_xmf
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_xmf')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 2418 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sv
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sv')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 11014487 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tg
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tg')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 56259 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_de
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_de')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 62398034 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tr
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tr')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 11596446 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_el
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_el')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 6521169 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_uk
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_uk')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 7782375 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_vi
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vi')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 9897709 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_wuu
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_wuu')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 64 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_yo
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yo')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 49 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_als
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_als')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 7324 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_arz
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_arz')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 158113 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_az
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_az')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 912330 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bcl
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_bcl')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 1 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bn
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_bn')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 1675515 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bs
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_bs')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 2143 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ce
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_ce')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 4042 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_cv
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_cv')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 20281 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_diq
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_diq')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 1 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_eml
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_eml')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 84 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_et
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_et')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 2093621 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_zh
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_zh')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 41708901 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_an
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_an')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 2449 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ast
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_ast')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 6999 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ba
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_ba')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 42551 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bg
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_bg')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 5869686 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bpy
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_bpy')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 6046 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ca
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_ca')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 4390754 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ckb
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_ckb')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 103639 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_es
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_es')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 56326016 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_da
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_da')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 7664010 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_dv
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_dv')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 21018 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_eo
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_eo')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 121168 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_fi
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fi')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 5326443 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ga
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ga')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 46493 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_gom
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gom')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 484 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_hr
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hr')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 321484 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_hy
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hy')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 396093 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ilo
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ilo')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 1578 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_fa
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_fa')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 13704702 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_fy
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_fy')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 33053 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_gn
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_gn')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 106 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_hi
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_hi')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 3264660 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_hu
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_hu')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 11197780 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ie
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_ie')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 101 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ja
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ja')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 39496439 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_kk
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kk')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 338073 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_krc
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_krc')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 1377 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ky
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ky')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 86561 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_li
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_li')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 118 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lt
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lt')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 1737411 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mhr
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mhr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
Splits :
تقسیم کنید | نمونه ها |
---|---|
'train' | 2515 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mn
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mn')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 197878 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mt
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mt')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
تقسیم کنید | نمونه ها |
---|---|
'train' | 16383 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mzn
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mzn')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
Splits :
تقسیم کنید | نمونه ها |
---|---|
'train' | 917 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ne
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ne')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 219334 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_no
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_no')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
تقسیم کنید | نمونه ها |
---|---|
'train' | 3229940 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pa
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pa')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 87235 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pnb
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pnb')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 3463 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_rm
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_rm')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 34 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sah
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sah')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 8555 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_si
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_si')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 120684 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sq
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sq')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 461598 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sw
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sw')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 24803 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_th
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_th')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 3749826 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tt
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tt')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 82738 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ur
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ur')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 428674 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_vo
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 3317 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_xal
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_xal')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 36 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_yue
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yue')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 7 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_am
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_am')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 83663 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_as
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_as')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 14985 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_azb
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_azb')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 15446 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_be
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_be')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 586031 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bo
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_bo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 26795 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bxr
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_bxr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 42 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ceb
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_ceb')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 56248 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_cy
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_cy')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 157698 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_dsb
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_dsb')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 65 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_fr
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_fr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
تقسیم ها :
Split | نمونه ها |
---|---|
'train' | 96742378 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_gd
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_gd')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 5799 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_gu
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_gu')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 240691 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_hsb
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_hsb')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 7959 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ia
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_ia')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
تقسیم ها :
Split | نمونه ها |
---|---|
'train' | 1040 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_io
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_io')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 694 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_jbo
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_jbo')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 832 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_km
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_km')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 159363 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ku
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_ku')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 46535 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_la
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_la')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 94588 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lmo
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_lmo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 1401 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lv
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_lv')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 1593820 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_min
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_min')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 220 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mr
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_mr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 326804 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mwl
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_mwl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
تقسیم ها :
Split | نمونه ها |
---|---|
'train' | 8 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_nah
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_nah')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 61 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_new
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_new')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 4696 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_oc
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_oc')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 10709 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pam
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_pam')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 3 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ps
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_ps')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 98216 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ro
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_ro')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 9387265 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_scn
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_scn')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 21 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sk
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_sk')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 5492194 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sr
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_sr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 1013619 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ta
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_ta')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 1263280 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tk
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_tk')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 6456 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tyv
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_tyv')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 34 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_uz
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_uz')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 27537 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_wa
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_wa')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 1001 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_xmf
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_xmf')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 3783 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_it
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_it')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
تقسیم کنید | نمونه ها |
---|---|
'train' | 46981781 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ka
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_ka')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 563916 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ko
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_ko')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 7345075 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_kw
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_kw')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 203 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lez
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_lez')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 1485 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lrc
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_lrc')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 88 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mg
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_mg')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 17957 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ml
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_ml')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 603937 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ms
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_ms')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 534016 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_myv
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_myv')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
تقسیم ها :
Split | نمونه ها |
---|---|
'train' | 6 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_nds
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_nds')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 18174 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_nn
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_nn')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 185884 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_os
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_os')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 5213 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pms
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_pms')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 3225 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_qu
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_qu')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 452 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sa
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_sa')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
تقسیم کنید | نمونه ها |
---|---|
'train' | 14291 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sh
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_sh')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 36700 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_so
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_so')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 156 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sv
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_sv')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 17395625 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tg
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_tg')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 89002 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tr
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_tr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 18535253 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_uk
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_uk')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 12973467 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_vi
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_vi')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 14898250 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_wuu
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_wuu')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 214 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_yo
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_yo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 214 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_zh
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_zh')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 60137667 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_en
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_en')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
Splits :
تقسیم کنید | نمونه ها |
---|---|
'train' | 304230423 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_eu
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eu')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
تقسیم کنید | نمونه ها |
---|---|
'train' | 256513 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_frr
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_frr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 7 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_gl
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
تقسیم ها :
Split | نمونه ها |
---|---|
'train' | 284320 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_he
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_he')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 2375030 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ht
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ht')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 9 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_id
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_id')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 9948521 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_is
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_is')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 389515 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_jv
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_jv')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
تقسیم کنید | نمونه ها |
---|---|
'train' | 1163 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_kn
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kn')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 251064 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_kv
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kv')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
تقسیم ها :
Split | نمونه ها |
---|---|
'train' | 924 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lb
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lb')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 21735 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lo
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 32652 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mai
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mai')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 25 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mk
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mk')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 299457 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mrj
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mrj')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 669 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_my
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_my')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 136639 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_nap
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nap')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 55 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_nl
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 20812149 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_or
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_or')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 44230 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pl
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 20682611 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pt
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pt')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 26920397 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ru
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ru')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 115954598 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sd
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sd')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 33925 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sl
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 886223 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_su
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_su')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 511 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_te
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_te')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 312644 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tl
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 294132 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ug
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ug')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
تقسیم ها :
Split | نمونه ها |
---|---|
'train' | 15503 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_vec
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vec')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 64 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_war
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_war')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 9161 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_yi
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yi')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 32919 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_af
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_af')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 201117 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ar
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_ar')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 16365602 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_av
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_av')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 456 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bar
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_bar')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 4 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bh
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_bh')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 336 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_br
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_br')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
تقسیم کنید | نمونه ها |
---|---|
'train' | 37085 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_cbk
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_cbk')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 1 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_cs
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_cs')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 21001388 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_de
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_de')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 104913504 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_el
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_el')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 10425596 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_es
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_es')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 88199221 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_fi
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_fi')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 8557453 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ga
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_ga')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 83223 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_gom
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_gom')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 640 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_hr
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_hr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 582219 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_hy
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_hy')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 659430 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ilo
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_ilo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
Split | نمونه ها |
---|---|
'train' | 2638 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ja
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_ja')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
تقسیم کنید | نمونه ها |
---|---|
'train' | 62721527 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_kk
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_kk')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 524591 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_krc
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_krc')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 1581 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ky
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_ky')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 146993 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_li
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_li')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 137 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lt
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_lt')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
تقسیم کنید | نمونه ها |
---|---|
'train' | 2977757 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mhr
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_mhr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 3212 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mn
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_mn')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 395605 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mt
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_mt')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 26598 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mzn
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_mzn')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 1055 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ne
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_ne')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 299938 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_no
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_no')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 5546211 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pa
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_pa')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 127467 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pnb
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_pnb')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 4599 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_rm
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_rm')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 41 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sah
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_sah')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 22301 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_si
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_si')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
تقسیم ها :
Split | نمونه ها |
---|---|
'train' | 203082 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sq
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_sq')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 672077 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sw
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_sw')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 41986 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_th
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_th')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 6064129 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tt
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_tt')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
Splits :
تقسیم کنید | نمونه ها |
---|---|
'train' | 135923 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ur
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_ur')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
Splits :
تقسیم کنید | نمونه ها |
---|---|
'train' | 638596 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_vo
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_vo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 3366 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_xal
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_xal')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
تقسیم ها :
Split | نمونه ها |
---|---|
'train' | 39 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_yue
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_yue')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 11 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_en
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_en')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 455994980 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_eu
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_eu')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 506883 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_frr
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_frr')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 7 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_gl
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_gl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 544388 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_he
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_he')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 3808397 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ht
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_ht')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 13 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_id
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_id')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 16236463 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_is
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_is')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 625673 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_jv
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_jv')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 1445 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_kn
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_kn')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 350363 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_kv
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_kv')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
تقسیم کنید | نمونه ها |
---|---|
'train' | 1549 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lb
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_lb')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 34807 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lo
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_lo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
تقسیم ها :
Split | نمونه ها |
---|---|
'train' | 52910 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mai
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_mai')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 123 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mk
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_mk')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 437871 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mrj
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_mrj')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 757 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_my
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_my')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 232329 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_nap
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_nap')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 73 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_nl
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_nl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
نسخه : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 34682142 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_or
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_or')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 59463 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pl
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_pl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 35440972 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pt
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_pt')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 42114520 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ru
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_ru')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 161836003 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sd
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_sd')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 44280 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sl
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_sl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 1746604 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_su
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_su')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
تقسیم کنید | نمونه ها |
---|---|
'train' | 805 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_te
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_te')
- توضیحات :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 475703 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tl
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_tl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 458206 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ug
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_ug')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
تقسیم ها :
Split | نمونه ها |
---|---|
'train' | 22255 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_vec
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_vec')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 73 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_war
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_war')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 9760 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_yi
برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:
ds = tfds.load('huggingface:oscar/unshuffled_original_yi')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | نمونه ها |
---|---|
'train' | 59364 |
- ویژگی ها :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}