Ссылки:
unshuffled_dedupliced_af
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_af')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:
- Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
- Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
- Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.
Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 130640 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_dedupliced_als
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_als')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:
- Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
- Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
- Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.
Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 4518 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduulated_arz
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_arz')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:
- Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
- Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
- Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.
Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 79928 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduulated_an
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_an')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:
- Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
- Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
- Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.
Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 2025 год |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_dedupliced_ast
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ast')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:
- Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
- Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
- Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.
Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 5343 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_dedupliced_ba
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ba')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:
- Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
- Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
- Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.
Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 27050 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduulated_am
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_am')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:
- Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
- Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
- Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.
Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 43102 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_dedupliced_as
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_as')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:
- Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
- Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
- Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.
Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 9212 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduulated_azb
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_azb')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:
- Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
- Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
- Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.
Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 9985 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduulated_be
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_be')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:
- Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
- Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
- Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.
Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 307405 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_dedupliced_bo
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bo')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:
- Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
- Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
- Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.
Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 15762 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_dedupliced_bxr
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bxr')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:
- Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
- Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
- Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.
Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 36 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_dedupliced_ceb
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ceb')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:
- Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
- Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
- Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.
Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 26145 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduulated_az
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_az')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:
- Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
- Четко обозначьте произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
- Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.
Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 626796 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_dedupliced_bcl
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bcl')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:
- Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
- Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
- Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.
Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 1 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduulated_cy
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cy')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:
- Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
- Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
- Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.
Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 98225 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_dedupliced_dsb
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dsb')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:
- Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
- Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
- Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.
Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 37 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_dedupliced_bn
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bn')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:
- Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
- Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
- Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.
Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 1114481 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_dedupliced_bs
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bs')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:
- Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
- Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
- Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.
Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 702 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduulated_ce
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ce')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:
- Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
- Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
- Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.
Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 2984 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_dedupliced_cv
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cv')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:
- Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
- Четко обозначьте произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
- Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.
Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 10130 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_dedupliced_diq
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_diq')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:
- Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
- Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
- Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.
Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 1 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduulated_eml
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eml')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:
- Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
- Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
- Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.
Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 80 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_dedupliced_et
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_et')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:
- Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
- Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
- Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.
Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 1172041 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_dedupliced_bg
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bg')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:
- Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
- Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
- Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.
Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 3398679 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_dedupliced_bpy
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bpy')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:
- Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
- Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
- Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.
Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 1770 г. |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_dedupliced_ca
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ca')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:
- Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
- Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
- Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.
Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 2458067 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_dedupliced_ckb
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ckb')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:
- Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
- Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
- Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.
Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 68210 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_dedupliced_ar
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ar')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:
- Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
- Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
- Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.
Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 9006977 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_dedupliced_av
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_av')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:
- Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
- Четко обозначьте произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
- Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.
Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 360 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshaffled_deduplicate_bar
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bar')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:
- Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
- Четко определить, что авторское право, утверждаемое нарушением.
- Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.
Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 4 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshaffled_deduplicate_bh
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bh')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:
- Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
- Четко определить, что авторское право, утверждаемое нарушением.
- Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.
Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 82 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshaffled_deduplicate_br
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_br')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:
- Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
- Четко определить, что авторское право, утверждаемое нарушением.
- Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.
Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 14724 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshaffled_deduplicate_cbk
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cbk')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:
- Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
- Четко определить, что авторское право, утверждаемое нарушением.
- Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.
Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 1 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshaffled_deduplicate_da
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_da')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:
- Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
- Четко определить, что авторское право, утверждаемое нарушением.
- Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.
Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 4771098 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshaffled_deduplicate_dv
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dv')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:
- Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
- Четко определить, что авторское право, утверждаемое нарушением.
- Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.
Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 17024 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshaffled_deduplicate_eo
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eo')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:
- Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
- Четко определить, что авторское право, утверждаемое нарушением.
- Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.
Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 84752 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshaffled_deduplicate_fa
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fa')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:
- Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
- Четко определить, что авторское право, утверждаемое нарушением.
- Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.
Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 8203495 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshaffled_deduplicate_fy
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fy')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:
- Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
- Четко определить, что авторское право, утверждаемое нарушением.
- Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.
Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 20661 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshaffled_deduplicate_gn
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gn')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:
- Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
- Четко определить, что авторское право, утверждаемое нарушением.
- Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.
Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 68 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshaffled_deduplicate_cs
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cs')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:
- Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
- Четко определить, что авторское право, утверждаемое нарушением.
- Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.
Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 12308039 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshaffled_deduplicate_hi
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hi')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:
- Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
- Четко определить, что авторское право, утверждаемое нарушением.
- Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.
Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 1909387 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshaffled_deduplicate_hu
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hu')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:
- Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
- Четко определить, что авторское право, утверждаемое нарушением.
- Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.
Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 6582908 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshaffled_deduplicate_ie
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ie')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:
- Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
- Четко определить, что авторское право, утверждаемое нарушением.
- Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.
Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 11 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshaffled_deduplicate_fr
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fr')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:
- Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
- Четко определить, что авторское право, утверждаемое нарушением.
- Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.
Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 59448891 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshaffled_deduplicate_gd
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gd')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:
- Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
- Четко определить, что авторское право, утверждаемое нарушением.
- Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.
Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 3883 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshaffled_deduplicate_gu
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gu')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:
- Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
- Четко определить, что авторское право, утверждаемое нарушением.
- Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.
Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 169834 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshaffled_deduplicate_hsb
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hsb')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:
- Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
- Четко определить, что авторское право, утверждаемое нарушением.
- Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.
Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 3084 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshaffled_deduplicate_ia
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ia')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:
- Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
- Четко определить, что авторское право, утверждаемое нарушением.
- Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.
Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 529 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshaffled_deduplicate_io
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_io')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:
- Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
- Четко определить, что авторское право, утверждаемое нарушением.
- Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.
Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 617 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshaffled_deduplicate_jbo
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_jbo')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:
- Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
- Четко определить, что авторское право, утверждаемое нарушением.
- Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.
Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 617 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshaffled_deduplicate_km
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_km')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:
- Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
- Четко определить, что авторское право, утверждаемое нарушением.
- Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.
Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 108346 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshaffled_deduplicate_ku
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ku')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:
- Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
- Четко определить, что авторское право, утверждаемое нарушением.
- Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.
Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 29054 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshaffled_deduplicate_la
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_la')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:
- Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
- Четко определить, что авторское право, утверждаемое нарушением.
- Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.
Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 18808 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshaffled_deduplicate_lmo
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lmo')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:
- Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
- Четко определить, что авторское право, утверждаемое нарушением.
- Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.
Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 1374 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshaffled_deduplicate_lv
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lv')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:
- Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
- Четко определить, что авторское право, утверждаемое нарушением.
- Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.
Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 843195 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshaffled_deduplicate_min
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_min')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:
- Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
- Четко определить, что авторское право, утверждаемое нарушением.
- Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.
Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 166 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshaffled_deduplicate_mr
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mr')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:
- Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
- Четко определить, что авторское право, утверждаемое нарушением.
- Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.
Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 212556 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshaffled_deduplicate_mwl
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mwl')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:
- Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
- Четко определить, что авторское право, утверждаемое нарушением.
- Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.
Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 7 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshaffled_deduplicate_nah
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nah')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.
Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 58 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_new
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_new')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 2126 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_oc
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_oc')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 6485 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pam
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pam')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 1 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ps
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ps')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 67921 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_it
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_it')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 28522082 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ka
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ka')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 372158 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ro
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ro')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 5044757 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_scn
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_scn')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 17 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ko
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ko')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 3675420 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_kw
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kw')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 68 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lez
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lez')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 1381 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lrc
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lrc')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 72 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mg
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mg')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 13343 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ml
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ml')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 453904 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ms
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ms')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 183443 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_myv
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_myv')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 5 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_nds
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nds')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 8714 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_nn
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nn')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 109118 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_os
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_os')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 2559 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pms
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pms')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 2859 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_qu
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_qu')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 411 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sa
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sa')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 7121 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sk
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sk')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 2820821 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sh
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sh')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 17610 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_so
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_so')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 42 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sr
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sr')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 645747 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ta
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ta')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 833101 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tk
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tk')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 4694 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tyv
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tyv')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 24 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_uz
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_uz')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 15074 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_wa
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_wa')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 677 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_xmf
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_xmf')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 2418 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sv
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sv')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 11014487 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tg
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tg')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 56259 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_de
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_de')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 62398034 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tr
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tr')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 11596446 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_el
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_el')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 6521169 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_uk
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_uk')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 7782375 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_vi
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vi')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 9897709 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_wuu
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_wuu')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 64 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_yo
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yo')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 49 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_als
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_als')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 7324 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_arz
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_arz')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 158113 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_az
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_az')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 912330 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bcl
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bcl')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 1 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bn
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bn')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 1675515 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bs
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bs')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 2143 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ce
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ce')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 4042 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_cv
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_cv')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 20281 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_diq
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_diq')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 1 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_eml
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_eml')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 84 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_et
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_et')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 2093621 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_zh
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_zh')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 41708901 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_an
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_an')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 2449 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ast
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ast')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 6999 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ba
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ba')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 42551 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bg
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bg')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 5869686 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bpy
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bpy')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 6046 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ca
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ca')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 4390754 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ckb
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ckb')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 103639 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_es
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_es')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 56326016 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_da
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_da')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 7664010 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_dv
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_dv')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 21018 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_eo
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_eo')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 121168 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_fi
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fi')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 5326443 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ga
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ga')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 46493 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_gom
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gom')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 484 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_hr
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hr')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 321484 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_hy
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hy')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 396093 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ilo
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ilo')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 1578 г. |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_fa
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_fa')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 13704702 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_fy
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_fy')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 33053 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_gn
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_gn')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 106 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_hi
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_hi')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 3264660 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_hu
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_hu')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 11197780 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ie
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ie')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 101 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ja
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ja')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 39496439 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_kk
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kk')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 338073 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_krc
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_krc')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 1377 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ky
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ky')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 86561 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_li
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_li')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 118 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lt
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lt')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 1737411 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mhr
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mhr')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 2515 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mn
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mn')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 197878 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mt
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mt')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 16383 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mzn
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mzn')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 917 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ne
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ne')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 219334 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_no
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_no')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 3229940 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pa
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pa')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 87235 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pnb
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pnb')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 3463 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_rm
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_rm')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 34 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sah
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sah')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 8555 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_si
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_si')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 120684 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sq
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sq')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 461598 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sw
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sw')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 24803 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_th
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_th')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 3749826 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tt
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tt')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 82738 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ur
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ur')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 428674 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_vo
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vo')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 3317 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_xal
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_xal')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 36 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_yue
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yue')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 7 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_am
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_am')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 83663 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_as
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_as')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 14985 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_azb
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_azb')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 15446 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_be
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_be')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 586031 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bo
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bo')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 26795 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bxr
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bxr')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 42 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ceb
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ceb')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 56248 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_cy
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_cy')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 157698 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_dsb
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_dsb')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 65 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_fr
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_fr')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 96742378 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_gd
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_gd')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 5799 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_gu
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_gu')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 240691 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_hsb
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_hsb')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 7959 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ia
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ia')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 1040 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_io
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_io')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 694 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_jbo
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_jbo')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 832 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_km
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_km')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 159363 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ku
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ku')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 46535 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_la
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_la')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 94588 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lmo
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lmo')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 1401 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lv
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lv')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 1593820 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_min
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_min')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 220 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mr
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mr')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 326804 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mwl
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mwl')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 8 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_nah
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_nah')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 61 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_new
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_new')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 4696 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_oc
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_oc')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 10709 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pam
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_pam')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 3 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ps
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ps')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 98216 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ro
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ro')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 9387265 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_scn
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_scn')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 21 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sk
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sk')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 5492194 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sr
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sr')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 1013619 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ta
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ta')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 1263280 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tk
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_tk')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 6456 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tyv
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_tyv')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 34 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_uz
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_uz')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 27537 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_wa
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_wa')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 1001 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_xmf
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_xmf')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 3783 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_it
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_it')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 46981781 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ka
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ka')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 563916 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ko
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ko')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 7345075 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_kw
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_kw')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 203 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lez
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lez')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 1485 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lrc
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lrc')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 88 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mg
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mg')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 17957 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ml
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ml')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 603937 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ms
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ms')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 534016 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_myv
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_myv')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 6 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_nds
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_nds')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 18174 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_nn
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_nn')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 185884 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_os
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_os')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 5213 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pms
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_pms')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 3225 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_qu
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_qu')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 452 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sa
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sa')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 14291 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sh
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sh')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 36700 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_so
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_so')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 156 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sv
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sv')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 17395625 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tg
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_tg')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 89002 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tr
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_tr')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 18535253 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_uk
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_uk')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 12973467 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_vi
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_vi')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 14898250 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_wuu
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_wuu')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 214 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_yo
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_yo')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 214 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_zh
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_zh')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 60137667 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_en
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_en')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 304230423 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_eu
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eu')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 256513 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_frr
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_frr')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 7 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_gl
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gl')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 284320 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_he
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_he')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 2375030 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ht
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ht')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 9 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_id
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_id')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 9948521 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_is
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_is')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 389515 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_jv
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_jv')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 1163 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_kn
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kn')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 251064 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_kv
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kv')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 924 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lb
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lb')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 21735 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lo
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lo')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 32652 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mai
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mai')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 25 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mk
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mk')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 299457 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mrj
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mrj')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 669 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_my
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_my')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 136639 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_nap
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nap')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 55 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_nl
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nl')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 20812149 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_or
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_or')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 44230 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pl
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pl')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 20682611 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pt
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pt')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 26920397 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ru
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ru')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 115954598 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sd
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sd')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 33925 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sl
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sl')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 886223 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_su
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_su')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 511 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_te
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_te')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 312644 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tl
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tl')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 294132 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ug
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ug')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 15503 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_vec
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vec')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 64 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_war
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_war')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 9161 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_yi
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yi')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 32919 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_af
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_af')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 201117 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ar
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ar')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 16365602 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_av
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_av')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 456 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bar
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bar')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 4 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bh
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bh')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 336 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_br
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_br')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 37085 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_cbk
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_cbk')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 1 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_cs
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_cs')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 21001388 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_de
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_de')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 104913504 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_el
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_el')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 10425596 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_es
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_es')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 88199221 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_fi
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_fi')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 8557453 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ga
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ga')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 83223 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_gom
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_gom')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 640 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_hr
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_hr')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 582219 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_hy
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_hy')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 659430 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ilo
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ilo')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 2638 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ja
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ja')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 62721527 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_kk
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_kk')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 524591 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_krc
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_krc')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 1581 г. |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ky
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ky')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 146993 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_li
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_li')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 137 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lt
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lt')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 2977757 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mhr
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mhr')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 3212 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mn
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mn')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 395605 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mt
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mt')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 26598 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mzn
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mzn')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 1055 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ne
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ne')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 299938 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_no
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_no')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 5546211 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pa
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_pa')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 127467 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pnb
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_pnb')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 4599 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_rm
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_rm')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 41 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sah
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sah')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 22301 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_si
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_si')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 203082 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sq
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sq')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 672077 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sw
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sw')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 41986 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_th
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_th')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 6064129 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tt
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_tt')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 135923 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ur
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ur')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 638596 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_vo
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_vo')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 3366 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_xal
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_xal')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 39 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_yue
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_yue')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 11 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_en
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_en')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Версия : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 455994980 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_eu
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_eu')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 506883 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_frr
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_frr')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 7 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_gl
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_gl')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 544388 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_he
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_he')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 3808397 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ht
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ht')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 13 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_id
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_id')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 16236463 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_is
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_is')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 625673 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_jv
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_jv')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 1445 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_kn
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_kn')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 350363 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_kv
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_kv')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 1549 г. |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lb
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lb')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 34807 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lo
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lo')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 52910 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mai
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mai')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 123 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mk
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mk')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 437871 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mrj
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mrj')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 757 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_my
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_my')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 232329 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_nap
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_nap')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 73 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_nl
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_nl')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 34682142 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_or
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_or')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 59463 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pl
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_pl')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 35440972 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pt
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_pt')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 42114520 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ru
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ru')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 161836003 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sd
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sd')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 44280 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sl
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sl')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 1746604 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_su
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_su')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 805 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_te
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_te')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 475703 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tl
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_tl')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 458206 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ug
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ug')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 22255 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_vec
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_vec')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 73 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_war
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_war')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 9760 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_yi
Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_yi')
- Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Расколы :
Расколоть | Примеры |
---|---|
'train' | 59364 |
- Функции :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}