Оскар

Ссылки:

unshuffled_dedupliced_af

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_af')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
    • Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
    • Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.

    Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 130640
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_dedupliced_als

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_als')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
    • Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
    • Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.

    Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 4518
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduulated_arz

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_arz')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
    • Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
    • Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.

    Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 79928
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduulated_an

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_an')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
    • Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
    • Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.

    Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 2025 год
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_dedupliced_ast

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ast')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
    • Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
    • Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.

    Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 5343
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_dedupliced_ba

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ba')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
    • Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
    • Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.

    Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 27050
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduulated_am

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_am')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
    • Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
    • Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.

    Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 43102
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_dedupliced_as

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_as')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
    • Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
    • Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.

    Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 9212
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_dedupliced_azb

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_azb')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
    • Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
    • Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.

    Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 9985
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_dedupliced_be

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_be')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
    • Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
    • Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.

    Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 307405
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_dedupliced_bo

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bo')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
    • Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
    • Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.

    Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 15762
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_dedupliced_bxr

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bxr')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
    • Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
    • Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.

    Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 36
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_dedupliced_ceb

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ceb')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
    • Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
    • Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.

    Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 26145
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_dedupliced_az

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_az')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
    • Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
    • Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.

    Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 626796
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_dedupliced_bcl

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bcl')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
    • Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
    • Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.

    Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 1
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_dedupliced_cy

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cy')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
    • Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
    • Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.

    Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 98225
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_dedupliced_dsb

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dsb')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
    • Четко обозначьте произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
    • Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.

    Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 37
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_dedupliced_bn

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bn')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
    • Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
    • Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.

    Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 1114481
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_dedupliced_bs

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bs')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
    • Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
    • Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.

    Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 702
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_dedupliced_ce

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ce')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
    • Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
    • Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.

    Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 2984
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_dedupliced_cv

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cv')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
    • Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
    • Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.

    Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 10130
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_dedupliced_diq

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_diq')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
    • Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
    • Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.

    Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 1
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduulated_eml

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eml')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
    • Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
    • Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.

    Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 80
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_dedupliced_et

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_et')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
    • Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
    • Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.

    Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 1172041
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_dedupliced_bg

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bg')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
    • Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
    • Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.

    Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 3398679
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_dedupliced_bpy

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bpy')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
    • Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
    • Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.

    Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 1770 г.
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_dedupliced_ca

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ca')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
    • Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
    • Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.

    Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 2458067
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduulated_ckb

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ckb')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
    • Четко обозначьте произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
    • Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.

    Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 68210
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_dedupliced_ar

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ar')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
    • Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
    • Четко укажите материал, который, как утверждается, нарушает авторские права, и информацию, достаточную для того, чтобы мы могли обнаружить этот материал.

    Мы выполним законные запросы, удалив затронутые источники из следующей версии корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 9006977
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_dedupliced_av

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_av')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : Эти данные публикуются по этой схеме лицензирования. Мы не владеем текстом, из которого были извлечены эти данные. Мы лицензируем фактическую упаковку этих данных по лицензии Creative Commons CC0 («права не защищены») http://creativecommons.org/publicdomain/zero/1.0/ Насколько это возможно по закону, Inria отказалась от всех авторских прав и связанных с ними или смежные права с ОСКАР. Эта работа опубликована в: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вам и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Четко идентифицируйте себя, указав подробные контактные данные, такие как адрес, номер телефона или адрес электронной почты, по которым с вами можно связаться.
    • Четко укажите произведение, защищенное авторским правом, которое, как утверждается, было нарушено.
    • Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.

    Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 360
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshaffled_deduplicate_bar

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bar')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
    • Четко определить, что авторское право, утверждаемое нарушением.
    • Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.

    Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 4
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshaffled_deduplicate_bh

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bh')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
    • Четко определить, что авторское право, утверждаемое нарушением.
    • Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.

    Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 82
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshaffled_deduplicate_br

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_br')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
    • Четко определить, что авторское право, утверждаемое нарушением.
    • Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.

    Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 14724
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshaffled_deduplicate_cbk

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cbk')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
    • Четко определить, что авторское право, утверждаемое нарушением.
    • Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.

    Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 1
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshaffled_deduplicate_da

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_da')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
    • Четко определить, что авторское право, утверждаемое нарушением.
    • Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.

    Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 4771098
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshaffled_deduplicate_dv

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dv')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
    • Четко определить, что авторское право, утверждаемое нарушением.
    • Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.

    Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 17024
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshaffled_deduplicate_eo

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eo')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
    • Четко определить, что авторское право, утверждаемое нарушением.
    • Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.

    Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 84752
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshaffled_deduplicate_fa

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fa')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
    • Четко определить, что авторское право, утверждаемое нарушением.
    • Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.

    Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 8203495
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshaffled_deduplicate_fy

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fy')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
    • Четко определить, что авторское право, утверждаемое нарушением.
    • Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.

    Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 20661
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshaffled_deduplicate_gn

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gn')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
    • Четко определить, что авторское право, утверждаемое нарушением.
    • Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.

    Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 68
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshaffled_deduplicate_cs

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cs')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
    • Четко определить, что авторское право, утверждаемое нарушением.
    • Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.

    Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 12308039
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshaffled_deduplicate_hi

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hi')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
    • Четко определить, что авторское право, утверждаемое нарушением.
    • Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.

    Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 1909387
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshaffled_deduplicate_hu

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hu')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
    • Четко определить, что авторское право, утверждаемое нарушением.
    • Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.

    Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 6582908
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshaffled_deduplicate_ie

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ie')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
    • Четко определить, что авторское право, утверждаемое нарушением.
    • Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.

    Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 11
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshaffled_deduplicate_fr

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fr')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
    • Четко определить, что авторское право, утверждаемое нарушением.
    • Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.

    Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 59448891
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshaffled_deduplicate_gd

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gd')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
    • Четко определить, что авторское право, утверждаемое нарушением.
    • Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.

    Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 3883
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshaffled_deduplicate_gu

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gu')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
    • Четко определить, что авторское право, утверждаемое нарушением.
    • Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.

    Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 169834
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshaffled_deduplicate_hsb

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hsb')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
    • Четко определить, что авторское право, утверждаемое нарушением.
    • Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.

    Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 3084
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshaffled_deduplicate_ia

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ia')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
    • Четко определить, что авторское право, утверждаемое нарушением.
    • Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.

    Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 529
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshaffled_deduplicate_io

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_io')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
    • Четко определить, что авторское право, утверждаемое нарушением.
    • Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.

    Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 617
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshaffled_deduplicate_jbo

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_jbo')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
    • Четко определить, что авторское право, утверждаемое нарушением.
    • Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.

    Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 617
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshaffled_deduplicate_km

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_km')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
    • Четко определить, что авторское право, утверждаемое нарушением.
    • Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.

    Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 108346
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshaffled_deduplicate_ku

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ku')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
    • Четко определить, что авторское право, утверждаемое нарушением.
    • Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.

    Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 29054
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshaffled_deduplicate_la

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_la')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
    • Четко определить, что авторское право, утверждаемое нарушением.
    • Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.

    Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 18808
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshaffled_deduplicate_lmo

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lmo')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
    • Четко определить, что авторское право, утверждаемое нарушением.
    • Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.

    Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 1374
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshaffled_deduplicate_lv

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lv')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
    • Четко определить, что авторское право, утверждаемое нарушением.
    • Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.

    Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 843195
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshaffled_deduplicate_min

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_min')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
    • Четко определить, что авторское право, утверждаемое нарушением.
    • Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.

    Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 166
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshaffled_deduplicate_mr

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mr')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
    • Четко определить, что авторское право, утверждаемое нарушением.
    • Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.

    Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 212556
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshaffled_deduplicate_mwl

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mwl')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.

    Если вы считаете, что наши данные содержат материал, который принадлежит вами и поэтому не должен воспроизводиться здесь, пожалуйста:

    • Ясно идентифицируйте себя, с подробными контактными данными, такими как адрес, номер телефона или адрес электронной почты, с которыми можно связаться.
    • Четко определить, что авторское право, утверждаемое нарушением.
    • Четко определить материал, который, как утверждается, нарушает, и информация достаточно достаточно, чтобы позволить нам найти материал.

    Мы будем выполнять законные запросы, удалив затронутые источники из следующего выпуска корпуса.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 7
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshaffled_deduplicate_nah

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nah')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Лицензия : эти данные выпускаются в соответствии с этой схемой лицензирования. У нас нет ни одного из текста, из которого эти данные были извлечены. Мы лицензируем фактическую упаковку этих данных в соответствии с лицензией Creative Commons CC0 («Без права защищены») http://creativecommons.org/publicdomain/zero/1.0/ В той степени, возможно Соседние права на Оскар. Эта работа опубликована от: Франция.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 58
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_new

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_new')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 2126
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_oc

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_oc')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 6485
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pam

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pam')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 1
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ps

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ps')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 67921
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_it

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_it')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 28522082
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ka

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ka')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 372158
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ro

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ro')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 5044757
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_scn

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_scn')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 17
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ko

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ko')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 3675420
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_kw

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kw')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 68
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lez

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lez')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 1381
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lrc

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lrc')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 72
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mg

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mg')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 13343
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ml

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ml')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 453904
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ms

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ms')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 183443
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_myv

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_myv')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 5
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_nds

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nds')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 8714
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_nn

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nn')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 109118
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_os

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_os')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 2559
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pms

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pms')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 2859
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_qu

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_qu')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 411
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sa

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sa')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 7121
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sk

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sk')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 2820821
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sh

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sh')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 17610
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_so

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_so')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 42
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sr

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sr')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 645747
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ta

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ta')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 833101
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_tk

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tk')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 4694
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_tyv

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tyv')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 24
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_uz

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_uz')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 15074
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_wa

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_wa')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 677
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_xmf

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_xmf')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 2418
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sv

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sv')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 11014487
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_tg

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tg')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 56259
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_de

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_de')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 62398034
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_tr

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tr')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 11596446
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_el

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_el')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 6521169
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_uk

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_uk')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 7782375
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_vi

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vi')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 9897709
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_wuu

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_wuu')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 64
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_yo

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yo')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 49
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_als

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_als')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 7324
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_arz

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_arz')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 158113
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_az

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_az')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 912330
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bcl

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bcl')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 1
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bn

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bn')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 1675515
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bs

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bs')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 2143
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ce

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ce')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 4042
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_cv

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_cv')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 20281
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_diq

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_diq')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 1
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_eml

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_eml')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 84
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_et

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_et')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 2093621
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_zh

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_zh')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 41708901
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_an

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_an')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 2449
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ast

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ast')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 6999
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ba

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ba')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 42551
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bg

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bg')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 5869686
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bpy

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bpy')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 6046
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ca

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ca')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 4390754
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ckb

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ckb')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 103639
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_es

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_es')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 56326016
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_da

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_da')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 7664010
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_dv

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_dv')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 21018
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_eo

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_eo')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 121168
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_fi

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fi')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 5326443
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ga

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ga')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 46493
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_gom

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gom')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 484
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_hr

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hr')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 321484
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_hy

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hy')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 396093
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ilo

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ilo')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 1578 г.
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_fa

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_fa')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 13704702
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_fy

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_fy')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 33053
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_gn

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gn')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 106
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_hi

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hi')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 3264660
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_hu

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hu')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 11197780
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ie

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ie')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 101
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ja

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ja')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 39496439
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_kk

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kk')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 338073
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_krc

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_krc')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 1377
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ky

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ky')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 86561
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_li

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_li')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 118
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lt')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 1737411
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mhr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mhr')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 2515
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mn')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 197878
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mt

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mt')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 16383
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mzn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mzn')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 917
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ne

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ne')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 219334
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_no

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_no')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 3229940
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pa

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pa')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 87235
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pnb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pnb')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 3463
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_rm

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_rm')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 34
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sah

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sah')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 8555
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_si

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_si')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 120684
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sq

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sq')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 461598
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sw

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sw')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 24803
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_th

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_th')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 3749826
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_tt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tt')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 82738
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ur

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ur')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 428674
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_vo

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vo')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 3317
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_xal

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_xal')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 36
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_yue

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yue')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 7
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_am

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_am')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 83663
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_as

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_as')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 14985
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_azb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_azb')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 15446
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_be

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_be')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 586031
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bo')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 26795
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bxr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bxr')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 42
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ceb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ceb')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 56248
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_cy

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_cy')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 157698
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_dsb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_dsb')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 65
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_fr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_fr')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 96742378
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_gd

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gd')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 5799
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_gu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gu')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 240691
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_hsb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hsb')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 7959
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ia

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ia')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 1040
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_io

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_io')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 694
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_jbo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_jbo')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 832
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_km

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_km')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 159363
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ku

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ku')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 46535
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_la

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_la')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 94588
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lmo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lmo')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 1401
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lv')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 1593820
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_min

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_min')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 220
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mr')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 326804
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mwl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mwl')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 8
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_nah

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nah')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 61
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_new

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_new')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 4696
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_oc

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_oc')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 10709
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_pam

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pam')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 3
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ps

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ps')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 98216
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ro

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ro')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 9387265
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_scn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_scn')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 21
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sk')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 5492194
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sr')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 1013619
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ta

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ta')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 1263280
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_tk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tk')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 6456
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_tyv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tyv')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Splits :

Расколоть Примеры
'train' 34
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_uz

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_uz')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 27537
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_wa

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_wa')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 1001
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_xmf

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_xmf')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 3783
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_it

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_it')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 46981781
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ka

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ka')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 563916
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ko

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ko')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 7345075
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_kw

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_kw')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 203
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lez

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lez')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 1485
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lrc

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lrc')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 88
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mg

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mg')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 17957
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ml

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ml')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 603937
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ms

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ms')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 534016
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_myv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_myv')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 6
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_nds

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nds')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 18174
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_nn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nn')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 185884
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_os

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_os')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 5213
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_pms

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pms')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 3225
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_qu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_qu')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 452
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sa

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sa')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 14291
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sh

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sh')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 36700
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_so

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_so')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 156
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sv')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Splits :

Расколоть Примеры
'train' 17395625
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_tg

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tg')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 89002
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_tr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tr')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 18535253
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_uk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_uk')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 12973467
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_vi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_vi')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 14898250
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_wuu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_wuu')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 214
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_yo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_yo')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 214
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_zh

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_zh')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 60137667
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_en

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_en')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 304230423
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_eu

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eu')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 256513
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_frr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_frr')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 7
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_gl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gl')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 284320
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_he

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_he')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 2375030
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ht

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ht')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 9
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_id

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_id')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 9948521
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_is

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_is')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 389515
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_jv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_jv')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Splits :

Расколоть Примеры
'train' 1163
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_kn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kn')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 251064
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_kv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kv')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 924
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lb')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 21735
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lo')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 32652
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mai

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mai')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 25
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mk')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 299457
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mrj

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mrj')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 669
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_my

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_my')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 136639
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_nap

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nap')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 55
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_nl

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nl')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 20812149
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_or

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_or')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 44230
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pl')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 20682611
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pt')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 26920397
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ru

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ru')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 115954598
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sd

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sd')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 33925
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sl')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 886223
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_su

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_su')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Splits :

Расколоть Примеры
'train' 511
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_te

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_te')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 312644
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_tl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tl')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 294132
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ug

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ug')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 15503
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_vec

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vec')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 64
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_war

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_war')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 9161
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_yi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yi')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 32919
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_af

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_af')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 201117
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ar

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ar')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 16365602
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_av

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_av')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 456
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bar

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bar')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 4
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bh

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bh')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 336
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_br

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_br')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 37085
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_cbk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_cbk')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 1
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_cs

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_cs')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Splits :

Расколоть Примеры
'train' 21001388
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_de

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_de')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 104913504
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_el

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_el')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 10425596
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_es

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_es')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 88199221
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_fi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_fi')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 8557453
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ga

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ga')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 83223
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_gom

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gom')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 640
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_hr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hr')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 582219
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_hy

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hy')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 659430
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ilo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ilo')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 2638
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ja

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ja')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 62721527
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_kk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_kk')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 524591
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_krc

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_krc')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 1581 г.
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ky

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ky')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 146993
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_li

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_li')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 137
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lt')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 2977757
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mhr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mhr')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 3212
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mn')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 395605
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mt

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mt')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 26598
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mzn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mzn')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 1055
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ne

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ne')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 299938
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_no

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_no')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 5546211
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_pa

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pa')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Версия : 1.0.0

  • Splits :

Расколоть Примеры
'train' 127467
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_pnb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pnb')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 4599
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_rm

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_rm')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 41
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sah

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sah')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 22301
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_si

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_si')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 203082
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sq

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sq')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 672077
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sw

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sw')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 41986
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_th

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_th')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 6064129
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_tt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tt')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 135923
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ur

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ur')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 638596
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_vo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_vo')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 3366
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_xal

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_xal')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 39
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_yue

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_yue')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 11
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_en

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_en')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 455994980
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_eu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_eu')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 506883
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_frr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_frr')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 7
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_gl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gl')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 544388
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_he

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_he')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 3808397
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ht

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ht')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 13
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_id

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_id')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 16236463
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_is

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_is')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 625673
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_jv

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_jv')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 1445
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_kn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_kn')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 350363
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_kv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_kv')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 1549
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lb')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 34807
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lo')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 52910
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mai

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mai')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 123
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mk')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 437871
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mrj

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mrj')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Расколы :

Расколоть Примеры
'train' 757
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_my

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_my')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 232329
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_nap

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nap')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 73
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_nl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nl')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 34682142
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_or

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_or')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 59463
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_pl

Используйте следующую команду, чтобы загрузить этот набор данных в TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pl')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 35440972
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_pt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pt')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 42114520
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ru

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ru')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 161836003
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sd

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sd')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 44280
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sl')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 1746604
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_su

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_su')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 805
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_te

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_te')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 475703
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_tl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tl')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 458206
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ug

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ug')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 22255
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_vec

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_vec')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 73
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_war

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_war')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 9760
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_yi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_yi')
  • Описание :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Расколоть Примеры
'train' 59364
  • Функции :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}