OSCAR

References:

unshuffled_deduplicated_af

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_af')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 130640
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
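
Every load command in this catalog follows the same naming pattern, so the dataset name can be built from the language code. The sketch below assumes that pattern holds for each listed configuration; `oscar_config_name` and `load_oscar` are hypothetical helper names, not part of TFDS, and `load_oscar` needs `tensorflow_datasets` installed plus network access.

```python
def oscar_config_name(lang_code: str) -> str:
    """Build the TFDS name for an unshuffled, deduplicated OSCAR config."""
    lang_code = lang_code.strip()
    if not lang_code:
        raise ValueError("lang_code must be a non-empty language code, e.g. 'af'")
    return f"huggingface:oscar/unshuffled_deduplicated_{lang_code}"


def load_oscar(lang_code: str, split: str = "train"):
    """Load one OSCAR config through TFDS (hypothetical convenience wrapper)."""
    # Imported lazily so the name helper works without TensorFlow installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_config_name(lang_code), split=split)


print(oscar_config_name("af"))
```

Calling `load_oscar("af")` would then be equivalent to the `tfds.load` command shown for this configuration, restricted to the 'train' split.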

unshuffled_deduplicated_als

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_als')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 4518
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
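
The feature specification above describes each record as an integer `id` plus a `text` string. As a rough illustration, the sketch below parses that specification and checks a record against it. It assumes records are plain Python dicts; records returned by `tfds.load` are tensors and would need converting first. The `DTYPE_TO_PYTHON` mapping and the sample record are assumptions made for the example.

```python
import json

# Feature specification copied from the listing above.
FEATURES_JSON = """
{
    "id": {"dtype": "int64", "id": null, "_type": "Value"},
    "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Assumed mapping from the dtype strings in the spec to Python types.
DTYPE_TO_PYTHON = {"int64": int, "string": str}


def matches_features(record: dict, features: dict) -> bool:
    """Return True if the record has exactly the listed fields with matching types."""
    if set(record) != set(features):
        return False
    return all(
        isinstance(record[name], DTYPE_TO_PYTHON[spec["dtype"]])
        for name, spec in features.items()
    )


features = json.loads(FEATURES_JSON)
sample = {"id": 0, "text": "hypothetical example text"}  # made-up record
print(matches_features(sample, features))
```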

unshuffled_deduplicated_arz

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_arz')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 79928
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_an

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_an')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 2025
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ast

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ast')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 5343
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ba

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ba')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 27050
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_am

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_am')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 43102
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_as

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_as')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 9212
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_azb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_azb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 9985
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_be

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_be')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 307405
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 15762
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bxr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bxr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 36
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ceb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ceb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 26145
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_az

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_az')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 626796
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bcl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bcl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 1
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_cy

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cy')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 98225
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_dsb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dsb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 37
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bn')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 1114481
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bs

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bs')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Wersja : 1.0.0

  • Podziały :

Podział Przykłady
'train' 702
  • Cechy :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
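
Every entry on this page uses the same naming pattern in its load command, with only the language code changing. A small sketch, assuming that pattern holds for each config of interest, builds the name strings programmatically. The function `oscar_tfds_name` is a hypothetical helper, not a TFDS API.

```python
# Prefix and config pattern taken from the load commands on this page.
TFDS_PREFIX = "huggingface:oscar/"
CONFIG_PATTERN = "unshuffled_deduplicated_{lang}"


def oscar_tfds_name(lang):
    """Build the TFDS name for one deduplicated OSCAR language config."""
    return TFDS_PREFIX + CONFIG_PATTERN.format(lang=lang)


# Language codes that appear as configs on this page.
for lang in ["bs", "ce", "cv"]:
    print(oscar_tfds_name(lang))
```

Each resulting string can be passed to `tfds.load` in place of the literal name shown in the command above.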

unshuffled_deduplicated_ce

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ce')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 2984
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_cv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 10130
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_diq

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_diq')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 1
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
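
Split sizes on this page differ by several orders of magnitude: the diq config lists a single example, while et lists over a million. The sketch below uses only counts printed in the split tables on this page and flags configs that may be too small to be useful. The threshold of 1000 examples and the helper `small_configs` are hypothetical choices for illustration.

```python
# 'train' example counts copied from the split tables on this page.
TRAIN_EXAMPLES = {
    "bs": 702,
    "ce": 2984,
    "cv": 10130,
    "diq": 1,
    "eml": 80,
    "et": 1172041,
}

# Hypothetical cutoff; choose a value that suits the task at hand.
MIN_EXAMPLES = 1000


def small_configs(counts, minimum):
    """Return language codes with fewer than `minimum` examples, smallest first."""
    small = [lang for lang, n in counts.items() if n < minimum]
    return sorted(small, key=counts.get)


print(small_configs(TRAIN_EXAMPLES, MIN_EXAMPLES))
```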

unshuffled_deduplicated_eml

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eml')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 80
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_et

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_et')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 1172041
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bg

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bg')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 3398679
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bpy

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bpy')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 1770
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ca

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ca')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 2458067
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ckb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ckb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 68210
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ar

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ar')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 9006977
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_av

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_av')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 360
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bar

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bar')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 4
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bh

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bh')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 82
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_br

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_br')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 14724
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_cbk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cbk')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 1
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_da

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_da')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 4771098
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_dv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 17024
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_eo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 84752
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
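
The load commands on this page differ only in the trailing language code. A minimal sketch of that naming pattern follows; `oscar_config_name` and `load_oscar` are hypothetical helpers, and the sketch assumes `tensorflow_datasets` is installed with the `huggingface:` community namespace available, as the commands above imply.

```python
def oscar_config_name(lang_code):
    """Build the TFDS name of a deduplicated OSCAR config, e.g. for 'eo'."""
    return f"huggingface:oscar/unshuffled_deduplicated_{lang_code.lower()}"


def load_oscar(lang_code, split="train"):
    """Load one OSCAR config through TFDS (needs network access on first use)."""
    # Imported lazily so the name helper works without TensorFlow Datasets.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_config_name(lang_code), split=split)


print(oscar_config_name("eo"))  # huggingface:oscar/unshuffled_deduplicated_eo
```

Calling `load_oscar("eo")` would then be equivalent to the command shown for this configuration, restricted to the 'train' split.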

unshuffled_deduplicated_fa

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 8203495
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_fy

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fy')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 20661
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_gn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 68
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_cs

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cs')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 12308039
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_hi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1909387
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_hu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 6582908
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ie

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ie')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 11
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_fr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 59448891
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
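
The 'train' splits listed on this page range from 11 examples (ie) to 59448891 (fr), so a quick experiment may want only a prefix of the larger ones. The sketch below builds a TFDS-style slice string such as `train[:10000]` that never asks for more examples than a configuration has. `capped_split` and the `cap` default are hypothetical; the sketch also assumes that the split-slicing syntax of `tfds.load` applies to these community configurations, which this page does not state.

```python
# Example counts for the 'train' split, copied from the listings above.
TRAIN_EXAMPLES = {"ie": 11, "gn": 68, "eo": 84752, "fr": 59448891}


def capped_split(lang_code, cap=10_000):
    """Return a split string selecting at most `cap` examples from 'train'."""
    n = min(TRAIN_EXAMPLES[lang_code], cap)
    return f"train[:{n}]"


print(capped_split("fr"))  # train[:10000]
print(capped_split("ie"))  # train[:11]
```

The resulting string would be passed as the `split` argument of `tfds.load`.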

unshuffled_deduplicated_gd

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gd')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3883
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_gu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 169834
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_hsb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hsb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3084
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ia

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ia')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 529
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_io

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_io')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 617
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_jbo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_jbo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 617
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_km

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_km')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 108346
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ku

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ku')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 29054
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_la

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_la')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 18808
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lmo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lmo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1374
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 843195
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_min

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_min')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 166
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 212556
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mwl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mwl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_nah

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nah')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 58
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_new

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_new')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2126
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_oc

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_oc')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 6485
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pam

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pam')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ps

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ps')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 67921
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_it

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_it')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 28522082
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ka

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ka')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 372158
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ro

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ro')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 5044757
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_scn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_scn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 17
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ko

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ko')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3675420
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_kw

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kw')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 68
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lez

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lez')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1381
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lrc

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lrc')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 72
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mg

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mg')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 13343
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ml

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ml')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 453904
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
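
The feature description above declares two fields per example: an `int64` field `id` and a `string` field `text`. As a rough illustration, the sketch below checks a plain Python record against that description. The function `matches_features` and the dtype-to-type mapping are assumptions made for this example; examples loaded through TFDS arrive as tensors (with text as bytes), so they would need to be converted to Python values first.

```python
# Feature description copied from this page (JSON null becomes None).
FEATURES = {
    "id": {"dtype": "int64", "id": None, "_type": "Value"},
    "text": {"dtype": "string", "id": None, "_type": "Value"},
}

# Assumed mapping from the dtype names above to Python types.
_PY_TYPES = {"int64": int, "string": str}


def matches_features(record: dict, features: dict = FEATURES) -> bool:
    """Return True if record has exactly the declared fields with matching types."""
    if set(record) != set(features):
        return False
    for name, spec in features.items():
        value = record[name]
        expected = _PY_TYPES[spec["dtype"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return False
        # int64 values must fit in a signed 64-bit range.
        if spec["dtype"] == "int64" and not (-2**63 <= value < 2**63):
            return False
    return True


if __name__ == "__main__":
    print(matches_features({"id": 0, "text": "example"}))
    print(matches_features({"id": "0", "text": "example"}))
```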

unshuffled_deduplicated_ms

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ms')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 183443
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_myv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_myv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 5
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_nds

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nds')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 8714
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_nn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 109118
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_os

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_os')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2559
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pms

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pms')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2859
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_qu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_qu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 411
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sa

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7121
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2820821
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sh

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sh')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 17610
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_so

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_so')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 42
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 645747
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ta

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ta')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 833101
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_tk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 4694
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_tyv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tyv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 24
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_uz

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_uz')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 15074
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_wa

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_wa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 677
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_xmf

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_xmf')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2418
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 11014487
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
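
The Features block above describes each example as an integer "id" plus a "text" string. As a minimal sketch (not part of TFDS or the OSCAR tooling), the check below tests whether a record fits that shape. The names FEATURES and matches_features are hypothetical, and the sketch assumes records have already been converted to plain Python values.

```python
# Hypothetical helper: FEATURES mirrors the Features block shown above.
FEATURES = {
    "id": {"dtype": "int64", "id": None, "_type": "Value"},
    "text": {"dtype": "string", "id": None, "_type": "Value"},
}

# Range of a signed 64-bit integer.
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def matches_features(record):
    """Return True if `record` has exactly the FEATURES keys with compatible values."""
    if set(record) != set(FEATURES):
        return False
    for name, spec in FEATURES.items():
        value = record[name]
        if spec["dtype"] == "int64":
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, int):
                return False
            if not INT64_MIN <= value <= INT64_MAX:
                return False
        elif spec["dtype"] == "string":
            if not isinstance(value, str):
                return False
    return True
```

A check like this can be run on a handful of examples before a longer processing job, to catch records whose fields were dropped or mistyped along the way.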

unshuffled_deduplicated_tg

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tg')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 56259
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_de

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_de')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 62398034
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_tr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 11596446
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_el

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_el')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 6521169
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_uk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_uk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7782375
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_vi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 9897709
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_wuu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_wuu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 64
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_yo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 49
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_als

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_als')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7324
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
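
The entries in this listing appear to follow one naming pattern: an ordering prefix, a variant ("deduplicated" or "original"), and a language code, all joined under the 'huggingface:oscar/' prefix passed to tfds.load. The sketch below builds and parses such names. The functions tfds_name and parse_config and the OscarConfig tuple are hypothetical helpers written for illustration, and the three-part pattern is an assumption inferred from the names on this page.

```python
from typing import NamedTuple


class OscarConfig(NamedTuple):
    """Parts of a configuration name such as 'unshuffled_original_als'."""
    ordering: str
    variant: str
    language: str


# Variants that occur in the configuration names on this page.
KNOWN_VARIANTS = {"deduplicated", "original"}


def tfds_name(config):
    """Build the string passed to tfds.load for a given configuration name."""
    return f"huggingface:oscar/{config}"


def parse_config(config):
    """Split a configuration name into ordering, variant, and language code."""
    parts = config.split("_", 2)
    if len(parts) != 3 or parts[1] not in KNOWN_VARIANTS or not parts[2]:
        raise ValueError(f"unexpected configuration name: {config!r}")
    return OscarConfig(*parts)
```

For example, parse_config('unshuffled_original_als') yields the language code 'als', and tfds_name('unshuffled_original_als') reproduces the string used in the load command above.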

unshuffled_original_arz

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_arz')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 158113
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_az

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_az')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 912330
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bcl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bcl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1675515
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bs

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bs')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2143
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ce

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ce')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 4042
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_cv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_cv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 20281
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_diq

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_diq')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_eml

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_eml')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 84
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_et

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_et')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2093621
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
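
The configuration names on this page follow a visible pattern: `unshuffled_original_<lang>` or `unshuffled_deduplicated_<lang>`, prefixed with `huggingface:oscar/` in the `tfds.load` call. The sketch below builds that string from a language code. The helper `oscar_tfds_name` is hypothetical and not part of TFDS; it only mirrors the naming seen in the commands listed here.

```python
def oscar_tfds_name(lang: str, deduplicated: bool = False) -> str:
    """Build a TFDS dataset name following the pattern used on this page.

    `lang` is the language code that ends the config name (e.g. "et").
    This helper is illustrative only; it does not check that the
    resulting config actually exists.
    """
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{lang}"


if __name__ == "__main__":
    # Names that could then be passed to tfds.load(...)
    print(oscar_tfds_name("et"))
    print(oscar_tfds_name("zh", deduplicated=True))
```

The returned string is what the load commands on this page pass to `tfds.load`.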

unshuffled_deduplicated_zh

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_zh')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 41708901
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
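
When records are converted to NumPy or plain Python values, a `string` feature such as `text` commonly arrives as UTF-8 encoded `bytes` rather than `str`, which matters for non-ASCII corpora like this one. The following is a minimal sketch of normalizing one record. `decode_record` is a hypothetical helper, and it assumes each record is a dict with the `id` and `text` keys listed under Features.

```python
def decode_record(record: dict) -> dict:
    """Return a copy of `record` with `id` as int and `text` as str.

    Assumes `text` is either already a str or UTF-8 encoded bytes.
    """
    text = record["text"]
    if isinstance(text, bytes):
        text = text.decode("utf-8")
    return {"id": int(record["id"]), "text": text}


if __name__ == "__main__":
    raw = {"id": 7, "text": "中文".encode("utf-8")}
    print(decode_record(raw))
```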

unshuffled_original_an

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_an')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2449
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
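
The Features block above is plain JSON, so it can be parsed and used to sanity-check records. The sketch below does that with the standard library only. The mapping from the `dtype` strings to Python types (`PYTHON_TYPES`) and the helper `matches_schema` are assumptions made for illustration, not part of TFDS or the Hugging Face `datasets` library.

```python
import json

# The feature description as printed on this page.
FEATURES_JSON = """
{
    "id": {"dtype": "int64", "id": null, "_type": "Value"},
    "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Assumed mapping from dtype strings to Python types (illustrative only).
PYTHON_TYPES = {"int64": int, "string": str}


def matches_schema(record: dict, features: dict) -> bool:
    """Check that `record` has exactly the listed fields with matching types."""
    if set(record) != set(features):
        return False
    return all(
        isinstance(record[name], PYTHON_TYPES[spec["dtype"]])
        for name, spec in features.items()
    )


features = json.loads(FEATURES_JSON)

if __name__ == "__main__":
    print(matches_schema({"id": 0, "text": "example"}, features))
```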

unshuffled_original_ast

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ast')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 6999
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ba

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ba')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 42551
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bg

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bg')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 5869686
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bpy

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bpy')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 6046
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ca

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ca')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 4390754
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ckb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ckb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 103639
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_es

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_es')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 56326016
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_da

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_da')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7664010
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_dv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_dv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 21018
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_eo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_eo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 121168
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_fi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 5326443
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ga

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ga')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 46493
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_gom

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gom')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 484
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_hr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 321484
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_hy

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hy')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 396093
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ilo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ilo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1578
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_fa

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_fa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 13704702
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
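
The load commands on this page all follow one naming pattern: `huggingface:oscar/unshuffled_`, then `original` or `deduplicated`, then a language code. A minimal sketch of building that name programmatically is shown below. The helper names `oscar_tfds_name` and `load_oscar` are hypothetical, and the sketch assumes `tensorflow_datasets` is installed with Hugging Face dataset support available when `load_oscar` is actually called.

```python
def oscar_tfds_name(language, deduplicated=False):
    """Build the TFDS name for an OSCAR configuration (hypothetical helper)."""
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{language}"


def load_oscar(language, deduplicated=False, split="train"):
    """Load one OSCAR configuration through tfds.load (hypothetical helper)."""
    # Imported lazily so the name helper works without TensorFlow Datasets installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(language, deduplicated), split=split)


print(oscar_tfds_name("fa"))
```

Only the 'train' split is listed for these configurations, which is why the sketch defaults to it.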

unshuffled_original_fy

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_fy')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 33053
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_gn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 106
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_hi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3264660
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_hu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 11197780
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ie

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ie')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 101
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ja

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ja')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 39496439
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_kk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 338073
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_krc

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_krc')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1377
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ky

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ky')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 86561
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_li

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_li')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 118
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1737411
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mhr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mhr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2515
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 197878
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 16383
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mzn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mzn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 917
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ne

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ne')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 219334
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_no

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_no')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3229940
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pa

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 87235
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pnb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pnb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3463
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_rm

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_rm')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 34
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
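
The feature specification above describes each record as an integer `id` and a string `text`. As an illustration only, a plain-Python check of that shape might look like the following. It applies to records already converted to plain Python values, not to tensors. `FEATURES` copies the two dtypes from the listing, while `matches_features` and the sample record are hypothetical.

```python
# Expected Python types for the two listed features (int64 -> int, string -> str).
FEATURES = {"id": int, "text": str}


def matches_features(record: dict) -> bool:
    """Return True if the record has exactly the listed keys with the expected types."""
    if set(record) != set(FEATURES):
        return False
    # bool is a subclass of int in Python, so reject bool values explicitly.
    return all(
        isinstance(record[key], expected) and not isinstance(record[key], bool)
        for key, expected in FEATURES.items()
    )


sample = {"id": 0, "text": "placeholder text"}  # made-up record, not taken from the corpus
print(matches_features(sample))
```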

unshuffled_deduplicated_sah

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sah')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 8555
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_si

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_si')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 120684
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sq

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sq')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 461598
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sw

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sw')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 24803
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_th

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_th')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3749826
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_tt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 82738
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ur

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ur')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 428674
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_vo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3317
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_xal

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_xal')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 36
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_yue

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yue')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_am

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_am')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 83663
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
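
The config names in this catalog appear to follow a `<shuffling>_<variant>_<language code>` pattern, for example `unshuffled_deduplicated_yue` and `unshuffled_original_am`. The following is a small sketch of splitting a name into those parts, assuming that pattern holds for every config. `OscarConfig`, `parse_oscar_config`, and the field names are hypothetical.

```python
from typing import NamedTuple


class OscarConfig(NamedTuple):
    shuffling: str  # e.g. "unshuffled"
    variant: str    # e.g. "deduplicated" or "original"
    language: str   # language code, e.g. "am"


def parse_oscar_config(name: str) -> OscarConfig:
    """Split a config name of the assumed form <shuffling>_<variant>_<language>."""
    parts = name.split("_", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"unexpected config name: {name!r}")
    return OscarConfig(*parts)


print(parse_oscar_config("unshuffled_original_am"))
```

Parsing the name this way makes it straightforward to group configs by language or to pair a deduplicated config with its original counterpart.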

unshuffled_original_as

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_as')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 14985
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_azb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_azb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 15446
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_be

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_be')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 586031
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 26795
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bxr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bxr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 42
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ceb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ceb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 56248
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_cy

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_cy')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 157698
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_dsb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_dsb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 65
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
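
The configuration names on this page appear to follow a three-part pattern separated by underscores, for example `unshuffled_original_dsb` here and `unshuffled_deduplicated_af` earlier. A minimal sketch for splitting such a name is shown below. The `OscarConfig` type, its field names, and the `parse_config` helper are illustrative assumptions, not TFDS API.

```python
from typing import NamedTuple


class OscarConfig(NamedTuple):
    shuffling: str  # e.g. "unshuffled"
    variant: str    # e.g. "original" or "deduplicated"
    language: str   # trailing code, e.g. "dsb"


def parse_config(name: str) -> OscarConfig:
    """Split a configuration name into its three underscore-separated parts."""
    parts = name.split("_", 2)
    if len(parts) != 3:
        raise ValueError(f"unexpected configuration name: {name!r}")
    return OscarConfig(*parts)


cfg = parse_config("unshuffled_original_dsb")
print(cfg.variant, cfg.language)  # original dsb
```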

unshuffled_original_fr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_fr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 96742378
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
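
With 96,742,378 training examples, this configuration is expensive to iterate in full. The sketch below is one hedged way to look at only a few records. `tfds_name` and `preview` are hypothetical helper names, and the code assumes that `tensorflow_datasets` is installed and that the `huggingface:` namespace resolves as in the load command above.

```python
def tfds_name(config):
    """Build the TFDS name used on this page for an OSCAR configuration."""
    return f"huggingface:oscar/{config}"


def preview(config, n=3):
    """Return the first `n` examples of the 'train' split as a list.

    tensorflow_datasets is imported lazily so that building names
    does not require TensorFlow to be installed.
    """
    import tensorflow_datasets as tfds

    ds = tfds.load(tfds_name(config), split="train")
    # take(n) limits iteration to the first n examples.
    return list(tfds.as_numpy(ds.take(n)))


print(tfds_name("unshuffled_original_fr"))
# huggingface:oscar/unshuffled_original_fr
```

Note that `take` only bounds iteration; `tfds.load` may still download and prepare the whole configuration before the first example is returned.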

unshuffled_original_gd

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gd')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 5799
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_gu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 240691
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
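
When examples are read through `tfds.as_numpy`, string features such as `text` typically arrive as UTF-8 encoded `bytes` rather than `str`, which matters for configurations written in non-Latin scripts. The following sketch shows one way to normalise a single example; the `decode_example` helper is a hypothetical name, and the sample record is made up for illustration rather than taken from the corpus.

```python
def decode_example(example):
    """Convert one as_numpy-style example into plain Python values."""
    text = example["text"]
    if isinstance(text, bytes):
        # Assumes the corpus text is UTF-8 encoded.
        text = text.decode("utf-8")
    return {"id": int(example["id"]), "text": text}


# Made-up record in the shape described by the feature listing above.
raw = {"id": 7, "text": "café".encode("utf-8")}
print(decode_example(raw))  # {'id': 7, 'text': 'café'}
```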

unshuffled_original_hsb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hsb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7959
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ia

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ia')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1040
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_io

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_io')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 694
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_jbo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_jbo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 832
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_km

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_km')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 159363
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ku

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ku')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 46535
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_la

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_la')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 94588
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lmo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lmo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1401
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1593820
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_min

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_min')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 220
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 326804
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mwl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mwl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 8
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_nah

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nah')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 61
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_new

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_new')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 4696
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_oc

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_oc')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 10709
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_pam

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pam')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
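
The load commands in this catalog differ only in the configuration suffix (`unshuffled_original_pam`, `unshuffled_original_ps`, and so on). A small sketch of building that name programmatically; the helpers `oscar_tfds_name` and `load_oscar` are hypothetical conveniences, not part of TFDS, and the sketch assumes `tensorflow_datasets` is installed when `load_oscar` is actually called:

```python
# The two configuration prefixes that appear in this catalog.
OSCAR_VARIANTS = ("unshuffled_original", "unshuffled_deduplicated")


def oscar_tfds_name(lang: str, variant: str = "unshuffled_original") -> str:
    """Build the name string passed to tfds.load for one OSCAR configuration."""
    if variant not in OSCAR_VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    if not lang:
        raise ValueError("language code must be non-empty")
    return f"huggingface:oscar/{variant}_{lang}"


def load_oscar(lang: str, variant: str = "unshuffled_original", split: str = "train"):
    """Load one configuration; every configuration listed here has only a 'train' split."""
    # Imported lazily so the name helper works without TensorFlow installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(lang, variant), split=split)
```

For example, `oscar_tfds_name("pam")` yields the same string used in the load command of this section, and `load_oscar("pam")` would then request its `train` split.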

unshuffled_original_ps

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ps')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 98216
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ro

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ro')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 9387265
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_scn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_scn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 21
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 5492194
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1013619
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ta

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ta')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1263280
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_tk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 6456
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_tyv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tyv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 34
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_uz

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_uz')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 27537
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_wa

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_wa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1001
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_xmf

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_xmf')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3783
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_it

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_it')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 46981781
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ka

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ka')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 563916
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ko

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ko')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7345075
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_kw

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_kw')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 203
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lez

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lez')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1485
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lrc

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lrc')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 88
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mg

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mg')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 17957
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ml

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ml')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 603937
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ms

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ms')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 534016
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
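
The configuration names in this listing follow the pattern `unshuffled_<variant>_<language code>`, where the variant is either `original` or `deduplicated`. As a minimal sketch, the name string passed to `tfds.load` could be assembled from those parts as shown here. The helper `oscar_tfds_name` is hypothetical and exists only for this example; it does not check that a given language code is actually available.

```python
# Variants that appear in the configuration names on this page.
VARIANTS = ("original", "deduplicated")


def oscar_tfds_name(lang: str, variant: str = "original") -> str:
    """Build the TFDS name string for an OSCAR configuration (hypothetical helper)."""
    if variant not in VARIANTS:
        raise ValueError(f"variant must be one of {VARIANTS}, got {variant!r}")
    return f"huggingface:oscar/unshuffled_{variant}_{lang}"


print(oscar_tfds_name("ms"))
print(oscar_tfds_name("en", variant="deduplicated"))
```

The resulting string can then be passed to `tfds.load` exactly as in the commands listed for each configuration.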

unshuffled_original_myv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_myv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 6
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_nds

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nds')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 18174
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_nn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 185884
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_os

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_os')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 5213
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_pms

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pms')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3225
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_qu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_qu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 452
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sa

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 14291
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sh

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sh')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 36700
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_so

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_so')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 156
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 17395625
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_tg

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tg')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 89002
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_tr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 18535253
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_uk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_uk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 12973467
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_vi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_vi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 14898250
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_wuu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_wuu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 214
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_yo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_yo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 214
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_zh

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_zh')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 60137667
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
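
With about 60 million examples in the `train` split, loading this configuration in full may be impractical for a quick inspection. TFDS accepts slice expressions such as `train[:1000]` in the `split` argument of `tfds.load`. The sketch below builds such an expression and passes it along; `head_split` and `load_head` are hypothetical helper names, and whether slicing avoids downloading the full data for `huggingface:` community datasets is not confirmed by this page.

```python
def head_split(n_examples: int, split: str = "train") -> str:
    """Return a TFDS split expression selecting the first `n_examples` records."""
    if n_examples <= 0:
        raise ValueError("n_examples must be positive")
    return f"{split}[:{n_examples}]"


def load_head(name: str, n_examples: int):
    """Load only the first `n_examples` records of the train split (hypothetical helper)."""
    # Imported lazily so the string helper works without TensorFlow Datasets installed.
    import tensorflow_datasets as tfds

    return tfds.load(name, split=head_split(n_examples))


print(head_split(1000))
```

A call such as `load_head('huggingface:oscar/unshuffled_original_zh', 1000)` would then request only the first thousand records, assuming `tensorflow_datasets` is installed.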

unshuffled_deduplicated_en

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_en')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 304230423
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_eu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 256513
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
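
The feature listing above describes two scalar fields, an `int64` `id` and a `string` `text`. As a minimal sketch, assuming examples have already been decoded into plain Python values (the name `matches_features` and the range check are illustrative assumptions, not part of TFDS), a record can be checked against that listing like this:

```python
# Feature spec copied from the listing above (JSON null becomes None).
FEATURES = {
    "id": {"dtype": "int64", "id": None, "_type": "Value"},
    "text": {"dtype": "string", "id": None, "_type": "Value"},
}

# Bounds of a signed 64-bit integer, used for the "int64" dtype.
_INT64_MIN, _INT64_MAX = -(2**63), 2**63 - 1


def matches_features(example):
    """Return True if a decoded example has exactly the listed fields and types."""
    if set(example) != set(FEATURES):
        return False
    value = example["id"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    id_ok = (
        isinstance(value, int)
        and not isinstance(value, bool)
        and _INT64_MIN <= value <= _INT64_MAX
    )
    text_ok = isinstance(example["text"], str)
    return id_ok and text_ok
```

Records that are missing a field, carry an extra field, or hold the wrong Python type are rejected.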

unshuffled_deduplicated_frr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_frr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
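
The load commands in this listing differ only in the language suffix of the configuration name. The sketch below builds that name from a language code, assuming the `huggingface:oscar/unshuffled_deduplicated_<code>` pattern seen here holds for the configuration you want. `oscar_tfds_name` and `load_oscar` are hypothetical helper names; only `tfds.load` comes from TFDS itself.

```python
def oscar_tfds_name(lang_code):
    """Build the TFDS name for a deduplicated OSCAR config, e.g. 'frr'."""
    code = lang_code.strip()
    if not code:
        raise ValueError("lang_code must be a non-empty string")
    return f"huggingface:oscar/unshuffled_deduplicated_{code}"


def load_oscar(lang_code, split="train"):
    """Load one configuration; 'train' is the only split listed in this catalog."""
    # Imported lazily so that building names does not require TFDS.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(lang_code), split=split)
```

For example, `oscar_tfds_name("frr")` yields the same string as the load command shown for this configuration.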

unshuffled_deduplicated_gl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 284320
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_he

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_he')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2375030
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ht

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ht')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 9
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_id

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_id')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 9948521
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_is

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_is')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 389515
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_jv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_jv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1163
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_kn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 251064
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_kv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 924
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 21735
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 32652
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mai

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mai')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 25
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 299457
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mrj

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mrj')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 669
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_my

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_my')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 136639
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_nap

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nap')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 55
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_nl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 20812149
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_or

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_or')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 44230
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 20682611
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 26920397
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
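
All configurations share the same two features: an `int64` `id` and a `string` `text`. As a sketch only, the check below tests whether a plain Python dictionary matches that schema. It assumes records have already been converted to native Python types (with `text` decoded to `str`); the function name `matches_oscar_features` is hypothetical and not part of TFDS.

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def matches_oscar_features(record: dict) -> bool:
    """Return True if `record` has exactly an int64 `id` and a string `text`."""
    if set(record) != {"id", "text"}:
        return False
    record_id, text = record["id"], record["text"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(record_id, bool) or not isinstance(record_id, int):
        return False
    # The id must fit in a signed 64-bit integer, and the text must be a str.
    return INT64_MIN <= record_id <= INT64_MAX and isinstance(text, str)


print(matches_oscar_features({"id": 0, "text": "Olá, mundo"}))
```

A check like this can be useful before writing records back out in another format, where a missing or mistyped field would otherwise go unnoticed.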

unshuffled_deduplicated_ru

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ru')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 115954598
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sd

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sd')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 33925
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 886223
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_su

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_su')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 511
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_te

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_te')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 312644
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_tl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 294132
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ug

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ug')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 15503
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_vec

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vec')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 64
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_war

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_war')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 9161
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_yi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 32919
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_af

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_af')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 201117
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
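
The `original` and `deduplicated` variants of a language differ only in how many examples survive deduplication. As a small worked example, the sketch below computes the retained fraction for Afrikaans from the two `'train'` counts listed on this page (201117 for `unshuffled_original_af`, 130640 for `unshuffled_deduplicated_af`). The function name `retained_fraction` is hypothetical.

```python
def retained_fraction(original: int, deduplicated: int) -> float:
    """Fraction of the original examples that remain after deduplication."""
    if original <= 0:
        raise ValueError("original must be a positive example count")
    if not 0 <= deduplicated <= original:
        raise ValueError("deduplicated must be between 0 and original")
    return deduplicated / original


# 'train' example counts as listed on this page for Afrikaans.
ORIGINAL_AF = 201117
DEDUPLICATED_AF = 130640

fraction = retained_fraction(ORIGINAL_AF, DEDUPLICATED_AF)
print(f"{fraction:.1%} of the Afrikaans examples remain after deduplication")
```

The same calculation applies to any language for which both variants are listed.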

unshuffled_original_ar

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ar')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 16365602
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_av

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_av')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 456
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bar

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bar')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 4
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bh

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bh')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 336
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_br

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_br')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 37085
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_cbk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_cbk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_cs

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_cs')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 21001388
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_de

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_de')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 104913504
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
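
Every configuration in this listing is loaded with the same `tfds.load` pattern, differing only in the configuration name. The sketch below wraps that pattern in two small helpers. The helper names `oscar_tfds_name` and `load_oscar_train` are hypothetical, and the loader assumes that `tensorflow_datasets` and whatever it needs for the `huggingface:` namespace are installed. Note that this configuration lists over 100 million training examples, so an actual load is likely to be a large download.

```python
def oscar_tfds_name(config: str) -> str:
    """Build the TFDS name for an OSCAR configuration, as used in this listing."""
    return f"huggingface:oscar/{config}"


def load_oscar_train(config: str):
    """Load the 'train' split of one OSCAR configuration (hypothetical helper)."""
    # Imported lazily so the name helper above works without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(config), split="train")


# Building the name needs no download; only load_oscar_train touches the network.
print(oscar_tfds_name("unshuffled_original_de"))
```

Keeping the name construction separate from the load call makes it possible to iterate over many configuration names without triggering any downloads.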

unshuffled_original_el

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_el')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 10425596
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_es

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_es')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 88199221
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_fi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_fi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 8557453
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ga

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ga')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 83223
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_gom

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gom')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 640
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_hr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 582219
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_hy

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hy')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 659430
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ilo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ilo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2638
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ja

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ja')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 62721527
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_kk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_kk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 524591
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_krc

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_krc')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1581
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ky

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ky')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 146993
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_li

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_li')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 137
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2977757
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mhr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mhr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3212
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 395605
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 26598
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mzn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mzn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1055
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ne

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ne')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 299938
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_no

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_no')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 5546211
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
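
The load commands on this page all follow one naming pattern: `huggingface:oscar/unshuffled_`, then `original` or `deduplicated`, then the language code. The sketch below builds that name programmatically. The helper names `oscar_tfds_name` and `load_oscar` are hypothetical, and the sketch assumes `tensorflow_datasets` is installed when `load_oscar` is actually called.

```python
def oscar_tfds_name(language, deduplicated=False):
    """Build the TFDS name for one OSCAR configuration, e.g. language='no'."""
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{language}"


def load_oscar(language, deduplicated=False, split="train"):
    """Load one OSCAR configuration through tfds.load."""
    # Imported lazily so the name helper works without TensorFlow Datasets.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(language, deduplicated), split=split)
```

For example, `oscar_tfds_name("no")` yields the same string used in the load command for `unshuffled_original_no` above.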

unshuffled_original_pa

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 127467
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_pnb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pnb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 4599
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_rm

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_rm')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 41
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sah

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sah')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 22301
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_si

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_si')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 203082
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sq

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sq')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 672077
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sw

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sw')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 41986
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_th

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_th')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 6064129
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_tt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 135923
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ur

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ur')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 638596
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_vo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_vo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3366
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_xal

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_xal')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 39
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_yue

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_yue')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 11
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_en

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_en')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 455994980
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
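
Split sizes on this page range from a handful of examples to several hundred million, so it can help to request only a slice of `train`. The sketch below builds a TFDS slice string such as `train[:10000]`, capped at the example count listed for the configuration. The helper `capped_split` and the `TRAIN_EXAMPLES` table are hypothetical names introduced here; the counts are copied from the listings on this page, and whether slicing avoids downloading the full configuration is not stated in this document.

```python
# 'train' example counts copied from the listings on this page.
TRAIN_EXAMPLES = {
    "unshuffled_original_en": 455994980,
    "unshuffled_original_no": 5546211,
    "unshuffled_original_yue": 11,
}


def capped_split(config, limit, split="train"):
    """Return a slice string like 'train[:N]' with N no larger than the split."""
    if limit <= 0:
        raise ValueError("limit must be a positive number of examples")
    n = min(limit, TRAIN_EXAMPLES[config])
    return f"{split}[:{n}]"
```

The resulting string can be passed as the `split` argument of `tfds.load`, for example `tfds.load('huggingface:oscar/unshuffled_original_en', split=capped_split('unshuffled_original_en', 10000))`.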

unshuffled_original_eu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_eu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 506883
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_frr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_frr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_gl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 544388
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_he

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_he')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3808397
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ht

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ht')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 13
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
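Every configuration on this page is loaded with a name of the form `huggingface:oscar/unshuffled_original_<language>` or `huggingface:oscar/unshuffled_deduplicated_<language>`. A minimal sketch of building those names for several language codes; the helper `oscar_tfds_name` is a hypothetical convenience function, not part of TFDS, and the `tfds.load` call is left commented out because it needs `tensorflow_datasets` installed and network access:

```python
def oscar_tfds_name(language, deduplicated=False):
    """Build the TFDS name for an OSCAR configuration listed on this page."""
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{language}"


# Language codes taken from the configuration headings on this page.
for code in ("ht", "id", "is"):
    name = oscar_tfds_name(code)
    print(name)
    # With tensorflow_datasets installed, the name can be passed to tfds.load:
    # import tensorflow_datasets as tfds
    # ds = tfds.load(name)
```

Building the name in one place avoids typos when looping over many languages.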

unshuffled_original_id

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_id')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 16236463
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_is

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_is')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 625673
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_jv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_jv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1445
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_kn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_kn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 350363
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_kv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_kv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1549
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 34807
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 52910
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mai

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mai')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 123
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 437871
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mrj

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mrj')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 757
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_my

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_my')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 232329
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_nap

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nap')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 73
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_nl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 34682142
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_or

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_or')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 59463
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_pl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 35440972
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_pt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 42114520
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ru

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ru')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 161836003
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sd

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sd')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 44280
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1746604
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
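
The feature specification above describes each example as an int64 `id` plus a string `text`. As a rough sketch only (the helper name `matches_features`, the dtype-to-Python mapping, and the sample record are hypothetical and not part of TFDS or OSCAR), a plain-Python check of a decoded record against that specification might look like this:

```python
import json

# Feature specification copied from the listing above.
FEATURES = json.loads("""
{
    "id": {"dtype": "int64", "id": null, "_type": "Value"},
    "text": {"dtype": "string", "id": null, "_type": "Value"}
}
""")

# Assumed mapping from the dtype names in the spec to Python types.
PY_TYPES = {"int64": int, "string": str}

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def matches_features(record, features=FEATURES):
    """Return True if `record` has exactly the declared keys and types."""
    if set(record) != set(features):
        return False
    for name, spec in features.items():
        value = record[name]
        expected = PY_TYPES[spec["dtype"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return False
        if spec["dtype"] == "int64" and not INT64_MIN <= value <= INT64_MAX:
            return False
    return True


# Hypothetical record, for illustration only (not taken from the corpus).
sample = {"id": 0, "text": "example text"}
print(matches_features(sample))
```

The check works on already-decoded Python values; records read through TFDS may arrive as tensors or bytes and would need converting first.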

unshuffled_original_su

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_su')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 805
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_te

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_te')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 475703
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_tl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 458206
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ug

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ug')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 22255
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_vec

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_vec')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 73
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_war

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_war')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 9760
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_yi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_yi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 59364
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
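
The `unshuffled_original_*` configurations listed above share one naming pattern and differ widely in size. The sketch below collects the 'train' counts for the eight configurations from `sl` through `yi` and rebuilds the name strings passed to `tfds.load`. The helper `tfds_name` and the dictionary `TRAIN_EXAMPLES` are hypothetical names introduced for illustration; the code does not call TFDS itself, so it runs without TensorFlow Datasets installed.

```python
# Number of 'train' examples per language code, copied from the listings above.
TRAIN_EXAMPLES = {
    "sl": 1746604,
    "su": 805,
    "te": 475703,
    "tl": 458206,
    "ug": 22255,
    "vec": 73,
    "war": 9760,
    "yi": 59364,
}


def tfds_name(lang, variant="unshuffled_original"):
    """Build the dataset name string used in the tfds.load calls above."""
    return f"huggingface:oscar/{variant}_{lang}"


# List configurations from the smallest to the largest train split.
for lang, count in sorted(TRAIN_EXAMPLES.items(), key=lambda kv: kv[1]):
    print(f"{tfds_name(lang):45s} {count:>9d}")
```

A name produced this way can be passed to `tfds.load` exactly as in the per-configuration commands, for example `tfds.load(tfds_name('vec'))` for the smallest configuration in this group.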