References:
unshuffled_deduplicated_af
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_af')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 130640 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
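As a minimal sketch of how the load command and the feature listing above fit together, the snippet below builds the TFDS name for a config and checks that a record carries the two listed fields (`id` as an integer, `text` as a string). The helper names `tfds_name`, `load_oscar_config`, and `matches_schema` are hypothetical and not part of TFDS. `load_oscar_config` assumes `tensorflow_datasets` is installed and the dataset can be downloaded, and `matches_schema` assumes records have already been decoded to plain Python values.

```python
from typing import Any, Mapping

# Feature listing from above, mapped to plain Python types.
# Assumption: int64 decodes to int and string decodes to str.
SCHEMA = {"id": int, "text": str}


def tfds_name(config: str) -> str:
    """Build the name used in the load command, e.g. for 'unshuffled_deduplicated_af'."""
    return f"huggingface:oscar/{config}"


def load_oscar_config(config: str, split: str = "train"):
    """Load one config through TFDS (hypothetical wrapper; needs network access)."""
    # Imported lazily so the other helpers work without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(tfds_name(config), split=split)


def matches_schema(record: Mapping[str, Any]) -> bool:
    """Return True if the record has exactly the SCHEMA fields with the expected types."""
    if set(record) != set(SCHEMA):
        return False
    # bool is a subclass of int in Python, so it is rejected explicitly.
    return all(
        isinstance(record[key], expected) and not isinstance(record[key], bool)
        for key, expected in SCHEMA.items()
    )
```

The same helpers apply to every config on this page, since each one lists the same two features and a single 'train' split.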
unshuffled_deduplicated_als
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_als')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 4518 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_arz
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_arz')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 79928 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_an
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_an')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2025 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ast
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ast')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 5343 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ba
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ba')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 27050 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_am
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_am')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 43102 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_as
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_as')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 9212 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_azb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_azb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 9985 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_be
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_be')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 307405 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 15762 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bxr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bxr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 36 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ceb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ceb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 26145 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_az
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_az')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 626796 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bcl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bcl')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_cy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cy')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 98225 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_dsb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dsb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 37 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1114481 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bs
Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bs')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 702 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
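The feature listing above declares two fields per record: an `int64` `id` and a `string` `text`. A minimal sketch of checking a plain Python record against that schema follows. `matches_oscar_features` is a hypothetical helper, and the sample records are invented for illustration, not taken from the corpus.

```python
# Feature schema as listed above: field name -> declared dtype.
OSCAR_FEATURES = {"id": "int64", "text": "string"}

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def matches_oscar_features(record: dict) -> bool:
    """Return True if record has exactly the listed fields with compatible types."""
    if set(record) != set(OSCAR_FEATURES):
        return False
    rec_id, text = record["id"], record["text"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    id_ok = (
        isinstance(rec_id, int)
        and not isinstance(rec_id, bool)
        and INT64_MIN <= rec_id <= INT64_MAX
    )
    return id_ok and isinstance(text, str)


# Invented sample records, for illustration only.
print(matches_oscar_features({"id": 0, "text": "primjer"}))    # True
print(matches_oscar_features({"id": "0", "text": "primjer"}))  # False
```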
unshuffled_deduplicated_ce
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ce')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2984 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_cv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 10130 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_diq
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_diq')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_eml
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eml')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 80 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_et
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_et')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1172041 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bg
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bg')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3398679 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bpy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bpy')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1770 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ca
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ca')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2458067 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ckb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ckb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 68210 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ar
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ar')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 9006977 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_av
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_av')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 360 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bar
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bar')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 4 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bh
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bh')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 82 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_br
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_br')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 14724 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_cbk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cbk')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_da
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_da')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 4771098 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_dv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 17024 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_eo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 84752 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
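Every configuration on this page is loaded with the same `tfds.load` call, differing only in the language code at the end of the name. A small sketch of building that name programmatically follows. The helper names `oscar_tfds_name` and `load_oscar` are hypothetical, and the naming pattern is inferred from the load commands listed here.

```python
def oscar_tfds_name(lang: str) -> str:
    """Build the TFDS name for a deduplicated OSCAR configuration, e.g. 'eo'."""
    return f"huggingface:oscar/unshuffled_deduplicated_{lang}"


def load_oscar(lang: str, split: str = "train"):
    """Load one configuration; requires tensorflow_datasets to be installed."""
    # Imported lazily so that oscar_tfds_name stays usable without TFDS.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(lang), split=split)
```

For example, `load_oscar("eo")` would issue the same call as the command shown for the Esperanto configuration, restricted to the `train` split.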
unshuffled_deduplicated_fa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fa')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 8203495 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_fy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fy')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 20661 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_gn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 68 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_cs
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cs')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 12308039 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_hi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hi')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1909387 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_hu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hu')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 6582908 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ie
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ie')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 11 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_fr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 59448891 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
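With 59,448,891 training examples, the French configuration is large enough that reading only a slice can be convenient while experimenting. The sketch below builds a split string in the TFDS slicing syntax (`'train[:N]'`). The helper `first_n_split` is hypothetical, and slicing only limits which examples are read; it should not be assumed to reduce what is downloaded or prepared.

```python
def first_n_split(n: int, split: str = "train") -> str:
    """Return a TFDS split string selecting the first `n` examples of `split`."""
    if n <= 0:
        raise ValueError("n must be positive")
    return f"{split}[:{n}]"


# Usage (requires tensorflow_datasets and the prepared dataset):
# import tensorflow_datasets as tfds
# ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fr',
#                split=first_n_split(1000))
```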
unshuffled_deduplicated_gd
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gd')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3883 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_gu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gu')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 169834 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_hsb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hsb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3084 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ia
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ia')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 529 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_io
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_io')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 617 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_jbo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_jbo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 617 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_km
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_km')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 108346 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ku
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ku')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 29054 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_la
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_la')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 18808 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lmo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lmo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1374 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 843195 |
- Caractéristiques :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
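The feature listing above describes each example as an `int64` `id` plus a `string` `text`. The following is a minimal sketch of checking a plain-Python record against that schema. The `PY_TYPES` mapping and the `matches_schema` helper are hypothetical names introduced for illustration and are not part of TFDS; records returned by TFDS itself are tensors or NumPy values and would need converting first.

```python
import json

# Feature schema copied from the listing above.
FEATURES_JSON = """
{
  "id": {"dtype": "int64", "id": null, "_type": "Value"},
  "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Hypothetical mapping from the schema's dtype names to Python types.
PY_TYPES = {"int64": int, "string": str}

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def matches_schema(record, features):
    """Return True if `record` has exactly the schema's fields with matching types."""
    if set(record) != set(features):
        return False
    for name, spec in features.items():
        value = record[name]
        expected = PY_TYPES[spec["dtype"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return False
        # Python ints are unbounded, so enforce the int64 range by hand.
        if spec["dtype"] == "int64" and not INT64_MIN <= value <= INT64_MAX:
            return False
    return True


features = json.loads(FEATURES_JSON)
print(matches_schema({"id": 0, "text": "Sveiki"}, features))    # True
print(matches_schema({"id": "0", "text": "Sveiki"}, features))  # False
```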
unshuffled_deduplicated_min
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_min')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 166 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 212556 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mwl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mwl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 7 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
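The load commands in this listing differ only in the language suffix of the config name. As a small sketch, the name can be built from a language code. `oscar_config_name` and `load_oscar` are hypothetical helpers, not TFDS functions; only `tfds.load` itself comes from TensorFlow Datasets, and it is imported lazily so the name helper runs without that package installed.

```python
# Common prefix of the config names shown in the load commands of this listing.
OSCAR_PREFIX = "huggingface:oscar/unshuffled_deduplicated_"


def oscar_config_name(lang_code):
    """Build the TFDS name of a deduplicated OSCAR config, e.g. for 'lv' or 'mwl'."""
    code = lang_code.strip().lower()
    if not code:
        raise ValueError("language code must not be empty")
    return OSCAR_PREFIX + code


def load_oscar(lang_code, split="train"):
    """Load one config; requires tensorflow_datasets and network access."""
    # Lazy import: keeps oscar_config_name usable without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_config_name(lang_code), split=split)


print(oscar_config_name("mwl"))
```

Calling `load_oscar("mwl")` would then be equivalent to the `tfds.load(...)` line shown for that config, restricted to the `train` split.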
unshuffled_deduplicated_nah
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nah')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 58 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_new
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_new')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 2126 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_oc
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_oc')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 6485 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pam
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pam')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ps
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ps')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 67921 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_it
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_it')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 28522082 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
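The train splits listed so far range from a single example (`pam`) to more than 28 million (`it`), so it can help to cap how much of a large config is read. The sketch below builds a split string using the TFDS slicing syntax (`'train[:N]'`). `TRAIN_SIZES` copies the counts from the tables above, and `capped_split` is a hypothetical helper. Whether slicing avoids downloading the full config for `huggingface:` datasets is not confirmed here and should be treated as an assumption.

```python
# Train-split sizes copied from the tables in this listing.
TRAIN_SIZES = {
    "lv": 843195,
    "min": 166,
    "mr": 212556,
    "mwl": 7,
    "nah": 58,
    "new": 2126,
    "oc": 6485,
    "pam": 1,
    "ps": 67921,
    "it": 28522082,
}


def capped_split(lang_code, max_examples, split="train"):
    """Return a split spec that reads at most `max_examples` examples."""
    if max_examples <= 0:
        raise ValueError("max_examples must be positive")
    size = TRAIN_SIZES[lang_code]
    # Small configs fit under the cap, so the whole split is requested.
    if size <= max_examples:
        return split
    # Otherwise request only the first `max_examples` examples.
    return f"{split}[:{max_examples}]"


print(capped_split("it", 100000))   # train[:100000]
print(capped_split("pam", 100000))  # train
```

The resulting string can be passed as the `split` argument of `tfds.load`, for example `tfds.load('huggingface:oscar/unshuffled_deduplicated_it', split=capped_split("it", 100000))`.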
unshuffled_deduplicated_ka
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ka')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 372158 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ro
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ro')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 5044757 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_scn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_scn')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 17 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ko
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ko')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 3675420 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_kw
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kw')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 68 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lez
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lez')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1381 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lrc
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lrc')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 72 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mg
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mg')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 13343 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ml
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ml')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 453904 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ms
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ms')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 183443 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_myv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_myv')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 5 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
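Every configuration on this page lists the same two features, an `int64` `id` and a `string` `text`. As a rough illustration, the sketch below parses that listing and checks a record against it. It assumes the record has already been converted to plain Python values (TFDS itself yields tensors or NumPy values); the dtype-to-Python mapping and the name `matches_features` are assumptions made for this example.

```python
import json

# The feature listing shown above, condensed onto fewer lines.
FEATURES_JSON = """
{
  "id": {"dtype": "int64", "id": null, "_type": "Value"},
  "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Assumed mapping from the listed dtype names to plain Python types.
PY_TYPES = {"int64": int, "string": str}


def matches_features(record: dict, features: dict) -> bool:
    """Return True if the record has exactly the listed fields with matching types."""
    if set(record) != set(features):
        return False
    return all(
        isinstance(record[name], PY_TYPES[spec["dtype"]])
        for name, spec in features.items()
    )


features = json.loads(FEATURES_JSON)
```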
unshuffled_deduplicated_nds
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nds')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 8714 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_nn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 109118 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_os
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_os')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2559 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pms
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pms')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2859 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_qu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_qu')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 411 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sa')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 7121 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sk')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2820821 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sh
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sh')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 17610 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_so
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_so')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 42 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 645747 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ta
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ta')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 833101 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tk')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 4694 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tyv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tyv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 24 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_uz
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_uz')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 15074 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_wa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_wa')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 677 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_xmf
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_xmf')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2418 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 11014487 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tg
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tg')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 56259 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_de
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_de')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 62398034 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
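With 62,398,034 training examples, loading the whole German configuration just to inspect a few records may be impractical. TFDS accepts slice expressions such as `train[:1000]` in the `split` argument of `tfds.load`. This page does not say whether slicing avoids the full download for `huggingface:` datasets, so the following is only a sketch; `head_split` and `peek` are hypothetical helper names.

```python
def head_split(n: int, split: str = "train") -> str:
    """Build a TFDS split expression selecting the first n examples."""
    if n <= 0:
        raise ValueError("n must be positive")
    return f"{split}[:{n}]"


def peek(tfds_name: str, n: int = 5):
    """Return the first n examples of a dataset as NumPy-backed dicts."""
    # Imported lazily; requires tensorflow_datasets and network access.
    import tensorflow_datasets as tfds

    ds = tfds.load(tfds_name, split=head_split(n))
    return list(tfds.as_numpy(ds))


# Example: peek('huggingface:oscar/unshuffled_deduplicated_de', 5)
```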
unshuffled_deduplicated_tr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 11596446 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
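Every config on this page shares the same two-field feature specification. As a rough illustration, a record can be checked against that specification as follows. This is a sketch only: `matches_features` and the dtype-to-check mapping are assumptions made for the example, and it expects records already decoded into plain Python values (TFDS itself yields tensors, with text as bytes).

```python
import json

# Feature specification copied from this page.
FEATURES_JSON = """
{
  "id": {"dtype": "int64", "id": null, "_type": "Value"},
  "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Hypothetical mapping from the dtype names above to Python-side checks.
_CHECKS = {
    "int64": lambda v: isinstance(v, int) and not isinstance(v, bool) and -2**63 <= v < 2**63,
    "string": lambda v: isinstance(v, str),
}


def matches_features(record: dict, features: dict) -> bool:
    """Return True if the record has exactly the declared fields with matching types."""
    if set(record) != set(features):
        return False
    return all(_CHECKS[spec["dtype"]](record[name]) for name, spec in features.items())


features = json.loads(FEATURES_JSON)
print(matches_features({"id": 0, "text": "Merhaba"}, features))
```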
unshuffled_deduplicated_el
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_el')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 6521169 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_uk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_uk')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 7782375 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_vi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vi')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 9897709 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_wuu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_wuu')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 64 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_yo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 49 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_als
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_als')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 7324 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_arz
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_arz')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 158113 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_az
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_az')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 912330 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bcl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bcl')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
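The 'train' split sizes on this page range from a single example to tens of millions, so it can help to flag configs that are too small for a given use before loading them. The sketch below uses a handful of counts copied from this page; `tiny_configs` and the 1000-example threshold are illustrative assumptions, not part of TFDS or OSCAR.

```python
# 'train' split sizes copied from the entries on this page (a partial list).
TRAIN_SIZES = {
    "unshuffled_deduplicated_de": 62398034,
    "unshuffled_deduplicated_tr": 11596446,
    "unshuffled_deduplicated_wuu": 64,
    "unshuffled_deduplicated_yo": 49,
    "unshuffled_original_als": 7324,
    "unshuffled_original_bcl": 1,
}


def tiny_configs(sizes: dict, min_examples: int = 1000) -> list:
    """Return, in alphabetical order, the configs with fewer than min_examples examples."""
    return sorted(name for name, count in sizes.items() if count < min_examples)


for name in tiny_configs(TRAIN_SIZES):
    print(name, TRAIN_SIZES[name])
```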
unshuffled_original_bn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1675515 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bs
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bs')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2143 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ce
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ce')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 4042 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_cv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_cv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 20281 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_diq
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_diq')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_eml
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_eml')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 84 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_et
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_et')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2093621 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_zh
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_zh')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 41708901 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_an
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_an')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2449 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ast
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ast')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 6999 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ba
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ba')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 42551 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
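The config names on this page appear to follow a common pattern: an ordering prefix, a variant (original or deduplicated) and a language code. The sketch below is only an illustration under that assumption; `OscarConfig`, `parse_config` and `tfds_name` are hypothetical helper names, not part of TFDS or OSCAR. It splits a config name into its parts and rebuilds the string passed to tfds.load.

```python
from typing import NamedTuple

# Assumption: config names follow "<order>_<variant>_<language>",
# as in the entries listed on this page.
KNOWN_VARIANTS = {"original", "deduplicated"}


class OscarConfig(NamedTuple):
    order: str     # e.g. "unshuffled"
    variant: str   # "original" or "deduplicated"
    language: str  # language code, e.g. "ba" or "bpy"


def parse_config(name: str) -> OscarConfig:
    """Split a config name such as 'unshuffled_original_ba' into its parts."""
    order, variant, language = name.split("_", 2)
    if variant not in KNOWN_VARIANTS:
        raise ValueError(f"unexpected variant in config name: {name!r}")
    return OscarConfig(order, variant, language)


def tfds_name(config: OscarConfig) -> str:
    """Build the dataset string used with tfds.load for this config."""
    return f"huggingface:oscar/{config.order}_{config.variant}_{config.language}"


if __name__ == "__main__":
    cfg = parse_config("unshuffled_original_ba")
    print(cfg.language, tfds_name(cfg))
```

Keeping the language code separate makes it easy to switch between the original and deduplicated variants of the same language without retyping the full name.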
unshuffled_original_bg
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bg')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 5869686 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
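The feature specification above declares two fields, an int64 "id" and a string "text". As a rough sketch of how such a spec could be used to sanity-check records, the snippet below parses the same JSON and tests plain Python dictionaries against it. The `PYTHON_TYPES` mapping and the `matches_features` helper are assumptions made for illustration, and the sample records are invented; examples returned by TFDS may need converting to plain Python values first.

```python
import json

# Feature spec copied from the entry above.
FEATURES_JSON = """
{
  "id": {"dtype": "int64", "id": null, "_type": "Value"},
  "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Assumption: these Python types stand in for the declared dtypes once an
# example has been converted to plain Python values.
PYTHON_TYPES = {"int64": int, "string": str}


def matches_features(record: dict, features: dict) -> bool:
    """Return True if the record has exactly the declared fields and types."""
    if set(record) != set(features):
        return False
    return all(
        isinstance(record[name], PYTHON_TYPES[spec["dtype"]])
        for name, spec in features.items()
    )


features = json.loads(FEATURES_JSON)

if __name__ == "__main__":
    # Hypothetical record, not taken from the corpus.
    print(matches_features({"id": 0, "text": "example"}, features))
```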
unshuffled_original_bpy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bpy')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 6046 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ca
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ca')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 4390754 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ckb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ckb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 103639 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_es
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_es')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 56326016 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_da
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_da')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 7664010 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_dv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_dv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 21018 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_eo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_eo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 121168 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_fi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fi')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 5326443 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ga
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ga')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 46493 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_gom
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gom')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 484 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_hr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 321484 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_hy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hy')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 396093 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ilo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ilo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1578 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_fa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_fa')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 13704702 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_fy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_fy')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 33053 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_gn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_gn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 106 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_hi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_hi')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3264660 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_hu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_hu')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 11197780 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ie
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ie')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 101 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
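As a minimal sketch of how a config such as this one might be read once loaded: the helper names `oscar_tfds_name` and `load_examples` are hypothetical, and the code assumes `tensorflow_datasets` is installed, the `huggingface:` namespace is available in your TFDS version, and network access is possible.

```python
def oscar_tfds_name(config: str) -> str:
    """Build the TFDS name for an OSCAR config (hypothetical helper)."""
    return f"huggingface:oscar/{config}"


def load_examples(config: str, limit: int = 3):
    """Yield up to `limit` examples as plain Python dicts.

    Assumption: requires tensorflow_datasets and network access.
    """
    # Imported lazily so the name helper works without TFDS installed.
    import tensorflow_datasets as tfds

    ds = tfds.load(oscar_tfds_name(config), split="train")
    for example in tfds.as_numpy(ds.take(limit)):
        # Per the feature spec: "id" is int64 and "text" is a string,
        # which TensorFlow returns as bytes.
        yield {"id": int(example["id"]), "text": example["text"].decode("utf-8")}


# Example usage (downloads data, so it is left commented out):
# for row in load_examples("unshuffled_original_ie"):
#     print(row["id"], row["text"][:80])
```

With only 101 examples in its 'train' split, this config is a convenient one for trying the loop before moving to larger languages.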
unshuffled_deduplicated_ja
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ja')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 39496439 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
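For a config of this size (about 39.5 million 'train' examples), it can help to read only a leading slice. The sketch below builds a TFDS split-slicing string; `head_split` is a hypothetical helper, and the underlying download and preparation may still cover the whole config.

```python
def head_split(n_examples: int, split: str = "train") -> str:
    """Return a TFDS slicing spec selecting the first `n_examples` rows."""
    if n_examples <= 0:
        raise ValueError("n_examples must be positive")
    return f"{split}[:{n_examples}]"


# Example usage (assumes tensorflow_datasets is installed; left commented
# out because it triggers a download):
# import tensorflow_datasets as tfds
# ds = tfds.load(
#     "huggingface:oscar/unshuffled_deduplicated_ja",
#     split=head_split(1000),
# )
```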
unshuffled_deduplicated_kk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kk')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 338073 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
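The feature description above is plain JSON, so it can be inspected with the standard library. The following sketch, with a hypothetical helper `field_dtypes`, extracts the dtype of each scalar field; it does not use any TFDS API.

```python
import json

# The feature spec as listed for each OSCAR config on this page.
FEATURES_JSON = """
{
  "id": {"dtype": "int64", "id": null, "_type": "Value"},
  "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""


def field_dtypes(features_json: str) -> dict:
    """Map each scalar ("Value") field name to its declared dtype."""
    spec = json.loads(features_json)
    return {
        name: field["dtype"]
        for name, field in spec.items()
        if field.get("_type") == "Value"
    }


print(field_dtypes(FEATURES_JSON))
```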
unshuffled_deduplicated_krc
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_krc')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1377 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ky
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ky')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 86561 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_li
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_li')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 118 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lt')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1737411 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mhr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mhr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2515 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 197878 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mt')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 16383 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mzn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mzn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 917 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ne
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ne')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 219334 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_no
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_no')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3229940 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pa')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 87235 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pnb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pnb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3463 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_rm
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_rm')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 34 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
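The split sizes listed so far vary by several orders of magnitude. As an illustration only, the sketch below records a handful of the 'train' counts from this page and picks the configs small enough for a quick smoke test; `TRAIN_EXAMPLES` and `configs_under` are hypothetical names, not part of TFDS.

```python
# 'train' example counts copied from the split tables on this page.
TRAIN_EXAMPLES = {
    "unshuffled_original_ie": 101,
    "unshuffled_deduplicated_ja": 39496439,
    "unshuffled_deduplicated_krc": 1377,
    "unshuffled_deduplicated_li": 118,
    "unshuffled_deduplicated_mzn": 917,
    "unshuffled_deduplicated_rm": 34,
}


def configs_under(limit: int, counts: dict = TRAIN_EXAMPLES) -> list:
    """Return config names with at most `limit` examples, smallest first."""
    small = (name for name, n in counts.items() if n <= limit)
    return sorted(small, key=counts.get)


print(configs_under(1000))
```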
unshuffled_deduplicated_sah
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sah')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 8555 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_si
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_si')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 120684 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sq
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sq')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 461598 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sw
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sw')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 24803 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_th
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_th')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3749826 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
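The feature listing above describes each record as an `int64` `id` plus a `string` `text`. As a minimal sketch, assuming a record has already been converted to plain Python values (for example after decoding bytes to `str`), the check below tests a dictionary against that two-field layout. The helper name `matches_oscar_features` is hypothetical and is not part of TFDS or the Hugging Face datasets library.

```python
# Hypothetical helper (not a TFDS API): validates a plain-Python record
# against the two features listed in this catalog: "id" (int64), "text" (string).
INT64_MIN = -(2 ** 63)
INT64_MAX = 2 ** 63 - 1


def matches_oscar_features(record):
    """Return True if `record` has exactly an int64-range `id` and a str `text`."""
    if not isinstance(record, dict) or set(record) != {"id", "text"}:
        return False
    rec_id = record["id"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(rec_id, bool) or not isinstance(rec_id, int):
        return False
    if not INT64_MIN <= rec_id <= INT64_MAX:
        return False
    return isinstance(record["text"], str)
```

A check like this is only a sanity test on already-decoded examples; it does not replace the feature metadata that the loader itself reports.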
unshuffled_deduplicated_tt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tt')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 82738 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ur
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ur')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 428674 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_vo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3317 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_xal
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_xal')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 36 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_yue
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yue')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 7 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
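The configuration names in this listing follow a visible pattern: a variant prefix (`unshuffled_deduplicated` or `unshuffled_original`), an underscore, and a language code, all under the `huggingface:oscar/` namespace. As a rough sketch of that pattern, the hypothetical helper `oscar_tfds_name` below builds the string passed to `tfds.load`. It is an illustration only, and it does not check that a given language code actually exists in the catalog.

```python
# Hypothetical helper (not part of TFDS): builds the dataset name string
# following the naming pattern seen in this catalog.
VARIANTS = ("unshuffled_deduplicated", "unshuffled_original")


def oscar_tfds_name(variant, lang):
    """Return e.g. 'huggingface:oscar/unshuffled_original_am'."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    if not lang:
        raise ValueError("language code must be non-empty")
    return f"huggingface:oscar/{variant}_{lang}"


# Usage sketch (needs tensorflow_datasets installed and network access):
#   import tensorflow_datasets as tfds
#   ds = tfds.load(oscar_tfds_name("unshuffled_original", "am"))
```

Keeping the variant list explicit makes a typo in the prefix fail early instead of producing a name the loader cannot resolve.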
unshuffled_original_am
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_am')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 83663 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_as
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_as')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 14985 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_azb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_azb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 15446 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_be
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_be')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 586031 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 26795 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bxr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bxr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 42 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ceb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ceb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 56248 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_cy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_cy')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 157698 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_dsb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_dsb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 65 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_fr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_fr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 96742378 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_gd
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_gd')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 5799 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_gu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_gu')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 240691 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_hsb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_hsb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 7959 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ia
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ia')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1040 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_io
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_io')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 694 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
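The configuration names in this listing appear to follow a fixed pattern (`unshuffled_original_<lang>` or `unshuffled_deduplicated_<lang>` under the `huggingface:oscar/` prefix). A minimal sketch of building the name from a language code is shown below; `oscar_tfds_name` and `load_oscar` are hypothetical helpers, and the pattern is inferred from the entries on this page rather than from any documented guarantee.

```python
def oscar_tfds_name(lang, deduplicated=False):
    """Build the TFDS name for an OSCAR config, following the pattern seen in this listing."""
    variant = "unshuffled_deduplicated" if deduplicated else "unshuffled_original"
    return f"huggingface:oscar/{variant}_{lang}"


def load_oscar(lang, deduplicated=False, split="train"):
    """Load one OSCAR config through TFDS (requires tensorflow_datasets to be installed)."""
    # Imported lazily so the name helper can be used without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(lang, deduplicated), split=split)


print(oscar_tfds_name("io"))
```

Only the `train` split is listed for these configurations, which is why it is used as the default here.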
unshuffled_original_jbo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_jbo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 832 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
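The split tables on this page share one simple pipe-delimited layout, so the example counts can be collected programmatically when comparing configurations. The sketch below is one possible way to do that; `parse_split_table` is a hypothetical helper, and it assumes each data row has the form `'name' | count |` as shown in the listing.

```python
def parse_split_table(lines):
    """Extract {split_name: example_count} from pipe-delimited table rows."""
    sizes = {}
    for line in lines:
        # Drop the trailing pipe, then split the row into its cells.
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) != 2:
            continue
        name, count = cells
        # Header and separator rows have no numeric count and are skipped.
        if count.isdigit():
            sizes[name.strip("'")] = int(count)
    return sizes


rows = ["---|---|", "'train' | 832 |"]
print(parse_split_table(rows))  # {'train': 832}
```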
unshuffled_original_km
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_km')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 159363 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ku
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ku')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 46535 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_la
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_la')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 94588 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lmo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lmo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1401 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1593820 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_min
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_min')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 220 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 326804 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mwl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mwl')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 8 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_nah
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_nah')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 61 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_new
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_new')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 4696 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_oc
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_oc')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 10709 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pam
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_pam')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ps
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ps')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 98216 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ro
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ro')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 9387265 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_scn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_scn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 21 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sk')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 5492194 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1013619 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ta
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ta')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1263280 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_tk')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 6456 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
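Every configuration on this page shares the same two-field feature schema: an `int64` `id` and a `string` `text`. As a rough sketch, the snippet below checks a plain Python dict against that schema. The `PY_TYPES` mapping and the `matches_features` helper are assumptions made for illustration; examples returned by TFDS are tensors rather than plain Python values, so this stands in for an already-decoded record.

```python
import json

# Feature schema as listed for each configuration on this page
FEATURES_JSON = """
{
  "id": {"dtype": "int64", "id": null, "_type": "Value"},
  "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Assumed mapping from schema dtype names to Python types (illustrative only)
PY_TYPES = {"int64": int, "string": str}


def matches_features(record: dict, features: dict) -> bool:
    """Return True if the record has exactly the schema's fields with matching types."""
    if set(record) != set(features):
        return False
    return all(
        isinstance(record[name], PY_TYPES[spec["dtype"]])
        for name, spec in features.items()
    )


features = json.loads(FEATURES_JSON)
print(matches_features({"id": 0, "text": "example"}, features))
print(matches_features({"id": "0", "text": "example"}, features))
```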
unshuffled_original_tyv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_tyv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 34 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_uz
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_uz')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 27537 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_wa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_wa')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1001 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_xmf
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_xmf')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3783 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_it
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_it')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 46981781 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ka
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ka')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 563916 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ko
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ko')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 7345075 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_kw
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_kw')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 203 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lez
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lez')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1485 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lrc
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lrc')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 88 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mg
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mg')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 17957 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ml
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ml')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 603937 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ms
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ms')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 534016 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_myv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_myv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 6 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_nds
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_nds')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 18174 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_nn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_nn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 185884 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_os
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_os')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 5213 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pms
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_pms')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3225 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_qu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_qu')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 452 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sa')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 14291 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
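The feature specification above describes each record as an integer `id` and a string `text`. As a minimal sketch, assuming records arrive as plain Python dictionaries, that shape can be checked with the standard library alone. The `FEATURES` constant copies the JSON above; `PYTHON_TYPES` and `matches_schema` are hypothetical helpers and are not part of TFDS or the Hugging Face `datasets` library.

```python
import json

# Feature specification copied from the catalog entry above.
FEATURES = json.loads("""
{
  "id": {"dtype": "int64", "id": null, "_type": "Value"},
  "text": {"dtype": "string", "id": null, "_type": "Value"}
}
""")

# Assumed mapping from the dtype names in the spec to Python types.
PYTHON_TYPES = {"int64": int, "string": str}


def matches_schema(record, features=FEATURES):
    """Return True if `record` has exactly the listed fields with matching types."""
    if set(record) != set(features):
        return False
    for name, spec in features.items():
        expected = PYTHON_TYPES[spec["dtype"]]
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return False
    return True


print(matches_schema({"id": 0, "text": "example"}))    # True
print(matches_schema({"id": "0", "text": "example"}))  # False
```

Records obtained through TFDS may expose tensors or byte strings rather than native Python values, so the type mapping would need adjusting in that setting.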
unshuffled_original_sh
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sh')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 36700 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_so
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_so')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 156 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 17395625 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tg
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_tg')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 89002 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_tr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 18535253 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_uk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_uk')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 12973467 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_vi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_vi')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 14898250 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_wuu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_wuu')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 214 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_yo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_yo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 214 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_zh
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_zh')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 60137667 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
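The entry names in this listing follow a visible pattern: `unshuffled_original_<code>` or `unshuffled_deduplicated_<code>`, prefixed with `huggingface:oscar/` in the load command. The sketch below builds that string from its parts. `oscar_tfds_name` is a hypothetical helper introduced for illustration; it does not check that a given language code actually exists in the corpus.

```python
def oscar_tfds_name(language_code, deduplicated=False):
    """Build the TFDS name for an OSCAR subset, following the catalog's naming pattern."""
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{language_code}"


print(oscar_tfds_name("zh"))                     # huggingface:oscar/unshuffled_original_zh
print(oscar_tfds_name("en", deduplicated=True))  # huggingface:oscar/unshuffled_deduplicated_en
```

The returned string can be passed to `tfds.load` in the same way as the literal names shown in each entry.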
unshuffled_deduplicated_en
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_en')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 304230423 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
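With 304,230,423 examples in its `train` split, this subset is expensive to iterate over in full. One option is the TFDS split-slicing syntax, which accepts strings such as `train[:1000]`. The sketch below rests on assumptions: `head_split` and `load_head` are hypothetical helpers, and `load_head` needs `tensorflow_datasets` installed with the `huggingface:` namespace used in this catalog available. Slicing limits the records that are read; preparing the dataset may still process the full data.

```python
def head_split(n_examples, split="train"):
    """Return a TFDS split string selecting the first `n_examples` of `split`."""
    if n_examples <= 0:
        raise ValueError("n_examples must be positive")
    return f"{split}[:{n_examples}]"


def load_head(name, n_examples):
    """Load only the first `n_examples` records (requires tensorflow_datasets)."""
    # Imported lazily so that head_split stays usable without the package.
    import tensorflow_datasets as tfds
    return tfds.load(name, split=head_split(n_examples))


print(head_split(1000))  # train[:1000]
# ds = load_head('huggingface:oscar/unshuffled_deduplicated_en', 1000)
```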
unshuffled_deduplicated_eu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eu')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 256513 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_frr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_frr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 7 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_gl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gl')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 284320 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_he
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_he')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2375030 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ht
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ht')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 9 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_id
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_id')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 9948521 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_is
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_is')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 389515 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_jv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_jv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1163 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_kn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 251064 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
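Every entry in this listing uses the same naming pattern, so the load command can be parameterized by language code. The sketch below is one way to do that; `oscar_tfds_name` and `load_oscar_train` are hypothetical helper names, and the loader assumes the `tensorflow-datasets` package is installed (loading `huggingface:` datasets may also require the Hugging Face `datasets` package).

```python
OSCAR_PREFIX = "huggingface:oscar/unshuffled_deduplicated_"


def oscar_tfds_name(lang_code):
    """Build the TFDS name for one deduplicated OSCAR language config."""
    if not lang_code.isalpha() or not lang_code.islower():
        raise ValueError(f"unexpected language code: {lang_code!r}")
    return OSCAR_PREFIX + lang_code


def load_oscar_train(lang_code):
    """Load the 'train' split for one language (assumes TFDS is installed)."""
    # Imported lazily so the name helper above works without TFDS.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(lang_code), split="train")


print(oscar_tfds_name("kn"))
```

Passing `split="train"` returns the single split directly rather than a dictionary of splits.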
unshuffled_deduplicated_kv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 924 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 21735 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 32652 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mai
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mai')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 25 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mk')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 299457 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mrj
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mrj')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 669 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_my
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_my')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 136639 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_nap
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nap')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 55 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_nl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nl')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 20812149 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_or
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_or')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 44230 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pl')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 20682611 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pt')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 26920397 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ru
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ru')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 115954598 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
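The train splits listed so far range from 25 examples (`mai`) to more than 115 million (`ru`), so it can help to cap how much of a large config is requested. The sketch below builds a TFDS split string using the split-slicing syntax (`train[:N]`); the helper name `split_spec` and the budget value are hypothetical, and whether slicing avoids preparing the full dataset for `huggingface:` configs is not stated in this listing.

```python
# Train-split sizes copied from the listings above.
TRAIN_EXAMPLES = {
    "mai": 25,
    "kn": 251064,
    "pl": 20682611,
    "nl": 20812149,
    "pt": 26920397,
    "ru": 115954598,
}


def split_spec(lang_code, max_examples):
    """Return a TFDS split string capped at `max_examples` (a hypothetical budget)."""
    total = TRAIN_EXAMPLES[lang_code]
    if total <= max_examples:
        # Small configs fit within the budget, so request the whole split.
        return "train"
    return f"train[:{max_examples}]"


for code in ("mai", "ru"):
    print(code, split_spec(code, 1_000_000))
```

The resulting string can be passed as the `split` argument of `tfds.load`.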
unshuffled_deduplicated_sd
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sd')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 33925 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sl')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 886223 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_su
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_su')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 511 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_te
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_te')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 312644 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tl')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 294132 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ug
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ug')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 15503 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_vec
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vec')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 64 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
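The load commands in this catalog all follow the same naming pattern, which can be sketched as a small helper. `oscar_tfds_name` and `load_oscar` are hypothetical names introduced here for illustration. The helper only accepts the two configuration prefixes that appear in this catalog, and the `split="train"` default is an assumption based on the split tables.

```python
# Hypothetical helpers built around the naming pattern used in this catalog:
#   huggingface:oscar/<variant>_<language code>
# Only the two variants that appear in this document are accepted.
KNOWN_VARIANTS = ("unshuffled_deduplicated", "unshuffled_original")


def oscar_tfds_name(variant: str, lang: str) -> str:
    """Build the dataset name passed to tfds.load for one OSCAR configuration."""
    if variant not in KNOWN_VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    if not lang:
        raise ValueError("language code must not be empty")
    return f"huggingface:oscar/{variant}_{lang}"


def load_oscar(variant: str, lang: str, split: str = "train"):
    """Load one configuration; requires tensorflow_datasets to be installed."""
    # Imported lazily so the name helper above works without TensorFlow.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(variant, lang), split=split)
```

For example, `oscar_tfds_name("unshuffled_deduplicated", "vec")` yields the same string as the load command shown for this configuration.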
unshuffled_deduplicated_war
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_war')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 9161 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_yi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yi')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 32919 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_af
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_af')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 201117 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
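The example counts in this catalog allow a rough comparison between the original and deduplicated configurations of the same language. The sketch below uses the two Afrikaans counts listed in this document (201117 for unshuffled_original_af, 130640 for unshuffled_deduplicated_af) and assumes the two configurations differ only by deduplication, which the catalog itself does not state. `removed_fraction` is a hypothetical helper name.

```python
# Hypothetical helper: fraction of examples absent from the deduplicated
# configuration, computed from the 'train' counts listed in this catalog.
def removed_fraction(original: int, deduplicated: int) -> float:
    """Return the share of examples absent from the deduplicated split."""
    if original <= 0 or not 0 <= deduplicated <= original:
        raise ValueError("counts must satisfy 0 <= deduplicated <= original, original > 0")
    return 1 - deduplicated / original


# Counts for Afrikaans ('af') as listed in this catalog.
AF_ORIGINAL = 201117
AF_DEDUPLICATED = 130640

print(f"{removed_fraction(AF_ORIGINAL, AF_DEDUPLICATED):.1%}")  # prints 35.0%
```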
unshuffled_original_ar
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ar')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 16365602 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
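The split tables in this catalog share one layout: a header row, a separator row, and one row per split. As a minimal sketch, assuming that layout, the hypothetical helper `parse_split_table` below turns such rows into a dictionary of example counts. It skips any row whose second cell is not a number, so the header and separator rows are ignored whatever their wording.

```python
# Hypothetical helper: parses the pipe-separated split tables used in this
# catalog. It is an illustration only, not part of TFDS.
def parse_split_table(lines):
    """Parse rows such as "'train' | 16365602 |" into {split_name: count}."""
    counts = {}
    for line in lines:
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 2 or not cells[1].isdigit():
            continue  # header, separator, or unrelated line
        counts[cells[0].strip("'")] = int(cells[1])
    return counts


table = [
    "Split | Examples |",
    "---|---|",
    "'train' | 16365602 |",
]
print(parse_split_table(table))  # prints {'train': 16365602}
```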
unshuffled_original_av
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_av')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 456 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bar
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bar')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 4 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bh
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bh')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 336 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_br
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_br')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 37085 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_cbk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_cbk')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_cs
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_cs')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 21001388 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_de
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_de')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 104913504 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_el
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_el')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 10425596 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_es
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_es')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 88199221 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_fi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_fi')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 8557453 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ga
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ga')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 83223 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_gom
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_gom')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 640 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_hr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_hr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 582219 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_hy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_hy')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 659430 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ilo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ilo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2638 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ja
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ja')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 62721527 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
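With 62,721,527 examples in the 'train' split, it may be convenient to read only a slice of this configuration while experimenting. The sketch below uses the TFDS split-slicing syntax (`train[:N%]`) through the `split` argument of `tfds.load`. The helper names `slice_spec` and `load_slice` are hypothetical, and it is an assumption that datasets exposed through the `huggingface:` prefix handle sliced splits the same way native TFDS datasets do.

```python
def slice_spec(percent):
    """Build a TFDS split string such as 'train[:1%]' from a whole-number percentage."""
    if isinstance(percent, bool) or not isinstance(percent, int):
        raise ValueError("percent must be an integer")
    if not 0 < percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    return f"train[:{percent}%]"


def load_slice(config, percent=1):
    """Load the first `percent` percent of the 'train' split of an OSCAR config."""
    # Imported lazily so slice_spec can be used without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(f"huggingface:oscar/{config}", split=slice_spec(percent))
```

For example, `load_slice('unshuffled_original_ja', percent=1)` would request roughly the first one percent of the split. Whether this avoids downloading the full corpus depends on how the underlying builder prepares the data.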
unshuffled_original_kk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_kk')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 524591 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_krc
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_krc')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1581 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ky
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ky')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 146993 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_li
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_li')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 137 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lt')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2977757 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mhr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mhr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3212 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 395605 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mt')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 26598 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mzn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mzn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1055 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ne
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ne')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 299938 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_no
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_no')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 5546211 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_pa')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 127467 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pnb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_pnb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 4599 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_rm
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_rm')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 41 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sah
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sah')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 22301 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_si
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_si')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 203082 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sq
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sq')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 672077 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sw
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sw')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 41986 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_th
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_th')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 6064129 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_tt')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 135923 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
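The load commands in this catalog all follow the same naming pattern, so a small helper can build the TFDS name from a config name. The sketch below is illustrative only: `oscar_tfds_name` and `load_oscar_split` are hypothetical helpers, not part of TFDS, and the loader assumes `tensorflow-datasets` and its Hugging Face dependencies are installed.

```python
def oscar_tfds_name(config: str) -> str:
    """Build the TFDS dataset name for an OSCAR config, following this catalog's pattern."""
    return f"huggingface:oscar/{config}"


def load_oscar_split(config: str, split: str = "train"):
    """Load one split through TFDS.

    The import is done lazily so that oscar_tfds_name stays usable
    on machines where tensorflow-datasets is not installed.
    """
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(config), split=split)


# Example usage (downloads data, so it is left commented out):
# ds = load_oscar_split("unshuffled_original_tt")
# for example in ds.take(1):
#     print(example["id"].numpy(), example["text"].numpy()[:80])
```

Passing `split="train"` returns a single dataset rather than a dictionary of splits, which is usually more convenient here because every config lists only a 'train' split.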
unshuffled_original_ur
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ur')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 638596 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
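Every config in this catalog declares the same two features, an `int64` `id` and a `string` `text`. As a rough sketch of how that specification could be used, the snippet below parses it and checks a plain Python record against it. The `DTYPE_TO_PYTHON` mapping and the `matches_features` helper are assumptions made for illustration; they are not part of TFDS or the Hugging Face `datasets` library.

```python
import json

# Feature specification copied from the entries in this catalog.
FEATURES_JSON = """
{
  "id": {"dtype": "int64", "id": null, "_type": "Value"},
  "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Assumed mapping from the dtype names in the spec to Python types.
DTYPE_TO_PYTHON = {"int64": int, "string": str}


def matches_features(record: dict, features: dict) -> bool:
    """Return True if `record` has exactly the declared fields with matching types."""
    if set(record) != set(features):
        return False
    return all(
        isinstance(record[name], DTYPE_TO_PYTHON[spec["dtype"]])
        for name, spec in features.items()
    )


features = json.loads(FEATURES_JSON)
```

A record such as `{"id": 0, "text": "..."}` passes the check, while a record with a missing field or a string `id` does not.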
unshuffled_original_vo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_vo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3366 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_xal
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_xal')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 39 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_yue
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_yue')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 11 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_en
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_en')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 455994980 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_eu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_eu')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 506883 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_frr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_frr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 7 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
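The 'train' split sizes in this catalog range from a handful of examples to hundreds of millions, so it can help to pick a tiny config for a quick smoke test before loading a large one. The sketch below uses only the counts listed for the `unshuffled_original_tt` through `unshuffled_original_frr` entries; the `configs_under` helper is a hypothetical name introduced for illustration.

```python
# 'train' split sizes copied from the catalog entries tt through frr.
TRAIN_EXAMPLES = {
    "unshuffled_original_tt": 135923,
    "unshuffled_original_ur": 638596,
    "unshuffled_original_vo": 3366,
    "unshuffled_original_xal": 39,
    "unshuffled_original_yue": 11,
    "unshuffled_original_en": 455994980,
    "unshuffled_original_eu": 506883,
    "unshuffled_original_frr": 7,
}


def configs_under(limit: int) -> list[str]:
    """Return configs whose 'train' split has fewer than `limit` examples, smallest first."""
    small = [name for name, count in TRAIN_EXAMPLES.items() if count < limit]
    return sorted(small, key=TRAIN_EXAMPLES.get)


print(configs_under(100))
```

With a limit of 100, this selects the `frr`, `yue`, and `xal` configs, in that order.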
unshuffled_original_gl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_gl')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 544388 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_he
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_he')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3808397 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ht
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ht')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 13 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_id
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_id')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 16236463 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_is
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_is')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 625673 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_jv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_jv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1445 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_kn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_kn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 350363 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_kv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_kv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1549 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 34807 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 52910 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mai
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mai')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 123 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mk')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 437871 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mrj
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mrj')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 757 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
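The load commands in this listing all follow the pattern `huggingface:oscar/<config>`. The sketch below builds that name from a configuration name and wraps the load call. `tfds_name` and `load_config` are hypothetical helpers introduced here for illustration. The sketch assumes `tensorflow_datasets` is installed with Hugging Face support, and it imports the library lazily so that the name helper can be used without it.

```python
def tfds_name(config):
    """Build the TFDS dataset name for an OSCAR configuration in this listing."""
    return f"huggingface:oscar/{config}"


def load_config(config, split="train"):
    """Load one OSCAR configuration; each config here lists a single 'train' split."""
    # Imported inside the function so the name helper works without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(tfds_name(config), split=split)
```

Calling `load_config("unshuffled_original_mrj")` would then be equivalent to the load command shown for that configuration, restricted to its `train` split.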
unshuffled_original_my
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_my')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 232329 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_nap
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_nap')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 73 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_nl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_nl')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 34682142 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_or
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_or')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 59463 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_pl')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 35440972 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_pt')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 42114520 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ru
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ru')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 161836003 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sd
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sd')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 44280 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sl')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1746604 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_su
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_su')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 805 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_te
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_te')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 475703 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_tl')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 458206 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ug
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ug')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 22255 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_vec
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_vec')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 73 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_war
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_war')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 9760 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_yi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_yi')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 59364 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}