JRC
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:large_spanish_corpus/JRC')
- Description:
The Large Spanish Corpus is a compilation of 15 unlabelled Spanish corpora, ranging from Wikipedia articles to European Parliament notes. Each config contains the data corresponding to a different corpus; for example, "all_wikis" includes only examples from Spanish Wikipedia. By default, the config is set to "combined", which loads all the corpora; with this setting you can also specify the number of samples to return per corpus via the "split" argument.
- License: MIT
- Version: 1.1.0
- Splits:
Split | Examples |
---|---|
'train' | 3410620 |
- Features:
{
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
EMEA
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:large_spanish_corpus/EMEA')
- License: MIT
- Version: 1.1.0
- Splits:
Split | Examples |
---|---|
'train' | 1221233 |
- Features:
{
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
GlobalVoices
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:large_spanish_corpus/GlobalVoices')
- License: MIT
- Version: 1.1.0
- Splits:
Split | Examples |
---|---|
'train' | 897075 |
- Features:
{
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
ECB
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:large_spanish_corpus/ECB')
- License: MIT
- Version: 1.1.0
- Splits:
Split | Examples |
---|---|
'train' | 1875738 |
- Features:
{
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
DOGC
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:large_spanish_corpus/DOGC')
- License: MIT
- Version: 1.1.0
- Splits:
Split | Examples |
---|---|
'train' | 10917053 |
- Features:
{
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
all_wikis
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:large_spanish_corpus/all_wikis')
- License: MIT
- Version: 1.1.0
- Splits:
Split | Examples |
---|---|
'train' | 28109484 |
- Features:
{
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
TED
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:large_spanish_corpus/TED')
- License: MIT
- Version: 1.1.0
- Splits:
Split | Examples |
---|---|
'train' | 157910 |
- Features:
{
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
multiUN
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:large_spanish_corpus/multiUN')
- License: MIT
- Version: 1.1.0
- Splits:
Split | Examples |
---|---|
'train' | 13127490 |
- Features:
{
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
Europarl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:large_spanish_corpus/Europarl')
- License: MIT
- Version: 1.1.0
- Splits:
Split | Examples |
---|---|
'train' | 2174141 |
- Features:
{
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
NewsCommentary11
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:large_spanish_corpus/NewsCommentary11')
- License: MIT
- Version: 1.1.0
- Splits:
Split | Examples |
---|---|
'train' | 288771 |
- Features:
{
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
UN
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:large_spanish_corpus/UN')
- License: MIT
- Version: 1.1.0
- Splits:
Split | Examples |
---|---|
'train' | 74067 |
- Features:
{
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
EUBookShop
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:large_spanish_corpus/EUBookShop')
- License: MIT
- Version: 1.1.0
- Splits:
Split | Examples |
---|---|
'train' | 8214959 |
- Features:
{
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
ParaCrawl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:large_spanish_corpus/ParaCrawl')
- License: MIT
- Version: 1.1.0
- Splits:
Split | Examples |
---|---|
'train' | 15510649 |
- Features:
{
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
OpenSubtitles2018
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:large_spanish_corpus/OpenSubtitles2018')
- License: MIT
- Version: 1.1.0
- Splits:
Split | Examples |
---|---|
'train' | 213508602 |
- Features:
{
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
DGT
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:large_spanish_corpus/DGT')
- License: MIT
- Version: 1.1.0
- Splits:
Split | Examples |
---|---|
'train' | 3168368 |
- Features:
{
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
combined
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:large_spanish_corpus/combined')
- License: MIT
- Version: 1.1.0
- Splits:
Split | Examples |
---|---|
'train' | 302656160 |
- Features:
{
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
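The dataset description notes that the "combined" config lets you control the number of samples returned per corpus through the "split" argument. A minimal sketch of building such a split spec with standard TFDS slicing syntax; the corpus subset, the sample count, and the assumption that "combined" exposes per-corpus splits named after the individual configs are illustrative, not confirmed by this page:

```python
# Build a TFDS split spec requesting a fixed number of samples per corpus.
# Assumption: the "combined" config exposes per-corpus splits named after
# the individual configs (e.g. 'JRC', 'EMEA'). The slicing syntax itself
# ('name[:n]', joined with '+') is standard TFDS.
corpora = ['JRC', 'EMEA', 'GlobalVoices']  # illustrative subset
n_per_corpus = 1000
split_spec = '+'.join(f'{name}[:{n_per_corpus}]' for name in corpora)
print(split_spec)  # JRC[:1000]+EMEA[:1000]+GlobalVoices[:1000]

# The actual load (requires the dataset to be downloadable):
# import tensorflow_datasets as tfds
# ds = tfds.load('huggingface:large_spanish_corpus/combined', split=split_spec)
```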