References:
all_languages
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/all_languages')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 1926192 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
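The field names in the schema above suggest that rows sharing a `paraphrase_set_id` belong to the same paraphrase set. A minimal sketch of regrouping rows into sets, assuming each example has already been converted to a plain Python dict with string values; the function name and the sample rows are hypothetical and are not taken from the corpus:

```python
from collections import defaultdict


def group_paraphrases(records):
    """Map each paraphrase_set_id to the sentences that share it."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["paraphrase_set_id"]].append(rec["paraphrase"])
    return dict(groups)


# Hypothetical rows shaped like the feature schema (illustrative only).
sample_rows = [
    {"paraphrase_set_id": "1", "sentence_id": "10", "paraphrase": "Hello.",
     "lists": [], "tags": [], "language": "en"},
    {"paraphrase_set_id": "1", "sentence_id": "11", "paraphrase": "Hi.",
     "lists": [], "tags": [], "language": "en"},
    {"paraphrase_set_id": "2", "sentence_id": "12", "paraphrase": "Thanks.",
     "lists": [], "tags": [], "language": "en"},
]

paraphrase_groups = group_paraphrases(sample_rows)
```

Sets with a single member carry no paraphrase information, so a caller would typically keep only groups with two or more sentences.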
af
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/af')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 307 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
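The description says the corpus is built by populating a graph with sentences and equivalence links, then traversing it to extract paraphrase sets. One simple way to picture that step is as a connected-components computation. The sketch below is an illustration under that assumption, not the corpus authors' pipeline: it uses a union-find structure instead of an explicit traversal, omits the filtering and pruning steps, and the sentence ids and links are hypothetical.

```python
from collections import defaultdict


def paraphrase_sets(links):
    """Group sentence ids joined by equivalence links into sets."""
    parent = {}

    def find(x):
        # Register unseen ids as their own root, then walk up with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    components = defaultdict(set)
    for node in parent:
        components[find(node)].add(node)
    # A set needs at least two sentences to contain a paraphrase.
    return [members for members in components.values() if len(members) > 1]


# Hypothetical equivalence links between sentence ids.
example_links = [("s1", "s2"), ("s2", "s3"), ("s4", "s5")]
example_sets = sorted(sorted(members) for members in paraphrase_sets(example_links))
```

Here `s1`, `s2` and `s3` end up in one set because the links are treated as transitive, which is also how incorrect or near-paraphrases can enter a set through a chain of individually plausible links.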
ar
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/ar')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 6446 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
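In the schema above, four fields are plain strings, while `lists` and `tags` are sequences of strings whose `length` of -1 indicates a variable number of items. A minimal sketch of checking a row against that shape, assuming rows are available as plain Python dicts; the helper name `conforms` and the sample rows are hypothetical:

```python
STRING_FIELDS = ("paraphrase_set_id", "sentence_id", "paraphrase", "language")
SEQUENCE_FIELDS = ("lists", "tags")


def conforms(record):
    """Return True if the record has exactly the schema's fields and types."""
    if set(record) != set(STRING_FIELDS) | set(SEQUENCE_FIELDS):
        return False
    if not all(isinstance(record[key], str) for key in STRING_FIELDS):
        return False
    # Sequence fields may be empty or hold any number of strings.
    return all(
        isinstance(record[key], list)
        and all(isinstance(item, str) for item in record[key])
        for key in SEQUENCE_FIELDS
    )


# Hypothetical rows (illustrative only, not taken from the corpus).
good_row = {"paraphrase_set_id": "7", "sentence_id": "42", "paraphrase": "Hello.",
            "lists": [], "tags": ["greeting"], "language": "en"}
bad_row = dict(good_row, sentence_id=42)  # integer id violates the string dtype
```

Note that both identifier fields are typed as strings, so numeric comparisons or sorting by id would need an explicit conversion.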
az
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/az')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 624 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
be
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/be')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 1512 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
ber
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/ber')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 67484 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
bg
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/bg')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 6324 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
bn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/bn')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 1440 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
br
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/br')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 2536 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
ca
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/ca')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 518 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
cbk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/cbk')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 262 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
cmn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/cmn')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 12549 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
cs
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/cs')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 6659 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
da
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/da')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 11220 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
de
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/de')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 125091 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
el
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/el')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 10072 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
en
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/en')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 158053 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
eo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/eo')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License : Creative Commons Attribution 2.0 Generic
- Version : 1.0.0
- Splits :
Split | Examples |
---|---|
'train' | 207105 |
- Features :
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
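The description says the corpus is built by linking Tatoeba sentences that mean the same thing and then traversing the resulting graph to extract paraphrase sets. The sketch below illustrates one plausible reading of that idea: connected components computed with a union-find structure, then split by language. It is not the corpus authors' implementation. The function name `paraphrase_sets`, the input shapes, and the sample sentences are all assumptions made for illustration, and the filtering and pruning steps mentioned in the description are not modelled.

```python
from collections import defaultdict


def paraphrase_sets(sentences, links):
    """Group sentences connected by equivalence links, per language.

    sentences: dict mapping a sentence id to a (language, text) tuple.
    links: iterable of (id_a, id_b) pairs meaning "same meaning".
    Both shapes are hypothetical, chosen only for this sketch.
    """
    parent = {sid: sid for sid in sentences}

    def find(x):
        # Follow parent pointers to the component root, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the components of every linked pair.
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Bucket texts by (component root, language).
    groups = defaultdict(lambda: defaultdict(list))
    for sid, (lang, text) in sentences.items():
        groups[find(sid)][lang].append(text)

    # A paraphrase set needs at least two sentences in the same language.
    result = []
    for by_lang in groups.values():
        for lang, texts in by_lang.items():
            if len(texts) >= 2:
                result.append((lang, sorted(texts)))
    return sorted(result)


# Hypothetical toy data: two English and two French sentences linked
# through translation links, plus one unlinked sentence.
sentences = {
    1: ("en", "Hello."),
    2: ("en", "Hi."),
    3: ("fr", "Salut."),
    4: ("fr", "Bonjour."),
    5: ("en", "Goodbye."),
}
links = [(1, 3), (2, 3), (3, 4)]

example_sets = paraphrase_sets(sentences, links)
```

In this toy run the two English sentences end up in one set because both are linked to the same French sentence, which is how translation links can yield same-language paraphrases.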
es
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/es')
- Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License : Creative Commons Attribution 2.0 Generic
- Version : 1.0.0
- Splits :
Split | Examples |
---|---|
'train' | 85064 |
- Features :
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
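Each record in the feature schema above carries a `paraphrase_set_id` alongside the `paraphrase` text. Assuming that records sharing a `paraphrase_set_id` are paraphrases of one another (which the field name suggests but this page does not state), sentence pairs can be derived as in the following minimal sketch. The helper `paraphrase_pairs` and the sample records are hypothetical, and the records are assumed to be plain Python dicts rather than whatever tensor or byte types a particular loader returns.

```python
from collections import defaultdict
from itertools import combinations


def paraphrase_pairs(records):
    """Return all unordered sentence pairs within each paraphrase set.

    records: iterable of dicts with at least the keys
    "paraphrase_set_id" and "paraphrase" (plain strings assumed).
    """
    by_set = defaultdict(list)
    for rec in records:
        by_set[rec["paraphrase_set_id"]].append(rec["paraphrase"])

    pairs = []
    # Sort set ids so the output order is deterministic.
    for set_id in sorted(by_set):
        pairs.extend(combinations(by_set[set_id], 2))
    return pairs


# Hypothetical records mirroring the six fields of the schema.
records = [
    {"paraphrase_set_id": "1", "sentence_id": "10", "paraphrase": "I am tired.",
     "lists": [], "tags": [], "language": "en"},
    {"paraphrase_set_id": "1", "sentence_id": "11", "paraphrase": "I'm tired.",
     "lists": [], "tags": [], "language": "en"},
    {"paraphrase_set_id": "1", "sentence_id": "12", "paraphrase": "I feel tired.",
     "lists": [], "tags": [], "language": "en"},
    {"paraphrase_set_id": "2", "sentence_id": "20", "paraphrase": "Thanks.",
     "lists": [], "tags": [], "language": "en"},
    {"paraphrase_set_id": "2", "sentence_id": "21", "paraphrase": "Thank you.",
     "lists": [], "tags": [], "language": "en"},
]

example_pairs = paraphrase_pairs(records)
```

A set of n sentences yields n(n-1)/2 pairs, so large sets can dominate the output; capping pairs per set may be worth considering for training data.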
et
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/et')
- Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License : Creative Commons Attribution 2.0 Generic
- Version : 1.0.0
- Splits :
Split | Examples |
---|---|
'train' | 241 |
- Features :
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
eu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/eu')
- Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License : Creative Commons Attribution 2.0 Generic
- Version : 1.0.0
- Splits :
Split | Examples |
---|---|
'train' | 573 |
- Features :
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
fi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/fi')
- Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License : Creative Commons Attribution 2.0 Generic
- Version : 1.0.0
- Splits :
Split | Examples |
---|---|
'train' | 31753 |
- Features :
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
fr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/fr')
- Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License : Creative Commons Attribution 2.0 Generic
- Version : 1.0.0
- Splits :
Split | Examples |
---|---|
'train' | 116733 |
- Features :
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
gl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/gl')
- Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License : Creative Commons Attribution 2.0 Generic
- Version : 1.0.0
- Splits :
Split | Examples |
---|---|
'train' | 351 |
- Features :
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
gos
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/gos')
- Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License : Creative Commons Attribution 2.0 Generic
- Version : 1.0.0
- Splits :
Split | Examples |
---|---|
'train' | 279 |
- Features :
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
he
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/he')
- Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License : Creative Commons Attribution 2.0 Generic
- Version : 1.0.0
- Splits :
Split | Examples |
---|---|
'train' | 68350 |
- Features :
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
hi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/hi')
- Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License : Creative Commons Attribution 2.0 Generic
- Version : 1.0.0
- Splits :
Split | Examples |
---|---|
'train' | 1913 |
- Features :
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
hr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/hr')
- Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License : Creative Commons Attribution 2.0 Generic
- Version : 1.0.0
- Splits :
Split | Examples |
---|---|
'train' | 505 |
- Features :
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
hu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/hu')
- Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License : Creative Commons Attribution 2.0 Generic
- Version : 1.0.0
- Splits :
Split | Examples |
---|---|
'train' | 67964 |
- Features :
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
hy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/hy')
- Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License : Creative Commons Attribution 2.0 Generic
- Version : 1.0.0
- Splits :
Split | Examples |
---|---|
'train' | 603 |
- Features :
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
ia
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/ia')
- Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License : Creative Commons Attribution 2.0 Generic
- Version : 1.0.0
- Splits :
Split | Examples |
---|---|
'train' | 2548 |
- Features :
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
id
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/id')
- Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License : Creative Commons Attribution 2.0 Generic
- Version : 1.0.0
- Splits :
Split | Examples |
---|---|
'train' | 1602 |
- Features :
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
ie
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/ie')
- Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License : Creative Commons Attribution 2.0 Generic
- Version : 1.0.0
- Splits :
Split | Examples |
---|---|
'train' | 488 |
- Features :
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
io
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/io')
- Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License : Creative Commons Attribution 2.0 Generic
- Version : 1.0.0
- Splits :
Split | Examples |
---|---|
'train' | 480 |
- Features :
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
is
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/is')
- Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License : Creative Commons Attribution 2.0 Generic
- Version : 1.0.0
- Splits :
Split | Examples |
---|---|
'train' | 1641 |
- Features :
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
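The feature schema above suggests that records sharing a `paraphrase_set_id` belong to the same paraphrase set. A minimal sketch of regrouping records by that field follows. The rows are hypothetical and invented for illustration (they are not taken from the corpus), and `group_by_set` is an assumed helper name, not part of TFDS.

```python
from collections import defaultdict

# Hypothetical rows that follow the feature schema listed above.
# The IDs and sentences are invented for illustration only.
rows = [
    {"paraphrase_set_id": "1", "sentence_id": "10", "paraphrase": "Hello.",
     "lists": [], "tags": [], "language": "en"},
    {"paraphrase_set_id": "1", "sentence_id": "11", "paraphrase": "Hi.",
     "lists": [], "tags": [], "language": "en"},
    {"paraphrase_set_id": "2", "sentence_id": "20", "paraphrase": "Thanks.",
     "lists": [], "tags": [], "language": "en"},
]


def group_by_set(records):
    """Map each paraphrase_set_id to the list of its paraphrase strings."""
    groups = defaultdict(list)
    for record in records:
        groups[record["paraphrase_set_id"]].append(record["paraphrase"])
    return dict(groups)


paraphrase_sets = group_by_set(rows)
for set_id, sentences in paraphrase_sets.items():
    print(set_id, sentences)
```

The same function should work on any iterable of dictionaries with these keys, for example records decoded from the loaded dataset, provided the string fields have been converted from bytes first.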
it
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/it')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 198919 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
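The description says the corpus is built by populating a graph with sentences and equivalence links, then traversing it to extract sets of paraphrases. One plausible reading of that step is a connected-components search, sketched below. This is an assumption about the general idea, not the authors' actual pipeline, and it omits the filtering and pruning steps. The sentence IDs, the links, and the function name `extract_sets` are hypothetical.

```python
from collections import defaultdict, deque


def extract_sets(sentences, links):
    """Return the connected components of the equivalence graph as sorted lists."""
    adjacency = defaultdict(set)
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    components = []
    for start in sentences:
        if start in seen:
            continue
        # Breadth-first traversal from an unvisited sentence.
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(sorted(component))
    return components


# Hypothetical sentence IDs and "means the same thing" links.
sentences = ["s1", "s2", "s3", "s4", "s5", "s6"]
links = [("s1", "s2"), ("s2", "s3"), ("s4", "s5")]

# Keep only components with at least two sentences; a lone sentence has no paraphrase.
sets_found = [c for c in extract_sets(sentences, links) if len(c) > 1]
print(sets_found)
```

Breadth-first search is used here only because it is short to write; a union-find structure would give the same components.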
ja
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/ja')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 44267 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
jbo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/jbo')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 2704 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
kab
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/kab')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 15944 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
ko
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/ko')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 503 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
kw
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/kw')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 1328 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
la
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/la')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 6889 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
lfn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/lfn')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 2313 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
lt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/lt')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 8042 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
mk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/mk')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 14678 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
mr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/mr')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 16413 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
nb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/nb')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 1094 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
nds
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/nds')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 2633 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
nl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/nl')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 23561 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
orv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/orv')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 471 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
ota
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/ota')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 486 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
pes
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/pes')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 4285 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
pl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/pl')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License : Creative Commons Attribution 2.0 Generic
- Version : 1.0.0
- Splits :
Split | Examples |
---|---|
'train' | 22391 |
- Features :
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
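The description says the corpus is built by populating a graph with sentences and equivalence links, then traversing it to extract sets of paraphrases. The sketch below illustrates one way such a traversal could work, and it is not the authors' implementation. It assumes that a paraphrase set is the group of same-language sentences inside one connected component of the link graph, and it uses a union-find structure in place of an explicit traversal. The function name `paraphrase_sets` and the sample data are hypothetical, and the filtering and pruning steps mentioned in the description are not modeled.

```python
from collections import defaultdict


def paraphrase_sets(sentences, links):
    """Group same-language sentences that fall in one connected component.

    sentences: dict mapping sentence id -> (language, text)
    links: iterable of (id_a, id_b) equivalence links
    """
    parent = {sid: sid for sid in sentences}

    def find(x):
        # Walk to the component root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # component root -> language -> list of sentence texts
    clusters = defaultdict(lambda: defaultdict(list))
    for sid, (lang, text) in sentences.items():
        clusters[find(sid)][lang].append(text)

    result = []
    for by_language in clusters.values():
        for lang, texts in by_language.items():
            if len(texts) > 1:  # a lone sentence has no paraphrase
                result.append((lang, sorted(texts)))
    return sorted(result)


# Invented sample data: two English sentences linked through one French sentence.
sample_sentences = {
    1: ("en", "I'm hungry."),
    2: ("fr", "J'ai faim."),
    3: ("en", "I am hungry."),
    4: ("en", "It is raining."),
    5: ("fr", "Il pleut."),
}
sample_links = [(1, 2), (2, 3), (4, 5)]
sets_found = paraphrase_sets(sample_sentences, sample_links)
```

In this sample only the two English "hungry" sentences form a paraphrase set; the second component holds one sentence per language and so yields nothing.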
pt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/pt')
- Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License : Creative Commons Attribution 2.0 Generic
- Version : 1.0.0
- Splits :
Split | Examples |
---|---|
'train' | 78430 |
- Features :
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
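The feature description above is plain JSON, so it can be read programmatically. The sketch below parses a compact copy of it, summarizes each field, and checks whether a record matches. The helpers `describe` and `conforms` are hypothetical names, and the check assumes every leaf `dtype` is `string`, as is the case in the listing above.

```python
import json

# Compact copy of the feature description shown above.
SCHEMA_JSON = """
{
  "paraphrase_set_id": {"dtype": "string", "id": null, "_type": "Value"},
  "sentence_id": {"dtype": "string", "id": null, "_type": "Value"},
  "paraphrase": {"dtype": "string", "id": null, "_type": "Value"},
  "lists": {"feature": {"dtype": "string", "id": null, "_type": "Value"},
            "length": -1, "id": null, "_type": "Sequence"},
  "tags": {"feature": {"dtype": "string", "id": null, "_type": "Value"},
           "length": -1, "id": null, "_type": "Sequence"},
  "language": {"dtype": "string", "id": null, "_type": "Value"}
}
"""


def describe(schema):
    """Return a short human-readable type for each field."""
    summary = {}
    for name, spec in schema.items():
        if spec["_type"] == "Sequence":
            summary[name] = "list of " + spec["feature"]["dtype"]
        else:
            summary[name] = spec["dtype"]
    return summary


def conforms(record, schema):
    """Check field names and types; assumes all leaf dtypes are 'string'."""
    if set(record) != set(schema):
        return False
    for name, spec in schema.items():
        value = record[name]
        if spec["_type"] == "Sequence":
            if not isinstance(value, list):
                return False
            if not all(isinstance(item, str) for item in value):
                return False
        elif not isinstance(value, str):
            return False
    return True


schema = json.loads(SCHEMA_JSON)
summary = describe(schema)

# Invented records for illustration only.
good = {"paraphrase_set_id": "1", "sentence_id": "10", "paraphrase": "Hello.",
        "lists": [], "tags": ["greeting"], "language": "en"}
bad = dict(good, tags="greeting")  # tags must be a list of strings
```

Both identifier fields are strings rather than integers, so code that compares or sorts them numerically needs an explicit conversion.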
rn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/rn')
- Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License : Creative Commons Attribution 2.0 Generic
- Version : 1.0.0
- Splits :
Split | Examples |
---|---|
'train' | 648 |
- Features :
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
ro
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/ro')
- Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License : Creative Commons Attribution 2.0 Generic
- Version : 1.0.0
- Splits :
Split | Examples |
---|---|
'train' | 2092 |
- Features :
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
ru
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/ru')
- Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License : Creative Commons Attribution 2.0 Generic
- Version : 1.0.0
- Splits :
Split | Examples |
---|---|
'train' | 251263 |
- Features :
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
sl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/sl')
- Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License : Creative Commons Attribution 2.0 Generic
- Version : 1.0.0
- Splits :
Split | Examples |
---|---|
'train' | 706 |
- Features :
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
sr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/sr')
- Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License : Creative Commons Attribution 2.0 Generic
- Version : 1.0.0
- Splits :
Split | Examples |
---|---|
'train' | 8175 |
- Features :
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
sv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/sv')
- Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License : Creative Commons Attribution 2.0 Generic
- Version : 1.0.0
- Splits :
Split | Examples |
---|---|
'train' | 7005 |
- Features :
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
tk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/tk')
- Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License : Creative Commons Attribution 2.0 Generic
- Version : 1.0.0
- Splits :
Split | Examples |
---|---|
'train' | 1165 |
- Features :
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
tl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/tl')
- Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License : Creative Commons Attribution 2.0 Generic
- Version : 1.0.0
- Splits :
Split | Examples |
---|---|
'train' | 1017 |
- Features :
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
tlh
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/tlh')
- Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License : Creative Commons Attribution 2.0 Generic
- Version : 1.0.0
- Splits :
Split | Examples |
---|---|
'train' | 2804 |
- Features :
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
toki
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/toki')
- Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License : Creative Commons Attribution 2.0 Generic
- Version : 1.0.0
- Splits :
Split | Examples |
---|---|
'train' | 3738 |
- Features :
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
tr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/tr')
- Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License : Creative Commons Attribution 2.0 Generic
- Version : 1.0.0
- Splits :
Split | Examples |
---|---|
'train' | 142088 |
- Features :
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
tt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/tt')
- Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License : Creative Commons Attribution 2.0 Generic
- Version : 1.0.0
- Splits :
Split | Examples |
---|---|
'train' | 2398 |
- Features :
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
ug
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/ug')
- Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License : Creative Commons Attribution 2.0 Generic
- Version : 1.0.0
- Splits :
Split | Examples |
---|---|
'train' | 1183 |
- Features :
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
uk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/uk')
- Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License : Creative Commons Attribution 2.0 Generic
- Version : 1.0.0
- Splits :
Split | Examples |
---|---|
'train' | 54431 |
- Features :
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
ur
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/ur')
- Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License : Creative Commons Attribution 2.0 Generic
- Version : 1.0.0
- Splits :
Split | Examples |
---|---|
'train' | 252 |
- Features :
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
vi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/vi')
- Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 962 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
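The description says the corpus is built by populating a graph with sentences and equivalence links and then traversing it to extract sets of paraphrases. A minimal sketch of that idea follows, assuming the traversal amounts to finding connected components and that single-sentence components are discarded. Both are simplifying assumptions, not the authors' confirmed procedure, and the language-independent filters and pruning steps are left out. The function name `extract_paraphrase_sets` and the toy identifiers are hypothetical.

```python
from collections import defaultdict


def extract_paraphrase_sets(sentence_ids, links):
    """Return connected components (size > 1) of an undirected link graph.

    sentence_ids: iterable of hashable sentence identifiers.
    links: iterable of (a, b) pairs meaning "a and b mean the same thing".
    """
    neighbours = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    paraphrase_sets = []
    for start in sentence_ids:
        if start in seen:
            continue
        # Depth-first traversal collecting everything reachable from start.
        component = set()
        stack = [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        # A sentence with no equivalent has no paraphrase, so it is skipped.
        if len(component) > 1:
            paraphrase_sets.append(component)
    return paraphrase_sets


if __name__ == "__main__":
    print(extract_paraphrase_sets(["a", "b", "c", "d"], [("a", "b"), ("b", "c")]))
```

In the toy call, "a", "b" and "c" end up in one set because the links chain them together, while "d" has no link and is dropped.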
vo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/vo')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 328 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
war
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/war')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 327 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
wuu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/wuu')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 408 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
yue
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/yue')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 561 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}