References:
all_languages
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/all_languages')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 1926192 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
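The description above says the corpus is built by populating a graph with sentences and equivalence links, then traversing it to extract paraphrase sets. The following is only a minimal sketch of that idea, not the corpus authors' pipeline. It assumes a paraphrase set is the group of same-language sentences inside one connected component of the link graph, and it omits the filtering and pruning steps. The function name `paraphrase_sets` and the toy sentences are hypothetical.

```python
from collections import defaultdict


def paraphrase_sets(sentences, links):
    """Group same-language sentences that are connected through equivalence links.

    sentences: dict mapping sentence_id -> (language, text)
    links: iterable of (id_a, id_b) pairs marking "means the same thing"
    Assumption: a paraphrase set is the same-language part of a connected component.
    """
    parent = {sid: sid for sid in sentences}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(list)
    for sid, (lang, _text) in sentences.items():
        groups[(find(sid), lang)].append(sid)

    # A set needs at least two sentences to contain a paraphrase pair.
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


# Toy data, invented for illustration only.
sentences = {
    1: ("en", "Hello."),
    2: ("en", "Hi."),
    3: ("fr", "Salut."),
    4: ("en", "Goodbye."),
}
links = [(1, 3), (2, 3)]
print(paraphrase_sets(sentences, links))  # [[1, 2]]
```

Sentences 1 and 2 are never linked directly. They end up in the same set because both are linked to sentence 3.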
af
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/af')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 307 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
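Each record carries a `paraphrase_set_id`, so sentences that paraphrase one another can be collected by grouping on that field. The sketch below works on plain Python dictionaries shaped like the feature schema above. The helper name `group_by_set` and the sample rows are hypothetical and are not taken from the corpus.

```python
from collections import defaultdict


def group_by_set(records):
    """Map each paraphrase_set_id to the list of its paraphrase strings."""
    sets = defaultdict(list)
    for rec in records:
        sets[rec["paraphrase_set_id"]].append(rec["paraphrase"])
    return dict(sets)


# Invented rows that follow the schema field names; not real corpus data.
records = [
    {"paraphrase_set_id": "1", "sentence_id": "101", "paraphrase": "Hello.",
     "lists": [], "tags": [], "language": "en"},
    {"paraphrase_set_id": "1", "sentence_id": "102", "paraphrase": "Hi.",
     "lists": [], "tags": [], "language": "en"},
    {"paraphrase_set_id": "2", "sentence_id": "103", "paraphrase": "Thanks.",
     "lists": [], "tags": [], "language": "en"},
]
print(group_by_set(records))  # {'1': ['Hello.', 'Hi.'], '2': ['Thanks.']}
```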
ar
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/ar')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 6446 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
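The feature schema above declares four string fields and two variable-length sequences of strings (`"length": -1`). As a rough illustration, a record that has already been decoded into plain Python values could be checked against that shape as follows. The function name `schema_problems` is hypothetical, and records read directly from TFDS may arrive as byte strings or tensors and would need decoding first.

```python
# Field names are taken from the feature schema; the check itself is an illustrative sketch.
STRING_FIELDS = ("paraphrase_set_id", "sentence_id", "paraphrase", "language")
SEQUENCE_FIELDS = ("lists", "tags")


def schema_problems(record):
    """Return a list of human-readable mismatches between a record and the schema."""
    problems = []
    for name in STRING_FIELDS:
        if not isinstance(record.get(name), str):
            problems.append(f"{name}: expected a string")
    for name in SEQUENCE_FIELDS:
        value = record.get(name)
        is_sequence = isinstance(value, (list, tuple))
        if not is_sequence or not all(isinstance(v, str) for v in value):
            problems.append(f"{name}: expected a sequence of strings")
    return problems


# Invented records for illustration only.
good = {"paraphrase_set_id": "1", "sentence_id": "101", "paraphrase": "Hello.",
        "lists": [], "tags": ["greeting"], "language": "en"}
bad = {"paraphrase_set_id": 1, "sentence_id": "101", "paraphrase": "Hello.",
       "lists": [], "tags": "greeting", "language": "en"}

print(schema_problems(good))  # []
print(schema_problems(bad))
```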
az
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/az')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 624 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
be
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/be')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 1512 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
ber
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/ber')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 67484 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
bg
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/bg')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 6324 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
bn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/bn')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 1440 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
br
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/br')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 2536 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
ca
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/ca')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 518 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
cbk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/cbk')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 262 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
cmn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/cmn')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 12549 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
cs
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/cs')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 6659 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
da
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/da')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 11220 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
de
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/de')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 125091 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
el
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/el')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 10072 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
en
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/en')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 158053 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
eo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/eo')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 207105 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
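The schema lists four plain string fields and two variable-length sequences of strings (`lists` and `tags`, with `length` of -1). A small sketch of a conformance check for a record represented as a Python dict is given below. The function name `schema_errors` and the dict representation are assumptions made for illustration; this is not part of TFDS or of the Hugging Face `datasets` library.

```python
# Field names are copied from the feature schema above.
STRING_FIELDS = ("paraphrase_set_id", "sentence_id", "paraphrase", "language")
SEQUENCE_FIELDS = ("lists", "tags")


def schema_errors(record):
    """Return a list of problems; the list is empty if the record conforms."""
    errors = []
    for name in STRING_FIELDS:
        if not isinstance(record.get(name), str):
            errors.append(f"{name}: expected a string")
    for name in SEQUENCE_FIELDS:
        value = record.get(name)
        is_string_list = isinstance(value, list) and all(
            isinstance(item, str) for item in value
        )
        if not is_string_list:
            errors.append(f"{name}: expected a list of strings")
    return errors
```

A check like this can be useful before writing records to another format, since identifiers such as `sentence_id` are declared as strings rather than integers.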
es
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/es')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 85064 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
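The description says that the corpus is built by populating a graph with sentences and equivalence links and then traversing it to extract paraphrase sets. The sketch below shows one plausible reading of that idea: find connected components with a breadth-first traversal and keep same-language groups of at least two sentences. It is an illustrative assumption, not the authors' procedure; the filters and pruning steps mentioned in the description are not reproduced, and the function name `paraphrase_sets` and its input shapes are hypothetical.

```python
from collections import defaultdict, deque


def paraphrase_sets(sentences, links):
    """Group linked sentences into same-language sets.

    sentences: dict mapping sentence id to language code.
    links: iterable of (id_a, id_b) equivalence links.
    Returns a list of sets of sentence ids, each with at least two members.
    """
    neighbours = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    result = []
    for start in sentences:
        if start in seen:
            continue
        # Breadth-first traversal of one connected component.
        seen.add(start)
        component = [start]
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbours[node]:
                if nxt not in seen and nxt in sentences:
                    seen.add(nxt)
                    component.append(nxt)
                    queue.append(nxt)
        # Sentences of the same language in one component are candidates.
        by_language = defaultdict(set)
        for sentence_id in component:
            by_language[sentences[sentence_id]].add(sentence_id)
        result.extend(s for s in by_language.values() if len(s) > 1)
    return result
```

In this reading, two English sentences that are both linked to the same French sentence end up in one component and are therefore reported as paraphrases of each other.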
et
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/et')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 241 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
eu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/eu')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 573 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
fi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/fi')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 31753 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
fr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/fr')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 116733 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
gl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/gl')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 351 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
gos
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/gos')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 279 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
he
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/he')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 68350 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
hi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/hi')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 1913 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
hr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/hr')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 505 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
hu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/hu')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 67964 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
hy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/hy')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 603 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
ia
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/ia')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 2548 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
id
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/id')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 1602 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
ie
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/ie')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 488 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
io
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/io')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 480 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
is
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/is')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 1641 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
it
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/it')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 198919 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
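The description says the corpus is built by populating a graph with sentences and equivalence links and then traversing it to extract paraphrase sets. The sketch below illustrates that general idea only: it assumes links are undirected and uses a breadth-first search for connected components, which is a simplification chosen here and not necessarily the authors' traversal. The filtering and pruning steps are not modelled, and the sentence ids and the function name `paraphrase_sets_from_links` are hypothetical.

```python
from collections import defaultdict, deque


def paraphrase_sets_from_links(links):
    """Group sentence ids into sets connected by equivalence links (BFS)."""
    graph = defaultdict(set)
    for a, b in links:
        # Assumption: an equivalence link holds in both directions.
        graph[a].add(b)
        graph[b].add(a)

    seen = set()
    components = []
    for start in sorted(graph):
        if start in seen:
            continue
        queue = deque([start])
        seen.add(start)
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in sorted(graph[node]):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(sorted(component))
    return components


# Hypothetical equivalence links between sentence ids.
links = [("s1", "s2"), ("s2", "s3"), ("s4", "s5")]
sets_found = paraphrase_sets_from_links(links)
```

Here `s1`, `s2` and `s3` end up in one set because `s2` links the other two, while `s4` and `s5` form a second set.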
ja
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/ja')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 44267 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
jbo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/jbo')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 2704 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
kab
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/kab')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 15944 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
ko
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/ko')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 503 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
kw
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/kw')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 1328 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
la
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/la')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 6889 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
lfn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/lfn')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 2313 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
lt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/lt')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 8042 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
mk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/mk')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 14678 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
mr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/mr')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 16413 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
nb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/nb')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 1094 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
nds
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/nds')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 2633 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
nl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/nl')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 23561 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
orv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/orv')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 471 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
ota
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/ota')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 486 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
pes
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/pes')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 4285 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
pl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/pl')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 22391 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
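The description says the corpus is built by populating a graph with sentences and equivalence links and then traversing it to extract paraphrase sets. The following is only an illustrative sketch of that idea, not the authors' pipeline: it assumes the links are undirected, reads each connected component as one equivalence cluster, and treats same-language sentences inside a cluster as candidate paraphrases. The function name `paraphrase_sets` and the input shapes are hypothetical, and the filtering and pruning steps mentioned in the description are not modelled.

```python
from collections import defaultdict


def paraphrase_sets(sentences, links):
    """Group same-language sentences that fall in one connected component.

    sentences: dict mapping sentence id -> (language, text)  (assumed shape)
    links: iterable of (id_a, id_b) equivalence links          (assumed shape)
    """
    # Build an undirected adjacency map from the equivalence links.
    adjacency = defaultdict(set)
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    result = []
    for start in sentences:
        if start in seen:
            continue
        # Depth-first traversal collects one connected component.
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour in sentences and neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        # Inside a component, sentences sharing a language form a candidate set.
        by_language = defaultdict(list)
        for node in component:
            language, text = sentences[node]
            by_language[language].append(text)
        for language, texts in by_language.items():
            if len(texts) > 1:
                result.append((language, sorted(texts)))
    return result
```

Singleton groups are dropped because a set with one sentence contains no paraphrase pair.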
pt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/pt')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 78430 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
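In the feature schema, each example is a single sentence, and `paraphrase_set_id` is the field that ties it to the other sentences in its set. A minimal sketch of regrouping examples into sets, assuming the rows have already been materialised as plain Python dicts keyed by the schema's field names; the helper `group_by_set` and the sample rows are hypothetical and do not come from the dataset.

```python
from collections import defaultdict


def group_by_set(rows):
    """Collect paraphrase texts under their shared paraphrase_set_id."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["paraphrase_set_id"]].append(row["paraphrase"])
    return dict(groups)


# Invented rows that follow the schema's field names and string dtypes.
sample_rows = [
    {"paraphrase_set_id": "1", "sentence_id": "10", "paraphrase": "Hello.",
     "lists": [], "tags": [], "language": "en"},
    {"paraphrase_set_id": "1", "sentence_id": "11", "paraphrase": "Hi.",
     "lists": [], "tags": [], "language": "en"},
    {"paraphrase_set_id": "2", "sentence_id": "12", "paraphrase": "Thanks.",
     "lists": [], "tags": [], "language": "en"},
]

grouped = group_by_set(sample_rows)
```

All identifier fields are declared as strings in the schema, so the set ids are kept as strings rather than converted to integers.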
rn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/rn')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 648 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
ro
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/ro')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 2092 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
ru
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/ru')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 251263 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
sl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/sl')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 706 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
sr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/sr')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 8175 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
sv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/sv')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 7005 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
tk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/tk')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 1165 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
tl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/tl')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 1017 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
tlh
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/tlh')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 2804 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
toki
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/toki')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 3738 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
tr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/tr')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 142088 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
tt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/tt')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 2398 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
ug
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/ug')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 1183 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
uk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/uk')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 54431 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
ur
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/ur')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 252 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
vi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/vi')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License: Creative Commons Attribution 2.0 Generic
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'train' | 962 |
- Features:
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
vo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/vo')
- Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License : Creative Commons Attribution 2.0 Generic
- Version : 1.0.0
- Splits :
Split | Examples |
---|---|
'train' | 328 |
- Features :
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
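The description above says the corpus is built by populating a graph with sentences and equivalence links, then traversing it to extract paraphrase sets. One way to picture that step is a connected-components search. This is only a sketch under two assumptions: links are treated as undirected edges, and a paraphrase set is a connected component. The corpus authors' actual traversal, filters, and pruning steps are not reproduced here, and the `paraphrase_sets` function and the sentence IDs are hypothetical.

```python
from collections import defaultdict, deque


def paraphrase_sets(links):
    """Return the connected components of an undirected link graph as sorted lists."""
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)

    seen = set()
    components = []
    for start in sorted(graph):
        if start in seen:
            continue
        # Breadth-first traversal from an unvisited sentence.
        queue = deque([start])
        seen.add(start)
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(sorted(component))
    return components


# Hypothetical equivalence links between sentence IDs.
example_links = [("s1", "s2"), ("s2", "s3"), ("s4", "s5")]
sets_found = paraphrase_sets(example_links)
```

Here `s1` and `s3` end up in the same set even though no direct link joins them, which is the kind of inferred paraphrase the description refers to.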
war
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/war')
- Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License : Creative Commons Attribution 2.0 Generic
- Version : 1.0.0
- Splits :
Split | Examples |
---|---|
'train' | 327 |
- Features :
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
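Many paraphrase models consume sentence pairs rather than sets. As a rough sketch, assuming the `paraphrase` strings of one `paraphrase_set_id` have already been collected into a list, every unordered pair can be enumerated with `itertools.combinations`. The `paraphrase_pairs` helper is hypothetical, and dropping duplicate strings is a design choice of this sketch, not something the dataset prescribes.

```python
from itertools import combinations


def paraphrase_pairs(sentences):
    """Return every unordered pair of distinct sentences from one paraphrase set."""
    unique = sorted(set(sentences))  # drop exact duplicates, fix a stable order
    return list(combinations(unique, 2))


# Hypothetical sentences standing in for one paraphrase set.
pairs = paraphrase_pairs(["b", "a", "c", "a"])
```

A set of n distinct sentences yields n(n-1)/2 pairs, so large sets may need sampling before training.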
wuu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/wuu')
- Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License : Creative Commons Attribution 2.0 Generic
- Version : 1.0.0
- Splits :
Split | Examples |
---|---|
'train' | 408 |
- Features :
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
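The schema above declares four string fields and two variable-length sequences of strings (`lists` and `tags`, with `length` set to -1). A small check like the one below can catch malformed rows early. It is a sketch that assumes records are plain Python dicts; the `matches_schema` helper and the sample rows are hypothetical.

```python
STRING_FIELDS = ("paraphrase_set_id", "sentence_id", "paraphrase", "language")
SEQUENCE_FIELDS = ("lists", "tags")


def matches_schema(record):
    """Check that a dict has exactly the schema's fields with the expected types."""
    if set(record) != set(STRING_FIELDS + SEQUENCE_FIELDS):
        return False
    if not all(isinstance(record[name], str) for name in STRING_FIELDS):
        return False
    # Sequence fields may hold any number of strings, including none.
    return all(
        isinstance(record[name], (list, tuple))
        and all(isinstance(item, str) for item in record[name])
        for name in SEQUENCE_FIELDS
    )


# Hypothetical rows: the first follows the schema, the second has an integer ID.
good_row = {"paraphrase_set_id": "7", "sentence_id": "42", "paraphrase": "Example.",
            "lists": ["demo"], "tags": [], "language": "en"}
bad_row = dict(good_row, sentence_id=42)
```

Note that the IDs are declared as strings in the schema, so numeric-looking values such as `"42"` should not be converted to integers before this check.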
yue
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:tapaco/yue')
- Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 – 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
- License : Creative Commons Attribution 2.0 Generic
- Version : 1.0.0
- Splits :
Split | Examples |
---|---|
'train' | 561 |
- Features :
{
"paraphrase_set_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"sentence_id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"paraphrase": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"lists": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"tags": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"language": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}