para_pat

References:

el-en

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:para_pat/el-en')

Description:

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

This dataset contains the developed parallel corpus from the open access Google
Patents dataset in 74 language pairs, comprising more than 68 million sentences
and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm
for the largest 22 language pairs, while the others were abstract (i.e. paragraph) aligned.

We demonstrate the capabilities of our corpus by training Neural Machine Translation
(NMT) models for the main 9 language pairs, with a total of 18 models.

License: CC BY 4.0
Version: 1.1.0
Splits:

Split	Examples
`'train'`	10855

Features:

{
    "index": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "family_id": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "el",
            "en"
        ],
        "id": null,
        "_type": "Translation"
    }
}
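The schema above can be read back with a small helper. The following is a minimal sketch, not part of the dataset's own tooling: `unpack_pair` and `_to_text` are hypothetical names, and it assumes that a decoded example is a dictionary whose `translation` entry maps each language code to a string (or to UTF-8 bytes when iterating as NumPy). The `index` and `family_id` fields are treated as optional, since some configurations on this page list only `translation`.

```python
from typing import Any, Dict, Optional, Tuple


def _to_text(value: Any) -> str:
    """Decode UTF-8 bytes to str; pass other values through str()."""
    return value.decode("utf-8") if isinstance(value, bytes) else str(value)


def unpack_pair(
    example: Dict[str, Any], src: str, tgt: str
) -> Tuple[str, str, Optional[int]]:
    """Return (source text, target text, family_id or None) for one example.

    Assumes example["translation"] is keyed by language code, as suggested
    by the "languages" list in the feature schema.
    """
    translation = example["translation"]
    family_id = example.get("family_id")  # absent in some configurations
    return (
        _to_text(translation[src]),
        _to_text(translation[tgt]),
        None if family_id is None else int(family_id),
    )


# Hand-written placeholder shaped like the schema; not real corpus data.
sample = {
    "index": 0,
    "family_id": 12345,
    "translation": {"el": b"<Greek abstract>", "en": b"<English abstract>"},
}
source_text, target_text, family = unpack_pair(sample, "el", "en")
```

With the real data, the same helper could be applied to examples drawn from `tfds.as_numpy(ds['train'].take(1))`, provided the decoded examples match the assumed layout.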

cs-en

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:para_pat/cs-en')

Description:

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

This dataset contains the developed parallel corpus from the open access Google
Patents dataset in 74 language pairs, comprising more than 68 million sentences
and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm
for the largest 22 language pairs, while the others were abstract (i.e. paragraph) aligned.

We demonstrate the capabilities of our corpus by training Neural Machine Translation
(NMT) models for the main 9 language pairs, with a total of 18 models.

License: CC BY 4.0
Version: 1.1.0
Splits:

Split	Examples
`'train'`	78977

Features:

{
    "index": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "family_id": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "cs",
            "en"
        ],
        "id": null,
        "_type": "Translation"
    }
}

en-hu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:para_pat/en-hu')

Description:

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

This dataset contains the developed parallel corpus from the open access Google
Patents dataset in 74 language pairs, comprising more than 68 million sentences
and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm
for the largest 22 language pairs, while the others were abstract (i.e. paragraph) aligned.

We demonstrate the capabilities of our corpus by training Neural Machine Translation
(NMT) models for the main 9 language pairs, with a total of 18 models.

License: CC BY 4.0
Version: 1.1.0
Splits:

Split	Examples
`'train'`	42629

Features:

{
    "index": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "family_id": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "en",
            "hu"
        ],
        "id": null,
        "_type": "Translation"
    }
}

en-ro

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:para_pat/en-ro')

Description:

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

This dataset contains the developed parallel corpus from the open access Google
Patents dataset in 74 language pairs, comprising more than 68 million sentences
and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm
for the largest 22 language pairs, while the others were abstract (i.e. paragraph) aligned.

We demonstrate the capabilities of our corpus by training Neural Machine Translation
(NMT) models for the main 9 language pairs, with a total of 18 models.

License: CC BY 4.0
Version: 1.1.0
Splits:

Split	Examples
`'train'`	48789

Features:

{
    "index": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "family_id": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "en",
            "ro"
        ],
        "id": null,
        "_type": "Translation"
    }
}

en-sk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:para_pat/en-sk')

Description:

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

This dataset contains the developed parallel corpus from the open access Google
Patents dataset in 74 language pairs, comprising more than 68 million sentences
and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm
for the largest 22 language pairs, while the others were abstract (i.e. paragraph) aligned.

We demonstrate the capabilities of our corpus by training Neural Machine Translation
(NMT) models for the main 9 language pairs, with a total of 18 models.

License: CC BY 4.0
Version: 1.1.0
Splits:

Split	Examples
`'train'`	23410

Features:

{
    "index": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "family_id": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "en",
            "sk"
        ],
        "id": null,
        "_type": "Translation"
    }
}

en-uk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:para_pat/en-uk')

Description:

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

This dataset contains the developed parallel corpus from the open access Google
Patents dataset in 74 language pairs, comprising more than 68 million sentences
and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm
for the largest 22 language pairs, while the others were abstract (i.e. paragraph) aligned.

We demonstrate the capabilities of our corpus by training Neural Machine Translation
(NMT) models for the main 9 language pairs, with a total of 18 models.

License: CC BY 4.0
Version: 1.1.0
Splits:

Split	Examples
`'train'`	89226

Features:

{
    "index": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "family_id": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "en",
            "uk"
        ],
        "id": null,
        "_type": "Translation"
    }
}

es-fr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:para_pat/es-fr')

Description:

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

This dataset contains the developed parallel corpus from the open access Google
Patents dataset in 74 language pairs, comprising more than 68 million sentences
and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm
for the largest 22 language pairs, while the others were abstract (i.e. paragraph) aligned.

We demonstrate the capabilities of our corpus by training Neural Machine Translation
(NMT) models for the main 9 language pairs, with a total of 18 models.

License: CC BY 4.0
Version: 1.1.0
Splits:

Split	Examples
`'train'`	32553

Features:

{
    "index": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "family_id": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "es",
            "fr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

fr-ru

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:para_pat/fr-ru')

Description:

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

This dataset contains the developed parallel corpus from the open access Google
Patents dataset in 74 language pairs, comprising more than 68 million sentences
and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm
for the largest 22 language pairs, while the others were abstract (i.e. paragraph) aligned.

We demonstrate the capabilities of our corpus by training Neural Machine Translation
(NMT) models for the main 9 language pairs, with a total of 18 models.

License: CC BY 4.0
Version: 1.1.0
Splits:

Split	Examples
`'train'`	10889

Features:

{
    "index": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "family_id": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "fr",
            "ru"
        ],
        "id": null,
        "_type": "Translation"
    }
}

de-fr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:para_pat/de-fr')

Description:

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

This dataset contains the developed parallel corpus from the open access Google
Patents dataset in 74 language pairs, comprising more than 68 million sentences
and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm
for the largest 22 language pairs, while the others were abstract (i.e. paragraph) aligned.

We demonstrate the capabilities of our corpus by training Neural Machine Translation
(NMT) models for the main 9 language pairs, with a total of 18 models.

License: CC BY 4.0
Version: 1.1.0
Splits:

Split	Examples
`'train'`	1167988

Features:

{
    "translation": {
        "languages": [
            "de",
            "fr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

en-ja

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:para_pat/en-ja')

Description:

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

This dataset contains the developed parallel corpus from the open access Google
Patents dataset in 74 language pairs, comprising more than 68 million sentences
and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm
for the largest 22 language pairs, while the others were abstract (i.e. paragraph) aligned.

We demonstrate the capabilities of our corpus by training Neural Machine Translation
(NMT) models for the main 9 language pairs, with a total of 18 models.

License: CC BY 4.0
Version: 1.1.0
Splits:

Split	Examples
`'train'`	6170339

Features:

{
    "translation": {
        "languages": [
            "en",
            "ja"
        ],
        "id": null,
        "_type": "Translation"
    }
}

en-es

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:para_pat/en-es')

Description:

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

This dataset contains the developed parallel corpus from the open access Google
Patents dataset in 74 language pairs, comprising more than 68 million sentences
and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm
for the largest 22 language pairs, while the others were abstract (i.e. paragraph) aligned.

We demonstrate the capabilities of our corpus by training Neural Machine Translation
(NMT) models for the main 9 language pairs, with a total of 18 models.

License: CC BY 4.0
Version: 1.1.0
Splits:

Split	Examples
`'train'`	649396

Features:

{
    "translation": {
        "languages": [
            "en",
            "es"
        ],
        "id": null,
        "_type": "Translation"
    }
}

en-fr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:para_pat/en-fr')

Description:

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

This dataset contains the developed parallel corpus from the open access Google
Patents dataset in 74 language pairs, comprising more than 68 million sentences
and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm
for the largest 22 language pairs, while the others were abstract (i.e. paragraph) aligned.

We demonstrate the capabilities of our corpus by training Neural Machine Translation
(NMT) models for the main 9 language pairs, with a total of 18 models.

License: CC BY 4.0
Version: 1.1.0
Splits:

Split	Examples
`'train'`	12223525

Features:

{
    "translation": {
        "languages": [
            "en",
            "fr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

de-en

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:para_pat/de-en')

Description:

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

This dataset contains the developed parallel corpus from the open access Google
Patents dataset in 74 language pairs, comprising more than 68 million sentences
and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm
for the largest 22 language pairs, while the others were abstract (i.e. paragraph) aligned.

We demonstrate the capabilities of our corpus by training Neural Machine Translation
(NMT) models for the main 9 language pairs, with a total of 18 models.

License: CC BY 4.0
Version: 1.1.0
Splits:

Split	Examples
`'train'`	2165054

Features:

{
    "translation": {
        "languages": [
            "de",
            "en"
        ],
        "id": null,
        "_type": "Translation"
    }
}

en-ko

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:para_pat/en-ko')

Description:

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

This dataset contains the developed parallel corpus from the open access Google
Patents dataset in 74 language pairs, comprising more than 68 million sentences
and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm
for the largest 22 language pairs, while the others were abstract (i.e. paragraph) aligned.

We demonstrate the capabilities of our corpus by training Neural Machine Translation
(NMT) models for the main 9 language pairs, with a total of 18 models.

License: CC BY 4.0
Version: 1.1.0
Splits:

Split	Examples
`'train'`	2324357

Features:

{
    "translation": {
        "languages": [
            "en",
            "ko"
        ],
        "id": null,
        "_type": "Translation"
    }
}

fr-ja

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:para_pat/fr-ja')

Description:

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

This dataset contains the developed parallel corpus from the open access Google
Patents dataset in 74 language pairs, comprising more than 68 million sentences
and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm
for the largest 22 language pairs, while the others were abstract (i.e. paragraph) aligned.

We demonstrate the capabilities of our corpus by training Neural Machine Translation
(NMT) models for the main 9 language pairs, with a total of 18 models.

License: CC BY 4.0
Version: 1.1.0
Splits:

Split	Examples
`'train'`	313422

Features:

{
    "translation": {
        "languages": [
            "fr",
            "ja"
        ],
        "id": null,
        "_type": "Translation"
    }
}

en-zh

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:para_pat/en-zh')

Description:

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

This dataset contains the developed parallel corpus from the open access Google
Patents dataset in 74 language pairs, comprising more than 68 million sentences
and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm
for the largest 22 language pairs, while the others were abstract (i.e. paragraph) aligned.

We demonstrate the capabilities of our corpus by training Neural Machine Translation
(NMT) models for the main 9 language pairs, with a total of 18 models.

License: CC BY 4.0
Version: 1.1.0
Splits:

Split	Examples
`'train'`	4897841

Features:

{
    "translation": {
        "languages": [
            "en",
            "zh"
        ],
        "id": null,
        "_type": "Translation"
    }
}

en-ru

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:para_pat/en-ru')

Description:

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

This dataset contains the developed parallel corpus from the open access Google
Patents dataset in 74 language pairs, comprising more than 68 million sentences
and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm
for the largest 22 language pairs, while the others were abstract (i.e. paragraph) aligned.

We demonstrate the capabilities of our corpus by training Neural Machine Translation
(NMT) models for the main 9 language pairs, with a total of 18 models.

License: CC BY 4.0
Version: 1.1.0
Splits:

Split	Examples
`'train'`	4296399

Features:

{
    "translation": {
        "languages": [
            "en",
            "ru"
        ],
        "id": null,
        "_type": "Translation"
    }
}

fr-ko

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:para_pat/fr-ko')

Description:

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

This dataset contains the developed parallel corpus from the open access Google
Patents dataset in 74 language pairs, comprising more than 68 million sentences
and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm
for the largest 22 language pairs, while the others were abstract (i.e. paragraph) aligned.

We demonstrate the capabilities of our corpus by training Neural Machine Translation
(NMT) models for the main 9 language pairs, with a total of 18 models.

License: CC BY 4.0
Version: 1.1.0
Splits:

Split	Examples
`'train'`	120607

Features:

{
    "index": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "family_id": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "fr",
            "ko"
        ],
        "id": null,
        "_type": "Translation"
    }
}

ru-uk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:para_pat/ru-uk')

Description:

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

This dataset contains the developed parallel corpus from the open access Google
Patents dataset in 74 language pairs, comprising more than 68 million sentences
and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm
for the largest 22 language pairs, while the others were abstract (i.e. paragraph) aligned.

We demonstrate the capabilities of our corpus by training Neural Machine Translation
(NMT) models for the main 9 language pairs, with a total of 18 models.

License: CC BY 4.0
Version: 1.1.0
Splits:

Split	Examples
`'train'`	85963

Features:

{
    "index": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "family_id": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "ru",
            "uk"
        ],
        "id": null,
        "_type": "Translation"
    }
}

en-pt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:para_pat/en-pt')

Description:

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

This dataset contains the developed parallel corpus from the open access Google
Patents dataset in 74 language pairs, comprising more than 68 million sentences
and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm
for the largest 22 language pairs, while the others were abstract (i.e. paragraph) aligned.

We demonstrate the capabilities of our corpus by training Neural Machine Translation
(NMT) models for the main 9 language pairs, with a total of 18 models.

License: CC BY 4.0
Version: 1.1.0
Splits:

Split	Examples
`'train'`	23121

Features:

{
    "index": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "family_id": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "en",
            "pt"
        ],
        "id": null,
        "_type": "Translation"
    }
}
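
Every load command on this page follows the pattern `huggingface:para_pat/<pair>`, and each pair is listed in one fixed order (for example `el-en`, not `en-el`). As a convenience, the sketch below builds the name from two language codes in either order. `LISTED_PAIRS` and `tfds_name` are hypothetical names introduced here for illustration; the set contains only the configurations shown in the load commands on this page, not all 74 pairs mentioned in the description.

```python
# Configurations taken from the tfds.load commands on this page.
LISTED_PAIRS = frozenset({
    "el-en", "cs-en", "en-hu", "en-ro", "en-sk", "en-uk", "es-fr",
    "fr-ru", "de-fr", "en-ja", "en-es", "en-fr", "de-en", "en-ko",
    "fr-ja", "en-zh", "en-ru", "fr-ko", "ru-uk", "en-pt",
})


def tfds_name(lang_a: str, lang_b: str) -> str:
    """Return the TFDS name for a language pair, trying both orders."""
    for pair in (f"{lang_a}-{lang_b}", f"{lang_b}-{lang_a}"):
        if pair in LISTED_PAIRS:
            return f"huggingface:para_pat/{pair}"
    raise ValueError(f"No listed configuration for {lang_a!r} and {lang_b!r}")


name = tfds_name("en", "el")
```

The resulting string can be passed to `tfds.load` in place of a hard-coded name.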