estremo

Riferimenti:

XNLI

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/XNLI')
  • Descrizione :
The Cross-lingual Natural Language Inference (XNLI) corpus is a crowd-sourced collection of 5,000 test and
2,500 dev pairs for the MultiNLI corpus. The pairs are annotated with textual entailment and translated into
14 languages: French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese,
Hindi, Swahili and Urdu. This results in 112.5k annotated pairs. Each premise can be associated with the
corresponding hypothesis in the 15 languages, summing up to more than 1.5M combinations. The corpus is made to
evaluate how to perform inference in any language (including low-resources ones like Swahili or Urdu) when only
English NLI data is available at training time. One solution is cross-lingual sentence encoding, for which XNLI
is an evaluation benchmark.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 75150
'validation' 37350
  • Caratteristiche :
{
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence1": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence2": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "gold_label": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

tydiqa

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/tydiqa')
  • Descrizione :
Gold passage task (GoldP): Given a passage that is guaranteed to contain the
             answer, predict the single contiguous span of characters that answers the question. This is more similar to
             existing reading comprehension datasets (as opposed to the information-seeking task outlined above).
             This task is constructed with two goals in mind: (1) more directly comparing with prior work and (2) providing
             a simplified way for researchers to use TyDi QA by providing compatibility with existing code for SQuAD 1.1,
             XQuAD, and MLQA. Toward these goals, the gold passage task differs from the primary task in several ways:
             only the gold answer passage is provided rather than the entire Wikipedia article;
             unanswerable questions have been discarded, similar to MLQA and XQuAD;
             we evaluate with the SQuAD 1.1 metrics like XQuAD; and
            Thai and Japanese are removed since the lack of whitespace breaks some tools.

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'train' 49881
'validation' 5077
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

Squadra

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/SQuAD')
  • Descrizione :
Stanford Question Answering Dataset (SQuAD) is a reading comprehension     dataset, consisting of questions posed by crowdworkers on a set of Wikipedia     articles, where the answer to every question is a segment of text, or span,     from the corresponding reading passage, or the question might be unanswerable.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'train' 87599
'validation' 10570
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

PAN-X.af

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAN-X.af')
  • Descrizione :
The WikiANN dataset (Pan et al. 2017) is a dataset with NER annotations for PER, ORG and LOC. It has been
constructed using the linked entities in Wikipedia pages for 282 different languages including Danish. The dataset
can be loaded with the DaNLP package:
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1000
'train' 5000
'validation' 1000
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "ner_tags": {
        "feature": {
            "num_classes": 7,
            "names": [
                "O",
                "B-PER",
                "I-PER",
                "B-ORG",
                "I-ORG",
                "B-LOC",
                "I-LOC"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "langs": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

PAN-X.ar

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAN-X.ar')
  • Descrizione :
The WikiANN dataset (Pan et al. 2017) is a dataset with NER annotations for PER, ORG and LOC. It has been
constructed using the linked entities in Wikipedia pages for 282 different languages including Danish. The dataset
can be loaded with the DaNLP package:
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 10000
'train' 20000
'validation' 10000
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "ner_tags": {
        "feature": {
            "num_classes": 7,
            "names": [
                "O",
                "B-PER",
                "I-PER",
                "B-ORG",
                "I-ORG",
                "B-LOC",
                "I-LOC"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "langs": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

PAN-X.bg

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAN-X.bg')
  • Descrizione :
The WikiANN dataset (Pan et al. 2017) is a dataset with NER annotations for PER, ORG and LOC. It has been
constructed using the linked entities in Wikipedia pages for 282 different languages including Danish. The dataset
can be loaded with the DaNLP package:
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 10000
'train' 20000
'validation' 10000
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "ner_tags": {
        "feature": {
            "num_classes": 7,
            "names": [
                "O",
                "B-PER",
                "I-PER",
                "B-ORG",
                "I-ORG",
                "B-LOC",
                "I-LOC"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "langs": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

PAN-X.bn

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAN-X.bn')
  • Descrizione :
The WikiANN dataset (Pan et al. 2017) is a dataset with NER annotations for PER, ORG and LOC. It has been
constructed using the linked entities in Wikipedia pages for 282 different languages including Danish. The dataset
can be loaded with the DaNLP package:
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1000
'train' 10000
'validation' 1000
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "ner_tags": {
        "feature": {
            "num_classes": 7,
            "names": [
                "O",
                "B-PER",
                "I-PER",
                "B-ORG",
                "I-ORG",
                "B-LOC",
                "I-LOC"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "langs": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

PAN-X.de

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAN-X.de')
  • Descrizione :
The WikiANN dataset (Pan et al. 2017) is a dataset with NER annotations for PER, ORG and LOC. It has been
constructed using the linked entities in Wikipedia pages for 282 different languages including Danish. The dataset
can be loaded with the DaNLP package:
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 10000
'train' 20000
'validation' 10000
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "ner_tags": {
        "feature": {
            "num_classes": 7,
            "names": [
                "O",
                "B-PER",
                "I-PER",
                "B-ORG",
                "I-ORG",
                "B-LOC",
                "I-LOC"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "langs": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

PAN-X.el

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAN-X.el')
  • Descrizione :
The WikiANN dataset (Pan et al. 2017) is a dataset with NER annotations for PER, ORG and LOC. It has been
constructed using the linked entities in Wikipedia pages for 282 different languages including Danish. The dataset
can be loaded with the DaNLP package:
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 10000
'train' 20000
'validation' 10000
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "ner_tags": {
        "feature": {
            "num_classes": 7,
            "names": [
                "O",
                "B-PER",
                "I-PER",
                "B-ORG",
                "I-ORG",
                "B-LOC",
                "I-LOC"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "langs": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

PAN-X.en

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAN-X.en')
  • Descrizione :
The WikiANN dataset (Pan et al. 2017) is a dataset with NER annotations for PER, ORG and LOC. It has been
constructed using the linked entities in Wikipedia pages for 282 different languages including Danish. The dataset
can be loaded with the DaNLP package:
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 10000
'train' 20000
'validation' 10000
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "ner_tags": {
        "feature": {
            "num_classes": 7,
            "names": [
                "O",
                "B-PER",
                "I-PER",
                "B-ORG",
                "I-ORG",
                "B-LOC",
                "I-LOC"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "langs": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

PAN-X.es

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAN-X.es')
  • Descrizione :
The WikiANN dataset (Pan et al. 2017) is a dataset with NER annotations for PER, ORG and LOC. It has been
constructed using the linked entities in Wikipedia pages for 282 different languages including Danish. The dataset
can be loaded with the DaNLP package:
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 10000
'train' 20000
'validation' 10000
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "ner_tags": {
        "feature": {
            "num_classes": 7,
            "names": [
                "O",
                "B-PER",
                "I-PER",
                "B-ORG",
                "I-ORG",
                "B-LOC",
                "I-LOC"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "langs": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

PAN-X.et

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAN-X.et')
  • Descrizione :
The WikiANN dataset (Pan et al. 2017) is a dataset with NER annotations for PER, ORG and LOC. It has been
constructed using the linked entities in Wikipedia pages for 282 different languages including Danish. The dataset
can be loaded with the DaNLP package:
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 10000
'train' 15000
'validation' 10000
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "ner_tags": {
        "feature": {
            "num_classes": 7,
            "names": [
                "O",
                "B-PER",
                "I-PER",
                "B-ORG",
                "I-ORG",
                "B-LOC",
                "I-LOC"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "langs": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

PAN-X.eu

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAN-X.eu')
  • Descrizione :
The WikiANN dataset (Pan et al. 2017) is a dataset with NER annotations for PER, ORG and LOC. It has been
constructed using the linked entities in Wikipedia pages for 282 different languages including Danish. The dataset
can be loaded with the DaNLP package:
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 10000
'train' 10000
'validation' 10000
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "ner_tags": {
        "feature": {
            "num_classes": 7,
            "names": [
                "O",
                "B-PER",
                "I-PER",
                "B-ORG",
                "I-ORG",
                "B-LOC",
                "I-LOC"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "langs": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

PAN-X.fa

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAN-X.fa')
  • Descrizione :
The WikiANN dataset (Pan et al. 2017) is a dataset with NER annotations for PER, ORG and LOC. It has been
constructed using the linked entities in Wikipedia pages for 282 different languages including Danish. The dataset
can be loaded with the DaNLP package:
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 10000
'train' 20000
'validation' 10000
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "ner_tags": {
        "feature": {
            "num_classes": 7,
            "names": [
                "O",
                "B-PER",
                "I-PER",
                "B-ORG",
                "I-ORG",
                "B-LOC",
                "I-LOC"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "langs": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

PAN-X.fi

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAN-X.fi')
  • Descrizione :
The WikiANN dataset (Pan et al. 2017) is a dataset with NER annotations for PER, ORG and LOC. It has been
constructed using the linked entities in Wikipedia pages for 282 different languages including Danish. The dataset
can be loaded with the DaNLP package:
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 10000
'train' 20000
'validation' 10000
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "ner_tags": {
        "feature": {
            "num_classes": 7,
            "names": [
                "O",
                "B-PER",
                "I-PER",
                "B-ORG",
                "I-ORG",
                "B-LOC",
                "I-LOC"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "langs": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

PAN-X.fr

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAN-X.fr')
  • Descrizione :
The WikiANN dataset (Pan et al. 2017) is a dataset with NER annotations for PER, ORG and LOC. It has been
constructed using the linked entities in Wikipedia pages for 282 different languages including Danish. The dataset
can be loaded with the DaNLP package:
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 10000
'train' 20000
'validation' 10000
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "ner_tags": {
        "feature": {
            "num_classes": 7,
            "names": [
                "O",
                "B-PER",
                "I-PER",
                "B-ORG",
                "I-ORG",
                "B-LOC",
                "I-LOC"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "langs": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

PAN-X.he

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAN-X.he')
  • Descrizione :
The WikiANN dataset (Pan et al. 2017) is a dataset with NER annotations for PER, ORG and LOC. It has been
constructed using the linked entities in Wikipedia pages for 282 different languages including Danish. The dataset
can be loaded with the DaNLP package:
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 10000
'train' 20000
'validation' 10000
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "ner_tags": {
        "feature": {
            "num_classes": 7,
            "names": [
                "O",
                "B-PER",
                "I-PER",
                "B-ORG",
                "I-ORG",
                "B-LOC",
                "I-LOC"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "langs": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

PAN-X.ciao

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAN-X.hi')
  • Descrizione :
The WikiANN dataset (Pan et al. 2017) is a dataset with NER annotations for PER, ORG and LOC. It has been
constructed using the linked entities in Wikipedia pages for 282 different languages including Danish. The dataset
can be loaded with the DaNLP package:
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1000
'train' 5000
'validation' 1000
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "ner_tags": {
        "feature": {
            "num_classes": 7,
            "names": [
                "O",
                "B-PER",
                "I-PER",
                "B-ORG",
                "I-ORG",
                "B-LOC",
                "I-LOC"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "langs": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

PAN-X.hu

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAN-X.hu')
  • Descrizione :
The WikiANN dataset (Pan et al. 2017) is a dataset with NER annotations for PER, ORG and LOC. It has been
constructed using the linked entities in Wikipedia pages for 282 different languages including Danish. The dataset
can be loaded with the DaNLP package:
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 10000
'train' 20000
'validation' 10000
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "ner_tags": {
        "feature": {
            "num_classes": 7,
            "names": [
                "O",
                "B-PER",
                "I-PER",
                "B-ORG",
                "I-ORG",
                "B-LOC",
                "I-LOC"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "langs": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

PAN-X.id

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAN-X.id')
  • Descrizione :
The WikiANN dataset (Pan et al. 2017) is a dataset with NER annotations for PER, ORG and LOC. It has been
constructed using the linked entities in Wikipedia pages for 282 different languages including Danish. The dataset
can be loaded with the DaNLP package:
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 10000
'train' 20000
'validation' 10000
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "ner_tags": {
        "feature": {
            "num_classes": 7,
            "names": [
                "O",
                "B-PER",
                "I-PER",
                "B-ORG",
                "I-ORG",
                "B-LOC",
                "I-LOC"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "langs": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

PAN-X.it

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAN-X.it')
  • Descrizione :
The WikiANN dataset (Pan et al. 2017) is a dataset with NER annotations for PER, ORG and LOC. It has been
constructed using the linked entities in Wikipedia pages for 282 different languages including Danish. The dataset
can be loaded with the DaNLP package:
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 10000
'train' 20000
'validation' 10000
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "ner_tags": {
        "feature": {
            "num_classes": 7,
            "names": [
                "O",
                "B-PER",
                "I-PER",
                "B-ORG",
                "I-ORG",
                "B-LOC",
                "I-LOC"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "langs": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

PAN-X.ja

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAN-X.ja')
  • Descrizione :
The WikiANN dataset (Pan et al. 2017) is a dataset with NER annotations for PER, ORG and LOC. It has been
constructed using the linked entities in Wikipedia pages for 282 different languages including Danish. The dataset
can be loaded with the DaNLP package:
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 10000
'train' 20000
'validation' 10000
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "ner_tags": {
        "feature": {
            "num_classes": 7,
            "names": [
                "O",
                "B-PER",
                "I-PER",
                "B-ORG",
                "I-ORG",
                "B-LOC",
                "I-LOC"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "langs": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

PAN-X.jv

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAN-X.jv')
  • Descrizione :
The WikiANN dataset (Pan et al. 2017) is a dataset with NER annotations for PER, ORG and LOC. It has been
constructed using the linked entities in Wikipedia pages for 282 different languages including Danish. The dataset
can be loaded with the DaNLP package:
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 100
'train' 100
'validation' 100
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "ner_tags": {
        "feature": {
            "num_classes": 7,
            "names": [
                "O",
                "B-PER",
                "I-PER",
                "B-ORG",
                "I-ORG",
                "B-LOC",
                "I-LOC"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "langs": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

PAN-X.ka

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAN-X.ka')
  • Descrizione :
The WikiANN dataset (Pan et al. 2017) is a dataset with NER annotations for PER, ORG and LOC. It has been
constructed using the linked entities in Wikipedia pages for 282 different languages including Danish. The dataset
can be loaded with the DaNLP package:
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 10000
'train' 10000
'validation' 10000
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "ner_tags": {
        "feature": {
            "num_classes": 7,
            "names": [
                "O",
                "B-PER",
                "I-PER",
                "B-ORG",
                "I-ORG",
                "B-LOC",
                "I-LOC"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "langs": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

PAN-X.kk

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAN-X.kk')
  • Descrizione :
The WikiANN dataset (Pan et al. 2017) is a dataset with NER annotations for PER, ORG and LOC. It has been
constructed using the linked entities in Wikipedia pages for 282 different languages including Danish. The dataset
can be loaded with the DaNLP package:
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1000
'train' 1000
'validation' 1000
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "ner_tags": {
        "feature": {
            "num_classes": 7,
            "names": [
                "O",
                "B-PER",
                "I-PER",
                "B-ORG",
                "I-ORG",
                "B-LOC",
                "I-LOC"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "langs": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

PAN-X.ko

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAN-X.ko')
  • Descrizione :
The WikiANN dataset (Pan et al. 2017) is a dataset with NER annotations for PER, ORG and LOC. It has been
constructed using the linked entities in Wikipedia pages for 282 different languages including Danish. The dataset
can be loaded with the DaNLP package:
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 10000
'train' 20000
'validation' 10000
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "ner_tags": {
        "feature": {
            "num_classes": 7,
            "names": [
                "O",
                "B-PER",
                "I-PER",
                "B-ORG",
                "I-ORG",
                "B-LOC",
                "I-LOC"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "langs": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

PAN-X.ml

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAN-X.ml')
  • Descrizione :
The WikiANN dataset (Pan et al. 2017) is a dataset with NER annotations for PER, ORG and LOC. It has been
constructed using the linked entities in Wikipedia pages for 282 different languages including Danish. The dataset
can be loaded with the DaNLP package:
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1000
'train' 10000
'validation' 1000
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "ner_tags": {
        "feature": {
            "num_classes": 7,
            "names": [
                "O",
                "B-PER",
                "I-PER",
                "B-ORG",
                "I-ORG",
                "B-LOC",
                "I-LOC"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "langs": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

PAN-X.mr

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAN-X.mr')
  • Descrizione :
The WikiANN dataset (Pan et al. 2017) is a dataset with NER annotations for PER, ORG and LOC. It has been
constructed using the linked entities in Wikipedia pages for 282 different languages including Danish. The dataset
can be loaded with the DaNLP package:
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1000
'train' 5000
'validation' 1000
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "ner_tags": {
        "feature": {
            "num_classes": 7,
            "names": [
                "O",
                "B-PER",
                "I-PER",
                "B-ORG",
                "I-ORG",
                "B-LOC",
                "I-LOC"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "langs": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

PAN-X.ms

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAN-X.ms')
  • Descrizione :
The WikiANN dataset (Pan et al. 2017) is a dataset with NER annotations for PER, ORG and LOC. It has been
constructed using the linked entities in Wikipedia pages for 282 different languages including Danish. The dataset
can be loaded with the DaNLP package:
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1000
'train' 20000
'validation' 1000
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "ner_tags": {
        "feature": {
            "num_classes": 7,
            "names": [
                "O",
                "B-PER",
                "I-PER",
                "B-ORG",
                "I-ORG",
                "B-LOC",
                "I-LOC"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "langs": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

PAN-X.my

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAN-X.my')
  • Descrizione :
The WikiANN dataset (Pan et al. 2017) is a dataset with NER annotations for PER, ORG and LOC. It has been
constructed using the linked entities in Wikipedia pages for 282 different languages including Danish. The dataset
can be loaded with the DaNLP package:
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 100
'train' 100
'validation' 100
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "ner_tags": {
        "feature": {
            "num_classes": 7,
            "names": [
                "O",
                "B-PER",
                "I-PER",
                "B-ORG",
                "I-ORG",
                "B-LOC",
                "I-LOC"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "langs": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

PAN-X.nl

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAN-X.nl')
  • Descrizione :
The WikiANN dataset (Pan et al. 2017) is a dataset with NER annotations for PER, ORG and LOC. It has been
constructed using the linked entities in Wikipedia pages for 282 different languages including Danish. The dataset
can be loaded with the DaNLP package:
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 10000
'train' 20000
'validation' 10000
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "ner_tags": {
        "feature": {
            "num_classes": 7,
            "names": [
                "O",
                "B-PER",
                "I-PER",
                "B-ORG",
                "I-ORG",
                "B-LOC",
                "I-LOC"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "langs": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

PAN-X.pt

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAN-X.pt')
  • Descrizione :
The WikiANN dataset (Pan et al. 2017) is a dataset with NER annotations for PER, ORG and LOC. It has been
constructed using the linked entities in Wikipedia pages for 282 different languages including Danish. The dataset
can be loaded with the DaNLP package:
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 10000
'train' 20000
'validation' 10000
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "ner_tags": {
        "feature": {
            "num_classes": 7,
            "names": [
                "O",
                "B-PER",
                "I-PER",
                "B-ORG",
                "I-ORG",
                "B-LOC",
                "I-LOC"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "langs": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

PAN-X.ru

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAN-X.ru')
  • Descrizione :
The WikiANN dataset (Pan et al. 2017) is a dataset with NER annotations for PER, ORG and LOC. It has been
constructed using the linked entities in Wikipedia pages for 282 different languages including Danish. The dataset
can be loaded with the DaNLP package:
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 10000
'train' 20000
'validation' 10000
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "ner_tags": {
        "feature": {
            "num_classes": 7,
            "names": [
                "O",
                "B-PER",
                "I-PER",
                "B-ORG",
                "I-ORG",
                "B-LOC",
                "I-LOC"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "langs": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

PAN-X.sw

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAN-X.sw')
  • Descrizione :
The WikiANN dataset (Pan et al. 2017) is a dataset with NER annotations for PER, ORG and LOC. It has been
constructed using the linked entities in Wikipedia pages for 282 different languages including Danish. The dataset
can be loaded with the DaNLP package:
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1000
'train' 1000
'validation' 1000
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "ner_tags": {
        "feature": {
            "num_classes": 7,
            "names": [
                "O",
                "B-PER",
                "I-PER",
                "B-ORG",
                "I-ORG",
                "B-LOC",
                "I-LOC"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "langs": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

PAN-X.ta

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAN-X.ta')
  • Descrizione :
The WikiANN dataset (Pan et al. 2017) is a dataset with NER annotations for PER, ORG and LOC. It has been
constructed using the linked entities in Wikipedia pages for 282 different languages including Danish. The dataset
can be loaded with the DaNLP package:
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1000
'train' 15000
'validation' 1000
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "ner_tags": {
        "feature": {
            "num_classes": 7,
            "names": [
                "O",
                "B-PER",
                "I-PER",
                "B-ORG",
                "I-ORG",
                "B-LOC",
                "I-LOC"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "langs": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

PAN-X.te

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAN-X.te')
  • Descrizione :
The WikiANN dataset (Pan et al. 2017) is a dataset with NER annotations for PER, ORG and LOC. It has been
constructed using the linked entities in Wikipedia pages for 282 different languages including Danish. The dataset
can be loaded with the DaNLP package:
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1000
'train' 1000
'validation' 1000
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "ner_tags": {
        "feature": {
            "num_classes": 7,
            "names": [
                "O",
                "B-PER",
                "I-PER",
                "B-ORG",
                "I-ORG",
                "B-LOC",
                "I-LOC"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "langs": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

PAN-X.esimo

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAN-X.th')
  • Descrizione :
The WikiANN dataset (Pan et al. 2017) is a dataset with NER annotations for PER, ORG and LOC. It has been
constructed using the linked entities in Wikipedia pages for 282 different languages including Danish. The dataset
can be loaded with the DaNLP package:
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 10000
'train' 20000
'validation' 10000
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "ner_tags": {
        "feature": {
            "num_classes": 7,
            "names": [
                "O",
                "B-PER",
                "I-PER",
                "B-ORG",
                "I-ORG",
                "B-LOC",
                "I-LOC"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "langs": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

PAN-X.tl

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAN-X.tl')
  • Descrizione :
The WikiANN dataset (Pan et al. 2017) is a dataset with NER annotations for PER, ORG and LOC. It has been
constructed using the linked entities in Wikipedia pages for 282 different languages including Danish. The dataset
can be loaded with the DaNLP package:
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1000
'train' 10000
'validation' 1000
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "ner_tags": {
        "feature": {
            "num_classes": 7,
            "names": [
                "O",
                "B-PER",
                "I-PER",
                "B-ORG",
                "I-ORG",
                "B-LOC",
                "I-LOC"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "langs": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

PAN-X.tr

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAN-X.tr')
  • Descrizione :
The WikiANN dataset (Pan et al. 2017) is a dataset with NER annotations for PER, ORG and LOC. It has been
constructed using the linked entities in Wikipedia pages for 282 different languages including Danish. The dataset
can be loaded with the DaNLP package:
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 10000
'train' 20000
'validation' 10000
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "ner_tags": {
        "feature": {
            "num_classes": 7,
            "names": [
                "O",
                "B-PER",
                "I-PER",
                "B-ORG",
                "I-ORG",
                "B-LOC",
                "I-LOC"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "langs": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

PAN-X.ur

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAN-X.ur')
  • Descrizione :
The WikiANN dataset (Pan et al. 2017) is a dataset with NER annotations for PER, ORG and LOC. It has been
constructed using the linked entities in Wikipedia pages for 282 different languages including Danish. The dataset
can be loaded with the DaNLP package:
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1000
'train' 20000
'validation' 1000
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "ner_tags": {
        "feature": {
            "num_classes": 7,
            "names": [
                "O",
                "B-PER",
                "I-PER",
                "B-ORG",
                "I-ORG",
                "B-LOC",
                "I-LOC"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "langs": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

PAN-X.vi

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAN-X.vi')
  • Descrizione :
The WikiANN dataset (Pan et al. 2017) is a dataset with NER annotations for PER, ORG and LOC. It has been
constructed using the linked entities in Wikipedia pages for 282 different languages including Danish. The dataset
can be loaded with the DaNLP package:
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 10000
'train' 20000
'validation' 10000
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "ner_tags": {
        "feature": {
            "num_classes": 7,
            "names": [
                "O",
                "B-PER",
                "I-PER",
                "B-ORG",
                "I-ORG",
                "B-LOC",
                "I-LOC"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "langs": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

PAN-X.yo

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAN-X.yo')
  • Descrizione :
The WikiANN dataset (Pan et al. 2017) is a dataset with NER annotations for PER, ORG and LOC. It has been
constructed using the linked entities in Wikipedia pages for 282 different languages including Danish. The dataset
can be loaded with the DaNLP package:
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 100
'train' 100
'validation' 100
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "ner_tags": {
        "feature": {
            "num_classes": 7,
            "names": [
                "O",
                "B-PER",
                "I-PER",
                "B-ORG",
                "I-ORG",
                "B-LOC",
                "I-LOC"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "langs": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

PAN-X.zh

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAN-X.zh')
  • Descrizione :
The WikiANN dataset (Pan et al. 2017) is a dataset with NER annotations for PER, ORG and LOC. It has been
constructed using the linked entities in Wikipedia pages for 282 different languages including Danish. The dataset
can be loaded with the DaNLP package:
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 10000
'train' 20000
'validation' 10000
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "ner_tags": {
        "feature": {
            "num_classes": 7,
            "names": [
                "O",
                "B-PER",
                "I-PER",
                "B-ORG",
                "I-ORG",
                "B-LOC",
                "I-LOC"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "langs": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.ar.ar

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.ar.ar')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 5335
'validation' 517
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.ar.de

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.ar.de')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1649
'validation' 207
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.ar.vi

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.ar.vi')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 2047
'validation' 163
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.ar.zh

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.ar.zh')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1912
'validation' 188
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.ar.en

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.ar.en')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 5335
'validation' 517
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.ar.es

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.ar.es')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1978
'validation' 161
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.ar.hi

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.ar.hi')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1831
'validation' 186
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.de.ar

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.de.ar')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1649
'validation' 207
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.de.de

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.de.de')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 4517
'validation' 512
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.de.vi

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.de.vi')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1675
'validation' 182
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.de.zh

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.de.zh')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1621
'validation' 190
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.de.en

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.de.en')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 4517
'validation' 512
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.de.es

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.de.es')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1776
'validation' 196
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.de.hi

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.de.hi')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1430
'validation' 163
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.vi.ar

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.vi.ar')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 2047
'validation' 163
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.vi.de

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.vi.de')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1675
'validation' 182
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.vi.vi

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.vi.vi')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 5495
'validation' 511
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.vi.zh

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.vi.zh')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1943
'validation' 184
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.vi.en

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.vi.en')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 5495
'validation' 511
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.vi.es

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.vi.es')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 2018
'validation' 189
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.vi.hi

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.vi.hi')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1947
'validation' 177
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.zh.ar

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.zh.ar')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1912
'validation' 188
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.zh.de

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.zh.de')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1621
'validation' 190
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.zh.vi

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.zh.vi')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1943
'validation' 184
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.zh.zh

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.zh.zh')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 5137
'validation' 504
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.zh.en

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.zh.en')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 5137
'validation' 504
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.zh.es

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.zh.es')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1947
'validation' 161
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.zh.hi

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.zh.hi')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1767
'validation' 189
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.en.ar

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.en.ar')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 5335
'validation' 517
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.en.de

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.en.de')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 4517
'validation' 512
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.en.vi

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.en.vi')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 5495
'validation' 511
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.en.zh

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.en.zh')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 5137
'validation' 504
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.en.en

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.en.en')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 11590
'validation' 1148
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.en.es

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.en.es')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 5253
'validation' 500
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.en.hi

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.en.hi')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 4918
'validation' 507
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.es.ar

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.es.ar')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1978
'validation' 161
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.es.de

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.es.de')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1776
'validation' 196
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.es.vi

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.es.vi')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 2018
'validation' 189
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.es.zh

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.es.zh')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1947
'validation' 161
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.es.en

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.es.en')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 5253
'validation' 500
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.es.es

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.es.es')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 5253
'validation' 500
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.es.hi

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.es.hi')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1723
'validation' 187
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.hi.ar

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.hi.ar')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1831
'validation' 186
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.hi.de

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.hi.de')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1430
'validation' 163
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.hi.vi

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.hi.vi')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1947
'validation' 177
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.hi.zh

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.hi.zh')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1767
'validation' 189
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.hi.en

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.hi.en')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 4918
'validation' 507
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.hi.es

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.hi.es')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1723
'validation' 187
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

MLQA.ciao.ciao

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/MLQA.hi.hi')
  • Descrizione :
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between
4 different languages on average.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 4918
'validation' 507
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

XQuAD.ar

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/XQuAD.ar')
  • Descrizione :
XQuAD (Cross-lingual Question Answering Dataset) is a benchmark dataset for evaluating cross-lingual question
answering performance. The dataset consists of a subset of 240 paragraphs and 1190 question-answer pairs from
the development set of SQuAD v1.1 (Rajpurkar et al., 2016) together with their professional translations into
ten languages: Spanish, German, Greek, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, and Hindi. Consequently,
the dataset is entirely parallel across 11 languages.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 1190
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

XQuAD.de

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/XQuAD.de')
  • Descrizione :
XQuAD (Cross-lingual Question Answering Dataset) is a benchmark dataset for evaluating cross-lingual question
answering performance. The dataset consists of a subset of 240 paragraphs and 1190 question-answer pairs from
the development set of SQuAD v1.1 (Rajpurkar et al., 2016) together with their professional translations into
ten languages: Spanish, German, Greek, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, and Hindi. Consequently,
the dataset is entirely parallel across 11 languages.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 1190
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

XQuAD.vi

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/XQuAD.vi')
  • Descrizione :
XQuAD (Cross-lingual Question Answering Dataset) is a benchmark dataset for evaluating cross-lingual question
answering performance. The dataset consists of a subset of 240 paragraphs and 1190 question-answer pairs from
the development set of SQuAD v1.1 (Rajpurkar et al., 2016) together with their professional translations into
ten languages: Spanish, German, Greek, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, and Hindi. Consequently,
the dataset is entirely parallel across 11 languages.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 1190
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

XQuAD.zh

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/XQuAD.zh')
  • Descrizione :
XQuAD (Cross-lingual Question Answering Dataset) is a benchmark dataset for evaluating cross-lingual question
answering performance. The dataset consists of a subset of 240 paragraphs and 1190 question-answer pairs from
the development set of SQuAD v1.1 (Rajpurkar et al., 2016) together with their professional translations into
ten languages: Spanish, German, Greek, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, and Hindi. Consequently,
the dataset is entirely parallel across 11 languages.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 1190
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

XQuAD.en

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/XQuAD.en')
  • Descrizione :
XQuAD (Cross-lingual Question Answering Dataset) is a benchmark dataset for evaluating cross-lingual question
answering performance. The dataset consists of a subset of 240 paragraphs and 1190 question-answer pairs from
the development set of SQuAD v1.1 (Rajpurkar et al., 2016) together with their professional translations into
ten languages: Spanish, German, Greek, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, and Hindi. Consequently,
the dataset is entirely parallel across 11 languages.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 1190
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

XQuAD.es

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/XQuAD.es')
  • Descrizione :
XQuAD (Cross-lingual Question Answering Dataset) is a benchmark dataset for evaluating cross-lingual question
answering performance. The dataset consists of a subset of 240 paragraphs and 1190 question-answer pairs from
the development set of SQuAD v1.1 (Rajpurkar et al., 2016) together with their professional translations into
ten languages: Spanish, German, Greek, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, and Hindi. Consequently,
the dataset is entirely parallel across 11 languages.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 1190
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

XQuAD.ciao

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/XQuAD.hi')
  • Descrizione :
XQuAD (Cross-lingual Question Answering Dataset) is a benchmark dataset for evaluating cross-lingual question
answering performance. The dataset consists of a subset of 240 paragraphs and 1190 question-answer pairs from
the development set of SQuAD v1.1 (Rajpurkar et al., 2016) together with their professional translations into
ten languages: Spanish, German, Greek, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, and Hindi. Consequently,
the dataset is entirely parallel across 11 languages.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 1190
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

XQuAD.el

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/XQuAD.el')
  • Descrizione :
XQuAD (Cross-lingual Question Answering Dataset) is a benchmark dataset for evaluating cross-lingual question
answering performance. The dataset consists of a subset of 240 paragraphs and 1190 question-answer pairs from
the development set of SQuAD v1.1 (Rajpurkar et al., 2016) together with their professional translations into
ten languages: Spanish, German, Greek, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, and Hindi. Consequently,
the dataset is entirely parallel across 11 languages.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 1190
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

XQuAD.ru

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/XQuAD.ru')
  • Descrizione :
XQuAD (Cross-lingual Question Answering Dataset) is a benchmark dataset for evaluating cross-lingual question
answering performance. The dataset consists of a subset of 240 paragraphs and 1190 question-answer pairs from
the development set of SQuAD v1.1 (Rajpurkar et al., 2016) together with their professional translations into
ten languages: Spanish, German, Greek, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, and Hindi. Consequently,
the dataset is entirely parallel across 11 languages.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 1190
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

XQuAD.th

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/XQuAD.th')
  • Descrizione :
XQuAD (Cross-lingual Question Answering Dataset) is a benchmark dataset for evaluating cross-lingual question
answering performance. The dataset consists of a subset of 240 paragraphs and 1190 question-answer pairs from
the development set of SQuAD v1.1 (Rajpurkar et al., 2016) together with their professional translations into
ten languages: Spanish, German, Greek, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, and Hindi. Consequently,
the dataset is entirely parallel across 11 languages.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 1190
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

XQuAD.tr

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/XQuAD.tr')
  • Descrizione :
XQuAD (Cross-lingual Question Answering Dataset) is a benchmark dataset for evaluating cross-lingual question
answering performance. The dataset consists of a subset of 240 paragraphs and 1190 question-answer pairs from
the development set of SQuAD v1.1 (Rajpurkar et al., 2016) together with their professional translations into
ten languages: Spanish, German, Greek, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, and Hindi. Consequently,
the dataset is entirely parallel across 11 languages.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 1190
  • Caratteristiche :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "context": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "question": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "answers": {
        "feature": {
            "answer_start": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

bucc18.de

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/bucc18.de')
  • Descrizione :
Building and Using Comparable Corpora

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 9580
'validation' 1038
  • Caratteristiche :
{
    "source_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "source_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

bucc18.fr

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/bucc18.fr')
  • Descrizione :
Building and Using Comparable Corpora

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 9086
'validation' 929
  • Caratteristiche :
{
    "source_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "source_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

bucc18.zh

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/bucc18.zh')
  • Descrizione :
Building and Using Comparable Corpora

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1899
'validation' 257
  • Caratteristiche :
{
    "source_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "source_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

bucc18.ru

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/bucc18.ru')
  • Descrizione :
Building and Using Comparable Corpora

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 14435
'validation' 2374
  • Caratteristiche :
{
    "source_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "source_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

PAWS-X.de

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAWS-X.de')
  • Descrizione :
This dataset contains 23,659 human translated PAWS evaluation pairs and 296,406 machine translated training
pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean. All
translated pairs are sourced from examples in PAWS-Wiki.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 2000
'train' 49380
'validation' 2000
  • Caratteristiche :
{
    "sentence1": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence2": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "label": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

PAWS-X.it

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAWS-X.en')
  • Descrizione :
This dataset contains 23,659 human translated PAWS evaluation pairs and 296,406 machine translated training
pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean. All
translated pairs are sourced from examples in PAWS-Wiki.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 2000
'train' 49175
'validation' 2000
  • Caratteristiche :
{
    "sentence1": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence2": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "label": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

PAWS-X.es

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAWS-X.es')
  • Descrizione :
This dataset contains 23,659 human translated PAWS evaluation pairs and 296,406 machine translated training
pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean. All
translated pairs are sourced from examples in PAWS-Wiki.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 2000
'train' 49401
'validation' 1961
  • Caratteristiche :
{
    "sentence1": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence2": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "label": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

PAWS-X.fr

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAWS-X.fr')
  • Descrizione :
This dataset contains 23,659 human translated PAWS evaluation pairs and 296,406 machine translated training
pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean. All
translated pairs are sourced from examples in PAWS-Wiki.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 2000
'train' 49399
'validation' 1988
  • Caratteristiche :
{
    "sentence1": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence2": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "label": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

PAWS-X.ja

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAWS-X.ja')
  • Descrizione :
This dataset contains 23,659 human translated PAWS evaluation pairs and 296,406 machine translated training
pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean. All
translated pairs are sourced from examples in PAWS-Wiki.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 2000
'train' 49401
'validation' 2000
  • Caratteristiche :
{
    "sentence1": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence2": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "label": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

PAWS-X.ko

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAWS-X.ko')
  • Descrizione :
This dataset contains 23,659 human translated PAWS evaluation pairs and 296,406 machine translated training
pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean. All
translated pairs are sourced from examples in PAWS-Wiki.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1999
'train' 49164
'validation' 2000
  • Caratteristiche :
{
    "sentence1": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence2": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "label": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

ZAMPE-X.zh

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/PAWS-X.zh')
  • Descrizione :
This dataset contains 23,659 human translated PAWS evaluation pairs and 296,406 machine translated training
pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean. All
translated pairs are sourced from examples in PAWS-Wiki.
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 2000
'train' 49401
'validation' 2000
  • Caratteristiche :
{
    "sentence1": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence2": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "label": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

tatoeba.afr

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/tatoeba.afr')
  • Descrizione :
his data is extracted from the Tatoeba corpus, dated Saturday 2018/11/17.
For each languages, we have selected 1000 English sentences and their translations, if available. Please check
this paper for a description of the languages, their families and scripts as well as baseline results.
Please note that the English sentences are not identical for all language pairs. This means that the results are
not directly comparable across languages. In particular, the sentences tend to have less variety for several
low-resource languages, e.g. "Tom needed water", "Tom needs water", "Tom is getting water", ...

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 1000
  • Caratteristiche :
{
    "source_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "source_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

tatoeba.ara

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/tatoeba.ara')
  • Descrizione :
his data is extracted from the Tatoeba corpus, dated Saturday 2018/11/17.
For each languages, we have selected 1000 English sentences and their translations, if available. Please check
this paper for a description of the languages, their families and scripts as well as baseline results.
Please note that the English sentences are not identical for all language pairs. This means that the results are
not directly comparable across languages. In particular, the sentences tend to have less variety for several
low-resource languages, e.g. "Tom needed water", "Tom needs water", "Tom is getting water", ...

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 1000
  • Caratteristiche :
{
    "source_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "source_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

tatoeba.ben

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/tatoeba.ben')
  • Descrizione :
his data is extracted from the Tatoeba corpus, dated Saturday 2018/11/17.
For each languages, we have selected 1000 English sentences and their translations, if available. Please check
this paper for a description of the languages, their families and scripts as well as baseline results.
Please note that the English sentences are not identical for all language pairs. This means that the results are
not directly comparable across languages. In particular, the sentences tend to have less variety for several
low-resource languages, e.g. "Tom needed water", "Tom needs water", "Tom is getting water", ...

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 1000
  • Caratteristiche :
{
    "source_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "source_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

tatoeba.bul

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/tatoeba.bul')
  • Descrizione :
his data is extracted from the Tatoeba corpus, dated Saturday 2018/11/17.
For each languages, we have selected 1000 English sentences and their translations, if available. Please check
this paper for a description of the languages, their families and scripts as well as baseline results.
Please note that the English sentences are not identical for all language pairs. This means that the results are
not directly comparable across languages. In particular, the sentences tend to have less variety for several
low-resource languages, e.g. "Tom needed water", "Tom needs water", "Tom is getting water", ...

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 1000
  • Caratteristiche :
{
    "source_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "source_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

tatoeba.deu

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/tatoeba.deu')
  • Descrizione :
his data is extracted from the Tatoeba corpus, dated Saturday 2018/11/17.
For each languages, we have selected 1000 English sentences and their translations, if available. Please check
this paper for a description of the languages, their families and scripts as well as baseline results.
Please note that the English sentences are not identical for all language pairs. This means that the results are
not directly comparable across languages. In particular, the sentences tend to have less variety for several
low-resource languages, e.g. "Tom needed water", "Tom needs water", "Tom is getting water", ...

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 1000
  • Caratteristiche :
{
    "source_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "source_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

tatoeba.cmn

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/tatoeba.cmn')
  • Descrizione :
his data is extracted from the Tatoeba corpus, dated Saturday 2018/11/17.
For each languages, we have selected 1000 English sentences and their translations, if available. Please check
this paper for a description of the languages, their families and scripts as well as baseline results.
Please note that the English sentences are not identical for all language pairs. This means that the results are
not directly comparable across languages. In particular, the sentences tend to have less variety for several
low-resource languages, e.g. "Tom needed water", "Tom needs water", "Tom is getting water", ...

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 1000
  • Caratteristiche :
{
    "source_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "source_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

tatoeba.ell

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/tatoeba.ell')
  • Descrizione :
his data is extracted from the Tatoeba corpus, dated Saturday 2018/11/17.
For each languages, we have selected 1000 English sentences and their translations, if available. Please check
this paper for a description of the languages, their families and scripts as well as baseline results.
Please note that the English sentences are not identical for all language pairs. This means that the results are
not directly comparable across languages. In particular, the sentences tend to have less variety for several
low-resource languages, e.g. "Tom needed water", "Tom needs water", "Tom is getting water", ...

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 1000
  • Caratteristiche :
{
    "source_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "source_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

tatoeba.est

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/tatoeba.est')
  • Descrizione :
his data is extracted from the Tatoeba corpus, dated Saturday 2018/11/17.
For each languages, we have selected 1000 English sentences and their translations, if available. Please check
this paper for a description of the languages, their families and scripts as well as baseline results.
Please note that the English sentences are not identical for all language pairs. This means that the results are
not directly comparable across languages. In particular, the sentences tend to have less variety for several
low-resource languages, e.g. "Tom needed water", "Tom needs water", "Tom is getting water", ...

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 1000
  • Caratteristiche :
{
    "source_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "source_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

tatoeba.eus

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/tatoeba.eus')
  • Descrizione :
his data is extracted from the Tatoeba corpus, dated Saturday 2018/11/17.
For each languages, we have selected 1000 English sentences and their translations, if available. Please check
this paper for a description of the languages, their families and scripts as well as baseline results.
Please note that the English sentences are not identical for all language pairs. This means that the results are
not directly comparable across languages. In particular, the sentences tend to have less variety for several
low-resource languages, e.g. "Tom needed water", "Tom needs water", "Tom is getting water", ...

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 1000
  • Caratteristiche :
{
    "source_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "source_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

tatoeba.fin

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/tatoeba.fin')
  • Descrizione :
his data is extracted from the Tatoeba corpus, dated Saturday 2018/11/17.
For each languages, we have selected 1000 English sentences and their translations, if available. Please check
this paper for a description of the languages, their families and scripts as well as baseline results.
Please note that the English sentences are not identical for all language pairs. This means that the results are
not directly comparable across languages. In particular, the sentences tend to have less variety for several
low-resource languages, e.g. "Tom needed water", "Tom needs water", "Tom is getting water", ...

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 1000
  • Caratteristiche :
{
    "source_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "source_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

tatoeba.fra

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/tatoeba.fra')
  • Descrizione :
his data is extracted from the Tatoeba corpus, dated Saturday 2018/11/17.
For each languages, we have selected 1000 English sentences and their translations, if available. Please check
this paper for a description of the languages, their families and scripts as well as baseline results.
Please note that the English sentences are not identical for all language pairs. This means that the results are
not directly comparable across languages. In particular, the sentences tend to have less variety for several
low-resource languages, e.g. "Tom needed water", "Tom needs water", "Tom is getting water", ...

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 1000
  • Caratteristiche :
{
    "source_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "source_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

tatoeba.heb

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/tatoeba.heb')
  • Descrizione :
his data is extracted from the Tatoeba corpus, dated Saturday 2018/11/17.
For each languages, we have selected 1000 English sentences and their translations, if available. Please check
this paper for a description of the languages, their families and scripts as well as baseline results.
Please note that the English sentences are not identical for all language pairs. This means that the results are
not directly comparable across languages. In particular, the sentences tend to have less variety for several
low-resource languages, e.g. "Tom needed water", "Tom needs water", "Tom is getting water", ...

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 1000
  • Caratteristiche :
{
    "source_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "source_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

tatoeba.hin

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/tatoeba.hin')
  • Descrizione :
his data is extracted from the Tatoeba corpus, dated Saturday 2018/11/17.
For each languages, we have selected 1000 English sentences and their translations, if available. Please check
this paper for a description of the languages, their families and scripts as well as baseline results.
Please note that the English sentences are not identical for all language pairs. This means that the results are
not directly comparable across languages. In particular, the sentences tend to have less variety for several
low-resource languages, e.g. "Tom needed water", "Tom needs water", "Tom is getting water", ...

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 1000
  • Caratteristiche :
{
    "source_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "source_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

tatoeba.hun

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/tatoeba.hun')
  • Descrizione :
his data is extracted from the Tatoeba corpus, dated Saturday 2018/11/17.
For each languages, we have selected 1000 English sentences and their translations, if available. Please check
this paper for a description of the languages, their families and scripts as well as baseline results.
Please note that the English sentences are not identical for all language pairs. This means that the results are
not directly comparable across languages. In particular, the sentences tend to have less variety for several
low-resource languages, e.g. "Tom needed water", "Tom needs water", "Tom is getting water", ...

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 1000
  • Caratteristiche :
{
    "source_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "source_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

tatoeba.ind

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/tatoeba.ind')
  • Descrizione :
his data is extracted from the Tatoeba corpus, dated Saturday 2018/11/17.
For each languages, we have selected 1000 English sentences and their translations, if available. Please check
this paper for a description of the languages, their families and scripts as well as baseline results.
Please note that the English sentences are not identical for all language pairs. This means that the results are
not directly comparable across languages. In particular, the sentences tend to have less variety for several
low-resource languages, e.g. "Tom needed water", "Tom needs water", "Tom is getting water", ...

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 1000
  • Caratteristiche :
{
    "source_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "source_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

tatoeba.ita

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/tatoeba.ita')
  • Descrizione :
his data is extracted from the Tatoeba corpus, dated Saturday 2018/11/17.
For each languages, we have selected 1000 English sentences and their translations, if available. Please check
this paper for a description of the languages, their families and scripts as well as baseline results.
Please note that the English sentences are not identical for all language pairs. This means that the results are
not directly comparable across languages. In particular, the sentences tend to have less variety for several
low-resource languages, e.g. "Tom needed water", "Tom needs water", "Tom is getting water", ...

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 1000
  • Caratteristiche :
{
    "source_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "source_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

tatoeba.jav

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/tatoeba.jav')
  • Descrizione :
his data is extracted from the Tatoeba corpus, dated Saturday 2018/11/17.
For each languages, we have selected 1000 English sentences and their translations, if available. Please check
this paper for a description of the languages, their families and scripts as well as baseline results.
Please note that the English sentences are not identical for all language pairs. This means that the results are
not directly comparable across languages. In particular, the sentences tend to have less variety for several
low-resource languages, e.g. "Tom needed water", "Tom needs water", "Tom is getting water", ...

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 205
  • Caratteristiche :
{
    "source_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "source_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

tatoeba.jpn

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/tatoeba.jpn')
  • Descrizione :
his data is extracted from the Tatoeba corpus, dated Saturday 2018/11/17.
For each languages, we have selected 1000 English sentences and their translations, if available. Please check
this paper for a description of the languages, their families and scripts as well as baseline results.
Please note that the English sentences are not identical for all language pairs. This means that the results are
not directly comparable across languages. In particular, the sentences tend to have less variety for several
low-resource languages, e.g. "Tom needed water", "Tom needs water", "Tom is getting water", ...

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 1000
  • Caratteristiche :
{
    "source_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "source_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

tatoeba.kat

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/tatoeba.kat')
  • Descrizione :
his data is extracted from the Tatoeba corpus, dated Saturday 2018/11/17.
For each languages, we have selected 1000 English sentences and their translations, if available. Please check
this paper for a description of the languages, their families and scripts as well as baseline results.
Please note that the English sentences are not identical for all language pairs. This means that the results are
not directly comparable across languages. In particular, the sentences tend to have less variety for several
low-resource languages, e.g. "Tom needed water", "Tom needs water", "Tom is getting water", ...

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 746
  • Caratteristiche :
{
    "source_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "source_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

tatoeba.kaz

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/tatoeba.kaz')
  • Descrizione :
his data is extracted from the Tatoeba corpus, dated Saturday 2018/11/17.
For each languages, we have selected 1000 English sentences and their translations, if available. Please check
this paper for a description of the languages, their families and scripts as well as baseline results.
Please note that the English sentences are not identical for all language pairs. This means that the results are
not directly comparable across languages. In particular, the sentences tend to have less variety for several
low-resource languages, e.g. "Tom needed water", "Tom needs water", "Tom is getting water", ...

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 575
  • Caratteristiche :
{
    "source_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "source_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

tatoeba.kor

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/tatoeba.kor')
  • Descrizione :
his data is extracted from the Tatoeba corpus, dated Saturday 2018/11/17.
For each languages, we have selected 1000 English sentences and their translations, if available. Please check
this paper for a description of the languages, their families and scripts as well as baseline results.
Please note that the English sentences are not identical for all language pairs. This means that the results are
not directly comparable across languages. In particular, the sentences tend to have less variety for several
low-resource languages, e.g. "Tom needed water", "Tom needs water", "Tom is getting water", ...

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 1000
  • Caratteristiche :
{
    "source_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "source_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

tatoeba.mal

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/tatoeba.mal')
  • Descrizione :
his data is extracted from the Tatoeba corpus, dated Saturday 2018/11/17.
For each languages, we have selected 1000 English sentences and their translations, if available. Please check
this paper for a description of the languages, their families and scripts as well as baseline results.
Please note that the English sentences are not identical for all language pairs. This means that the results are
not directly comparable across languages. In particular, the sentences tend to have less variety for several
low-resource languages, e.g. "Tom needed water", "Tom needs water", "Tom is getting water", ...

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 687
  • Caratteristiche :
{
    "source_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "source_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

tatoeba.mar

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/tatoeba.mar')
  • Descrizione :
his data is extracted from the Tatoeba corpus, dated Saturday 2018/11/17.
For each languages, we have selected 1000 English sentences and their translations, if available. Please check
this paper for a description of the languages, their families and scripts as well as baseline results.
Please note that the English sentences are not identical for all language pairs. This means that the results are
not directly comparable across languages. In particular, the sentences tend to have less variety for several
low-resource languages, e.g. "Tom needed water", "Tom needs water", "Tom is getting water", ...

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 1000
  • Caratteristiche :
{
    "source_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "source_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

tatoeba.nld

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/tatoeba.nld')
  • Descrizione :
his data is extracted from the Tatoeba corpus, dated Saturday 2018/11/17.
For each languages, we have selected 1000 English sentences and their translations, if available. Please check
this paper for a description of the languages, their families and scripts as well as baseline results.
Please note that the English sentences are not identical for all language pairs. This means that the results are
not directly comparable across languages. In particular, the sentences tend to have less variety for several
low-resource languages, e.g. "Tom needed water", "Tom needs water", "Tom is getting water", ...

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 1000
  • Caratteristiche :
{
    "source_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "source_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

tatoeba.pes

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/tatoeba.pes')
  • Descrizione :
his data is extracted from the Tatoeba corpus, dated Saturday 2018/11/17.
For each languages, we have selected 1000 English sentences and their translations, if available. Please check
this paper for a description of the languages, their families and scripts as well as baseline results.
Please note that the English sentences are not identical for all language pairs. This means that the results are
not directly comparable across languages. In particular, the sentences tend to have less variety for several
low-resource languages, e.g. "Tom needed water", "Tom needs water", "Tom is getting water", ...

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 1000
  • Caratteristiche :
{
    "source_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "source_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

tatoeba.por

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/tatoeba.por')
  • Descrizione :
his data is extracted from the Tatoeba corpus, dated Saturday 2018/11/17.
For each languages, we have selected 1000 English sentences and their translations, if available. Please check
this paper for a description of the languages, their families and scripts as well as baseline results.
Please note that the English sentences are not identical for all language pairs. This means that the results are
not directly comparable across languages. In particular, the sentences tend to have less variety for several
low-resource languages, e.g. "Tom needed water", "Tom needs water", "Tom is getting water", ...

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 1000
  • Caratteristiche :
{
    "source_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "source_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

tatoeba.rus

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/tatoeba.rus')
  • Descrizione :
his data is extracted from the Tatoeba corpus, dated Saturday 2018/11/17.
For each languages, we have selected 1000 English sentences and their translations, if available. Please check
this paper for a description of the languages, their families and scripts as well as baseline results.
Please note that the English sentences are not identical for all language pairs. This means that the results are
not directly comparable across languages. In particular, the sentences tend to have less variety for several
low-resource languages, e.g. "Tom needed water", "Tom needs water", "Tom is getting water", ...

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 1000
  • Caratteristiche :
{
    "source_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "source_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

tatoeba.spa

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/tatoeba.spa')
  • Descrizione :
his data is extracted from the Tatoeba corpus, dated Saturday 2018/11/17.
For each languages, we have selected 1000 English sentences and their translations, if available. Please check
this paper for a description of the languages, their families and scripts as well as baseline results.
Please note that the English sentences are not identical for all language pairs. This means that the results are
not directly comparable across languages. In particular, the sentences tend to have less variety for several
low-resource languages, e.g. "Tom needed water", "Tom needs water", "Tom is getting water", ...

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 1000
  • Caratteristiche :
{
    "source_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "source_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

tatoeba.swh

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/tatoeba.swh')
  • Descrizione :
his data is extracted from the Tatoeba corpus, dated Saturday 2018/11/17.
For each languages, we have selected 1000 English sentences and their translations, if available. Please check
this paper for a description of the languages, their families and scripts as well as baseline results.
Please note that the English sentences are not identical for all language pairs. This means that the results are
not directly comparable across languages. In particular, the sentences tend to have less variety for several
low-resource languages, e.g. "Tom needed water", "Tom needs water", "Tom is getting water", ...

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 390
  • Caratteristiche :
{
    "source_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "source_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

tatoeba.tam

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/tatoeba.tam')
  • Descrizione :
his data is extracted from the Tatoeba corpus, dated Saturday 2018/11/17.
For each languages, we have selected 1000 English sentences and their translations, if available. Please check
this paper for a description of the languages, their families and scripts as well as baseline results.
Please note that the English sentences are not identical for all language pairs. This means that the results are
not directly comparable across languages. In particular, the sentences tend to have less variety for several
low-resource languages, e.g. "Tom needed water", "Tom needs water", "Tom is getting water", ...

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 307
  • Caratteristiche :
{
    "source_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "source_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

tatoeba.tel

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/tatoeba.tel')
  • Descrizione :
his data is extracted from the Tatoeba corpus, dated Saturday 2018/11/17.
For each languages, we have selected 1000 English sentences and their translations, if available. Please check
this paper for a description of the languages, their families and scripts as well as baseline results.
Please note that the English sentences are not identical for all language pairs. This means that the results are
not directly comparable across languages. In particular, the sentences tend to have less variety for several
low-resource languages, e.g. "Tom needed water", "Tom needs water", "Tom is getting water", ...

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 234
  • Caratteristiche :
{
    "source_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "source_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

tatoeba.tgl

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/tatoeba.tgl')
  • Descrizione :
his data is extracted from the Tatoeba corpus, dated Saturday 2018/11/17.
For each languages, we have selected 1000 English sentences and their translations, if available. Please check
this paper for a description of the languages, their families and scripts as well as baseline results.
Please note that the English sentences are not identical for all language pairs. This means that the results are
not directly comparable across languages. In particular, the sentences tend to have less variety for several
low-resource languages, e.g. "Tom needed water", "Tom needs water", "Tom is getting water", ...

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 1000
  • Caratteristiche :
{
    "source_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "source_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

tatoeba.tha

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/tatoeba.tha')
  • Descrizione :
his data is extracted from the Tatoeba corpus, dated Saturday 2018/11/17.
For each languages, we have selected 1000 English sentences and their translations, if available. Please check
this paper for a description of the languages, their families and scripts as well as baseline results.
Please note that the English sentences are not identical for all language pairs. This means that the results are
not directly comparable across languages. In particular, the sentences tend to have less variety for several
low-resource languages, e.g. "Tom needed water", "Tom needs water", "Tom is getting water", ...

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 548
  • Caratteristiche :
{
    "source_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "source_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

tatoeba.tur

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/tatoeba.tur')
  • Descrizione :
his data is extracted from the Tatoeba corpus, dated Saturday 2018/11/17.
For each languages, we have selected 1000 English sentences and their translations, if available. Please check
this paper for a description of the languages, their families and scripts as well as baseline results.
Please note that the English sentences are not identical for all language pairs. This means that the results are
not directly comparable across languages. In particular, the sentences tend to have less variety for several
low-resource languages, e.g. "Tom needed water", "Tom needs water", "Tom is getting water", ...

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 1000
  • Caratteristiche :
{
    "source_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "source_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

tatoeba.urd

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/tatoeba.urd')
  • Descrizione :
his data is extracted from the Tatoeba corpus, dated Saturday 2018/11/17.
For each languages, we have selected 1000 English sentences and their translations, if available. Please check
this paper for a description of the languages, their families and scripts as well as baseline results.
Please note that the English sentences are not identical for all language pairs. This means that the results are
not directly comparable across languages. In particular, the sentences tend to have less variety for several
low-resource languages, e.g. "Tom needed water", "Tom needs water", "Tom is getting water", ...

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 1000
  • Caratteristiche :
{
    "source_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "source_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

tatoeba.vie

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/tatoeba.vie')
  • Descrizione :
his data is extracted from the Tatoeba corpus, dated Saturday 2018/11/17.
For each languages, we have selected 1000 English sentences and their translations, if available. Please check
this paper for a description of the languages, their families and scripts as well as baseline results.
Please note that the English sentences are not identical for all language pairs. This means that the results are
not directly comparable across languages. In particular, the sentences tend to have less variety for several
low-resource languages, e.g. "Tom needed water", "Tom needs water", "Tom is getting water", ...

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'validation' 1000
  • Caratteristiche :
{
    "source_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "source_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "target_lang": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

udpos.Afrikaans

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/udpos.Afrikaans')
  • Descrizione :
Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological
features, and syntactic dependencies) across different human languages. UD is an open community effort with over 200
contributors producing more than 100 treebanks in over 70 languages. If you’re new to UD, you should start by reading
the first part of the Short Introduction and then browsing the annotation guidelines.

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 425
'train' 1315
'validation' 194
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "pos_tags": {
        "feature": {
            "num_classes": 17,
            "names": [
                "ADJ",
                "ADP",
                "ADV",
                "AUX",
                "CCONJ",
                "DET",
                "INTJ",
                "NOUN",
                "NUM",
                "PART",
                "PRON",
                "PROPN",
                "PUNCT",
                "SCONJ",
                "SYM",
                "VERB",
                "X"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

udpos.Arabo

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/udpos.Arabic')
  • Descrizione :
Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological
features, and syntactic dependencies) across different human languages. UD is an open community effort with over 200
contributors producing more than 100 treebanks in over 70 languages. If you’re new to UD, you should start by reading
the first part of the Short Introduction and then browsing the annotation guidelines.

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1680
'train' 6075
'validation' 909
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "pos_tags": {
        "feature": {
            "num_classes": 17,
            "names": [
                "ADJ",
                "ADP",
                "ADV",
                "AUX",
                "CCONJ",
                "DET",
                "INTJ",
                "NOUN",
                "NUM",
                "PART",
                "PRON",
                "PROPN",
                "PUNCT",
                "SCONJ",
                "SYM",
                "VERB",
                "X"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

udpos.Basco

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/udpos.Basque')
  • Descrizione :
Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological
features, and syntactic dependencies) across different human languages. UD is an open community effort with over 200
contributors producing more than 100 treebanks in over 70 languages. If you’re new to UD, you should start by reading
the first part of the Short Introduction and then browsing the annotation guidelines.

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1799
'train' 5396
'validation' 1798
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "pos_tags": {
        "feature": {
            "num_classes": 17,
            "names": [
                "ADJ",
                "ADP",
                "ADV",
                "AUX",
                "CCONJ",
                "DET",
                "INTJ",
                "NOUN",
                "NUM",
                "PART",
                "PRON",
                "PROPN",
                "PUNCT",
                "SCONJ",
                "SYM",
                "VERB",
                "X"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

udpos.bulgaro

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/udpos.Bulgarian')
  • Descrizione :
Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological
features, and syntactic dependencies) across different human languages. UD is an open community effort with over 200
contributors producing more than 100 treebanks in over 70 languages. If you’re new to UD, you should start by reading
the first part of the Short Introduction and then browsing the annotation guidelines.

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1116
'train' 8907
'validation' 1115
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "pos_tags": {
        "feature": {
            "num_classes": 17,
            "names": [
                "ADJ",
                "ADP",
                "ADV",
                "AUX",
                "CCONJ",
                "DET",
                "INTJ",
                "NOUN",
                "NUM",
                "PART",
                "PRON",
                "PROPN",
                "PUNCT",
                "SCONJ",
                "SYM",
                "VERB",
                "X"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

udpos.olandese

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/udpos.Dutch')
  • Descrizione :
Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological
features, and syntactic dependencies) across different human languages. UD is an open community effort with over 200
contributors producing more than 100 treebanks in over 70 languages. If you’re new to UD, you should start by reading
the first part of the Short Introduction and then browsing the annotation guidelines.

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1471
'train' 18051
'validation' 1394
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "pos_tags": {
        "feature": {
            "num_classes": 17,
            "names": [
                "ADJ",
                "ADP",
                "ADV",
                "AUX",
                "CCONJ",
                "DET",
                "INTJ",
                "NOUN",
                "NUM",
                "PART",
                "PRON",
                "PROPN",
                "PUNCT",
                "SCONJ",
                "SYM",
                "VERB",
                "X"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

udpos.Inglese

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/udpos.English')
  • Descrizione :
Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological
features, and syntactic dependencies) across different human languages. UD is an open community effort with over 200
contributors producing more than 100 treebanks in over 70 languages. If you’re new to UD, you should start by reading
the first part of the Short Introduction and then browsing the annotation guidelines.

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 5440
'train' 21253
'validation' 3974
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "pos_tags": {
        "feature": {
            "num_classes": 17,
            "names": [
                "ADJ",
                "ADP",
                "ADV",
                "AUX",
                "CCONJ",
                "DET",
                "INTJ",
                "NOUN",
                "NUM",
                "PART",
                "PRON",
                "PROPN",
                "PUNCT",
                "SCONJ",
                "SYM",
                "VERB",
                "X"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

udpos.estone

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/udpos.Estonian')
  • Descrizione :
Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological
features, and syntactic dependencies) across different human languages. UD is an open community effort with over 200
contributors producing more than 100 treebanks in over 70 languages. If you’re new to UD, you should start by reading
the first part of the Short Introduction and then browsing the annotation guidelines.

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 3760
'train' 25749
'validation' 3125
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "pos_tags": {
        "feature": {
            "num_classes": 17,
            "names": [
                "ADJ",
                "ADP",
                "ADV",
                "AUX",
                "CCONJ",
                "DET",
                "INTJ",
                "NOUN",
                "NUM",
                "PART",
                "PRON",
                "PROPN",
                "PUNCT",
                "SCONJ",
                "SYM",
                "VERB",
                "X"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

udpos.finlandese

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/udpos.Finnish')
  • Descrizione :
Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological
features, and syntactic dependencies) across different human languages. UD is an open community effort with over 200
contributors producing more than 100 treebanks in over 70 languages. If you’re new to UD, you should start by reading
the first part of the Short Introduction and then browsing the annotation guidelines.

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 4422
'train' 27198
'validation' 3239
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "pos_tags": {
        "feature": {
            "num_classes": 17,
            "names": [
                "ADJ",
                "ADP",
                "ADV",
                "AUX",
                "CCONJ",
                "DET",
                "INTJ",
                "NOUN",
                "NUM",
                "PART",
                "PRON",
                "PROPN",
                "PUNCT",
                "SCONJ",
                "SYM",
                "VERB",
                "X"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

udpos.francese

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/udpos.French')
  • Descrizione :
Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological
features, and syntactic dependencies) across different human languages. UD is an open community effort with over 200
contributors producing more than 100 treebanks in over 70 languages. If you’re new to UD, you should start by reading
the first part of the Short Introduction and then browsing the annotation guidelines.

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 9465
'train' 47308
'validation' 5979
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "pos_tags": {
        "feature": {
            "num_classes": 17,
            "names": [
                "ADJ",
                "ADP",
                "ADV",
                "AUX",
                "CCONJ",
                "DET",
                "INTJ",
                "NOUN",
                "NUM",
                "PART",
                "PRON",
                "PROPN",
                "PUNCT",
                "SCONJ",
                "SYM",
                "VERB",
                "X"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

udpos.tedesco

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/udpos.German')
  • Descrizione :
Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological
features, and syntactic dependencies) across different human languages. UD is an open community effort with over 200
contributors producing more than 100 treebanks in over 70 languages. If you’re new to UD, you should start by reading
the first part of the Short Introduction and then browsing the annotation guidelines.

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 22458
'train' 166849
'validation' 19233
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "pos_tags": {
        "feature": {
            "num_classes": 17,
            "names": [
                "ADJ",
                "ADP",
                "ADV",
                "AUX",
                "CCONJ",
                "DET",
                "INTJ",
                "NOUN",
                "NUM",
                "PART",
                "PRON",
                "PROPN",
                "PUNCT",
                "SCONJ",
                "SYM",
                "VERB",
                "X"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

udpos.Greco

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/udpos.Greek')
  • Descrizione :
Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological
features, and syntactic dependencies) across different human languages. UD is an open community effort with over 200
contributors producing more than 100 treebanks in over 70 languages. If you’re new to UD, you should start by reading
the first part of the Short Introduction and then browsing the annotation guidelines.

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 2809
'train' 28152
'validation' 2559
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "pos_tags": {
        "feature": {
            "num_classes": 17,
            "names": [
                "ADJ",
                "ADP",
                "ADV",
                "AUX",
                "CCONJ",
                "DET",
                "INTJ",
                "NOUN",
                "NUM",
                "PART",
                "PRON",
                "PROPN",
                "PUNCT",
                "SCONJ",
                "SYM",
                "VERB",
                "X"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

udpos.ebraico

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/udpos.Hebrew')
  • Descrizione :
Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological
features, and syntactic dependencies) across different human languages. UD is an open community effort with over 200
contributors producing more than 100 treebanks in over 70 languages. If you’re new to UD, you should start by reading
the first part of the Short Introduction and then browsing the annotation guidelines.

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 491
'train' 5241
'validation' 484
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "pos_tags": {
        "feature": {
            "num_classes": 17,
            "names": [
                "ADJ",
                "ADP",
                "ADV",
                "AUX",
                "CCONJ",
                "DET",
                "INTJ",
                "NOUN",
                "NUM",
                "PART",
                "PRON",
                "PROPN",
                "PUNCT",
                "SCONJ",
                "SYM",
                "VERB",
                "X"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

udpos.Hindi

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/udpos.Hindi')
  • Descrizione :
Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological
features, and syntactic dependencies) across different human languages. UD is an open community effort with over 200
contributors producing more than 100 treebanks in over 70 languages. If you’re new to UD, you should start by reading
the first part of the Short Introduction and then browsing the annotation guidelines.

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 2684
'train' 13304
'validation' 1659
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "pos_tags": {
        "feature": {
            "num_classes": 17,
            "names": [
                "ADJ",
                "ADP",
                "ADV",
                "AUX",
                "CCONJ",
                "DET",
                "INTJ",
                "NOUN",
                "NUM",
                "PART",
                "PRON",
                "PROPN",
                "PUNCT",
                "SCONJ",
                "SYM",
                "VERB",
                "X"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

udpos.ungherese

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/udpos.Hungarian')
  • Descrizione :
Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological
features, and syntactic dependencies) across different human languages. UD is an open community effort with over 200
contributors producing more than 100 treebanks in over 70 languages. If you’re new to UD, you should start by reading
the first part of the Short Introduction and then browsing the annotation guidelines.

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 449
'train' 910
'validation' 441
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "pos_tags": {
        "feature": {
            "num_classes": 17,
            "names": [
                "ADJ",
                "ADP",
                "ADV",
                "AUX",
                "CCONJ",
                "DET",
                "INTJ",
                "NOUN",
                "NUM",
                "PART",
                "PRON",
                "PROPN",
                "PUNCT",
                "SCONJ",
                "SYM",
                "VERB",
                "X"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

udpos.indonesiano

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/udpos.Indonesian')
  • Descrizione :
Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological
features, and syntactic dependencies) across different human languages. UD is an open community effort with over 200
contributors producing more than 100 treebanks in over 70 languages. If you’re new to UD, you should start by reading
the first part of the Short Introduction and then browsing the annotation guidelines.

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1557
'train' 4477
'validation' 559
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "pos_tags": {
        "feature": {
            "num_classes": 17,
            "names": [
                "ADJ",
                "ADP",
                "ADV",
                "AUX",
                "CCONJ",
                "DET",
                "INTJ",
                "NOUN",
                "NUM",
                "PART",
                "PRON",
                "PROPN",
                "PUNCT",
                "SCONJ",
                "SYM",
                "VERB",
                "X"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

udpos.Italiano

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/udpos.Italian')
  • Descrizione :
Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological
features, and syntactic dependencies) across different human languages. UD is an open community effort with over 200
contributors producing more than 100 treebanks in over 70 languages. If you’re new to UD, you should start by reading
the first part of the Short Introduction and then browsing the annotation guidelines.

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 3518
'train' 29685
'validation' 2278
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "pos_tags": {
        "feature": {
            "num_classes": 17,
            "names": [
                "ADJ",
                "ADP",
                "ADV",
                "AUX",
                "CCONJ",
                "DET",
                "INTJ",
                "NOUN",
                "NUM",
                "PART",
                "PRON",
                "PROPN",
                "PUNCT",
                "SCONJ",
                "SYM",
                "VERB",
                "X"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

udpos.giapponese

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/udpos.Japanese')
  • Descrizione :
Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological
features, and syntactic dependencies) across different human languages. UD is an open community effort with over 200
contributors producing more than 100 treebanks in over 70 languages. If you’re new to UD, you should start by reading
the first part of the Short Introduction and then browsing the annotation guidelines.

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 2372
'train' 7125
'validation' 511
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "pos_tags": {
        "feature": {
            "num_classes": 17,
            "names": [
                "ADJ",
                "ADP",
                "ADV",
                "AUX",
                "CCONJ",
                "DET",
                "INTJ",
                "NOUN",
                "NUM",
                "PART",
                "PRON",
                "PROPN",
                "PUNCT",
                "SCONJ",
                "SYM",
                "VERB",
                "X"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

udpos.kazako

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/udpos.Kazakh')
  • Descrizione :
Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological
features, and syntactic dependencies) across different human languages. UD is an open community effort with over 200
contributors producing more than 100 treebanks in over 70 languages. If you’re new to UD, you should start by reading
the first part of the Short Introduction and then browsing the annotation guidelines.

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1047
'train' 31
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "pos_tags": {
        "feature": {
            "num_classes": 17,
            "names": [
                "ADJ",
                "ADP",
                "ADV",
                "AUX",
                "CCONJ",
                "DET",
                "INTJ",
                "NOUN",
                "NUM",
                "PART",
                "PRON",
                "PROPN",
                "PUNCT",
                "SCONJ",
                "SYM",
                "VERB",
                "X"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

udpos.Coreano

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/udpos.Korean')
  • Descrizione :
Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological
features, and syntactic dependencies) across different human languages. UD is an open community effort with over 200
contributors producing more than 100 treebanks in over 70 languages. If you’re new to UD, you should start by reading
the first part of the Short Introduction and then browsing the annotation guidelines.

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 4276
'train' 27410
'validation' 3016
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "pos_tags": {
        "feature": {
            "num_classes": 17,
            "names": [
                "ADJ",
                "ADP",
                "ADV",
                "AUX",
                "CCONJ",
                "DET",
                "INTJ",
                "NOUN",
                "NUM",
                "PART",
                "PRON",
                "PROPN",
                "PUNCT",
                "SCONJ",
                "SYM",
                "VERB",
                "X"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

udpos.Cinese

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/udpos.Chinese')
  • Descrizione :
Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological
features, and syntactic dependencies) across different human languages. UD is an open community effort with over 200
contributors producing more than 100 treebanks in over 70 languages. If you’re new to UD, you should start by reading
the first part of the Short Introduction and then browsing the annotation guidelines.

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 5528
'train' 18998
'validation' 3038
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "pos_tags": {
        "feature": {
            "num_classes": 17,
            "names": [
                "ADJ",
                "ADP",
                "ADV",
                "AUX",
                "CCONJ",
                "DET",
                "INTJ",
                "NOUN",
                "NUM",
                "PART",
                "PRON",
                "PROPN",
                "PUNCT",
                "SCONJ",
                "SYM",
                "VERB",
                "X"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

udpos.Marathi

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/udpos.Marathi')
  • Descrizione :
Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological
features, and syntactic dependencies) across different human languages. UD is an open community effort with over 200
contributors producing more than 100 treebanks in over 70 languages. If you’re new to UD, you should start by reading
the first part of the Short Introduction and then browsing the annotation guidelines.

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 47
'train' 373
'validation' 46
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "pos_tags": {
        "feature": {
            "num_classes": 17,
            "names": [
                "ADJ",
                "ADP",
                "ADV",
                "AUX",
                "CCONJ",
                "DET",
                "INTJ",
                "NOUN",
                "NUM",
                "PART",
                "PRON",
                "PROPN",
                "PUNCT",
                "SCONJ",
                "SYM",
                "VERB",
                "X"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

udpos.persiano

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/udpos.Persian')
  • Descrizione :
Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological
features, and syntactic dependencies) across different human languages. UD is an open community effort with over 200
contributors producing more than 100 treebanks in over 70 languages. If you’re new to UD, you should start by reading
the first part of the Short Introduction and then browsing the annotation guidelines.

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 600
'train' 4798
'validation' 599
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "pos_tags": {
        "feature": {
            "num_classes": 17,
            "names": [
                "ADJ",
                "ADP",
                "ADV",
                "AUX",
                "CCONJ",
                "DET",
                "INTJ",
                "NOUN",
                "NUM",
                "PART",
                "PRON",
                "PROPN",
                "PUNCT",
                "SCONJ",
                "SYM",
                "VERB",
                "X"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

udpos.portoghese

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/udpos.Portuguese')
  • Descrizione :
Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological
features, and syntactic dependencies) across different human languages. UD is an open community effort with over 200
contributors producing more than 100 treebanks in over 70 languages. If you’re new to UD, you should start by reading
the first part of the Short Introduction and then browsing the annotation guidelines.

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 2681
'train' 17992
'validation' 1770
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "pos_tags": {
        "feature": {
            "num_classes": 17,
            "names": [
                "ADJ",
                "ADP",
                "ADV",
                "AUX",
                "CCONJ",
                "DET",
                "INTJ",
                "NOUN",
                "NUM",
                "PART",
                "PRON",
                "PROPN",
                "PUNCT",
                "SCONJ",
                "SYM",
                "VERB",
                "X"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

udpos.russo

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/udpos.Russian')
  • Descrizione :
Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological
features, and syntactic dependencies) across different human languages. UD is an open community effort with over 200
contributors producing more than 100 treebanks in over 70 languages. If you’re new to UD, you should start by reading
the first part of the Short Introduction and then browsing the annotation guidelines.

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 11336
'train' 67435
'validation' 9960
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "pos_tags": {
        "feature": {
            "num_classes": 17,
            "names": [
                "ADJ",
                "ADP",
                "ADV",
                "AUX",
                "CCONJ",
                "DET",
                "INTJ",
                "NOUN",
                "NUM",
                "PART",
                "PRON",
                "PROPN",
                "PUNCT",
                "SCONJ",
                "SYM",
                "VERB",
                "X"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

udpos.Spagnolo

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/udpos.Spanish')
  • Descrizione :
Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological
features, and syntactic dependencies) across different human languages. UD is an open community effort with over 200
contributors producing more than 100 treebanks in over 70 languages. If you’re new to UD, you should start by reading
the first part of the Short Introduction and then browsing the annotation guidelines.

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 3147
'train' 28492
'validation' 3054
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "pos_tags": {
        "feature": {
            "num_classes": 17,
            "names": [
                "ADJ",
                "ADP",
                "ADV",
                "AUX",
                "CCONJ",
                "DET",
                "INTJ",
                "NOUN",
                "NUM",
                "PART",
                "PRON",
                "PROPN",
                "PUNCT",
                "SCONJ",
                "SYM",
                "VERB",
                "X"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

udpos.Tagalog

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/udpos.Tagalog')
  • Descrizione :
Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological
features, and syntactic dependencies) across different human languages. UD is an open community effort with over 200
contributors producing more than 100 treebanks in over 70 languages. If you’re new to UD, you should start by reading
the first part of the Short Introduction and then browsing the annotation guidelines.

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 55
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "pos_tags": {
        "feature": {
            "num_classes": 17,
            "names": [
                "ADJ",
                "ADP",
                "ADV",
                "AUX",
                "CCONJ",
                "DET",
                "INTJ",
                "NOUN",
                "NUM",
                "PART",
                "PRON",
                "PROPN",
                "PUNCT",
                "SCONJ",
                "SYM",
                "VERB",
                "X"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

udpos.Tamil

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/udpos.Tamil')
  • Descrizione :
Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological
features, and syntactic dependencies) across different human languages. UD is an open community effort with over 200
contributors producing more than 100 treebanks in over 70 languages. If you’re new to UD, you should start by reading
the first part of the Short Introduction and then browsing the annotation guidelines.

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 120
'train' 400
'validation' 80
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "pos_tags": {
        "feature": {
            "num_classes": 17,
            "names": [
                "ADJ",
                "ADP",
                "ADV",
                "AUX",
                "CCONJ",
                "DET",
                "INTJ",
                "NOUN",
                "NUM",
                "PART",
                "PRON",
                "PROPN",
                "PUNCT",
                "SCONJ",
                "SYM",
                "VERB",
                "X"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

udpos.Telugu

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/udpos.Telugu')
  • Descrizione :
Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological
features, and syntactic dependencies) across different human languages. UD is an open community effort with over 200
contributors producing more than 100 treebanks in over 70 languages. If you’re new to UD, you should start by reading
the first part of the Short Introduction and then browsing the annotation guidelines.

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 146
'train' 1051
'validation' 131
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "pos_tags": {
        "feature": {
            "num_classes": 17,
            "names": [
                "ADJ",
                "ADP",
                "ADV",
                "AUX",
                "CCONJ",
                "DET",
                "INTJ",
                "NOUN",
                "NUM",
                "PART",
                "PRON",
                "PROPN",
                "PUNCT",
                "SCONJ",
                "SYM",
                "VERB",
                "X"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

udpos.Thai

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/udpos.Thai')
  • Descrizione :
Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological
features, and syntactic dependencies) across different human languages. UD is an open community effort with over 200
contributors producing more than 100 treebanks in over 70 languages. If you’re new to UD, you should start by reading
the first part of the Short Introduction and then browsing the annotation guidelines.

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 1000
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "pos_tags": {
        "feature": {
            "num_classes": 17,
            "names": [
                "ADJ",
                "ADP",
                "ADV",
                "AUX",
                "CCONJ",
                "DET",
                "INTJ",
                "NOUN",
                "NUM",
                "PART",
                "PRON",
                "PROPN",
                "PUNCT",
                "SCONJ",
                "SYM",
                "VERB",
                "X"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

udpos.turco

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/udpos.Turkish')
  • Descrizione :
Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological
features, and syntactic dependencies) across different human languages. UD is an open community effort with over 200
contributors producing more than 100 treebanks in over 70 languages. If you’re new to UD, you should start by reading
the first part of the Short Introduction and then browsing the annotation guidelines.

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 4785
'train' 3664
'validation' 988
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "pos_tags": {
        "feature": {
            "num_classes": 17,
            "names": [
                "ADJ",
                "ADP",
                "ADV",
                "AUX",
                "CCONJ",
                "DET",
                "INTJ",
                "NOUN",
                "NUM",
                "PART",
                "PRON",
                "PROPN",
                "PUNCT",
                "SCONJ",
                "SYM",
                "VERB",
                "X"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

udpos.Urdu

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/udpos.Urdu')
  • Descrizione :
Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological
features, and syntactic dependencies) across different human languages. UD is an open community effort with over 200
contributors producing more than 100 treebanks in over 70 languages. If you’re new to UD, you should start by reading
the first part of the Short Introduction and then browsing the annotation guidelines.

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 535
'train' 4043
'validation' 552
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "pos_tags": {
        "feature": {
            "num_classes": 17,
            "names": [
                "ADJ",
                "ADP",
                "ADV",
                "AUX",
                "CCONJ",
                "DET",
                "INTJ",
                "NOUN",
                "NUM",
                "PART",
                "PRON",
                "PROPN",
                "PUNCT",
                "SCONJ",
                "SYM",
                "VERB",
                "X"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

udpos.vietnamita

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/udpos.Vietnamese')
  • Descrizione :
Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological
features, and syntactic dependencies) across different human languages. UD is an open community effort with over 200
contributors producing more than 100 treebanks in over 70 languages. If you’re new to UD, you should start by reading
the first part of the Short Introduction and then browsing the annotation guidelines.

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 800
'train' 1400
'validation' 800
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "pos_tags": {
        "feature": {
            "num_classes": 17,
            "names": [
                "ADJ",
                "ADP",
                "ADV",
                "AUX",
                "CCONJ",
                "DET",
                "INTJ",
                "NOUN",
                "NUM",
                "PART",
                "PRON",
                "PROPN",
                "PUNCT",
                "SCONJ",
                "SYM",
                "VERB",
                "X"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

udpos.Yoruba

Utilizzare il comando seguente per caricare questo set di dati in TFDS:

ds = tfds.load('huggingface:xtreme/udpos.Yoruba')
  • Descrizione :
Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological
features, and syntactic dependencies) across different human languages. UD is an open community effort with over 200
contributors producing more than 100 treebanks in over 70 languages. If you’re new to UD, you should start by reading
the first part of the Short Introduction and then browsing the annotation guidelines.

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of
the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages
(spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of
syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks,
and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil
(spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the
Niger-Congo languages Swahili and Yoruba, spoken in Africa.
  • Licenza : nessuna licenza conosciuta
  • Versione : 1.0.0
  • Divide :
Diviso Esempi
'test' 100
  • Caratteristiche :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "pos_tags": {
        "feature": {
            "num_classes": 17,
            "names": [
                "ADJ",
                "ADP",
                "ADV",
                "AUX",
                "CCONJ",
                "DET",
                "INTJ",
                "NOUN",
                "NUM",
                "PART",
                "PRON",
                "PROPN",
                "PUNCT",
                "SCONJ",
                "SYM",
                "VERB",
                "X"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}