setimes

References:

bg-bs

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bg-bs')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  136009
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bg",
            "bs"
        ],
        "id": null,
        "_type": "Translation"
    }
}
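
A minimal sketch of loading any configuration on this page programmatically. The helper names `setimes_name` and `load_pair` are hypothetical, and the sketch assumes that config names always list the two language codes in alphabetical order, as every config shown here does. It also assumes `tensorflow_datasets` and whatever the `huggingface:` namespace depends on are installed.

```python
def setimes_name(lang_a: str, lang_b: str) -> str:
    """Build the TFDS name for a SETimes language pair (hypothetical helper)."""
    # Assumption: the config name puts the two codes in alphabetical order.
    first, second = sorted((lang_a, lang_b))
    return f"huggingface:setimes/{first}-{second}"


def load_pair(lang_a: str, lang_b: str, split: str = "train"):
    """Load one language pair; every config listed here has only a 'train' split."""
    # Imported lazily so setimes_name() works without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(setimes_name(lang_a, lang_b), split=split)
```

For example, `load_pair("bs", "bg")` would request `huggingface:setimes/bg-bs`, the same name used in the command above.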

bg-el

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bg-el')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  212437
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bg",
            "el"
        ],
        "id": null,
        "_type": "Translation"
    }
}
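
The feature structure above can be read as one record per aligned sentence pair. The sketch below works on a plain Python dict and assumes the `translation` feature maps each language code to its text, which is how the Hugging Face `Translation` feature is usually represented; records loaded through TFDS may instead carry tensors or byte strings. The function name `get_pair` and the sample record are hypothetical placeholders, not corpus content.

```python
from typing import Mapping, Tuple


def get_pair(example: Mapping, src: str, tgt: str) -> Tuple[str, str]:
    """Return (source, target) text from a record shaped like the features above."""
    translation = example["translation"]
    return translation[src], translation[tgt]


# Hypothetical record with placeholder strings, not real corpus content.
record = {
    "id": "0",
    "translation": {"bg": "<Bulgarian sentence>", "el": "<Greek sentence>"},
}

print(get_pair(record, "bg", "el"))
```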

bs-el

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bs-el')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  137602
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bs",
            "el"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bg-en

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bg-en')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  213160
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bg",
            "en"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bs-en

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bs-en')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  138387
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bs",
            "en"
        ],
        "id": null,
        "_type": "Translation"
    }
}

el-en

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/el-en')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  227168
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "el",
            "en"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bg-hr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bg-hr')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  203465
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bg",
            "hr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bs-hr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bs-hr')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  138402
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bs",
            "hr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

el-hr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/el-hr')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  205008
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "el",
            "hr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

en-hr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/en-hr')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  205910
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "en",
            "hr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bg-mk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bg-mk')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  207169
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bg",
            "mk"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bs-mk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bs-mk')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  132779
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bs",
            "mk"
        ],
        "id": null,
        "_type": "Translation"
    }
}

el-mk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/el-mk')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  207262
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "el",
            "mk"
        ],
        "id": null,
        "_type": "Translation"
    }
}

en-mk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/en-mk')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  207777
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "en",
            "mk"
        ],
        "id": null,
        "_type": "Translation"
    }
}

hr-mk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/hr-mk')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  198876
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "hr",
            "mk"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bg-ro

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bg-ro')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  210842
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bg",
            "ro"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bs-ro

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bs-ro')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  137365
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bs",
            "ro"
        ],
        "id": null,
        "_type": "Translation"
    }
}

el-ro

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/el-ro')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  212359
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "el",
            "ro"
        ],
        "id": null,
        "_type": "Translation"
    }
}

en-ro

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/en-ro')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  213047
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "en",
            "ro"
        ],
        "id": null,
        "_type": "Translation"
    }
}

hr-ro

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/hr-ro')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  203777
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "hr",
            "ro"
        ],
        "id": null,
        "_type": "Translation"
    }
}

mk-ro

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/mk-ro')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  206168
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "mk",
            "ro"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bg-sq

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bg-sq')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split    Examples
'train'  211518
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bg",
            "sq"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bs-sq

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bs-sq')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material when the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 137953
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bs",
            "sq"
        ],
        "id": null,
        "_type": "Translation"
    }
}
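
The load commands on this page all use a name of the form `huggingface:setimes/<pair>`, and in every listing here the two language codes appear in alphabetical order. A small sketch of building that name from two codes follows; `setimes_config_name` is a hypothetical helper, and the alphabetical-order rule is an assumption inferred from the listed configs rather than a documented guarantee.

```python
def setimes_config_name(lang_a: str, lang_b: str) -> str:
    """Build the TFDS name for a SETimes language pair.

    Assumption: the two codes are joined in alphabetical order, which matches
    every config listed in this section (for example 'bs-sq', not 'sq-bs').
    """
    first, second = sorted([lang_a.lower(), lang_b.lower()])
    if first == second:
        raise ValueError("a language pair needs two different codes")
    return f"huggingface:setimes/{first}-{second}"


# The order of the arguments does not matter.
name = setimes_config_name("sq", "bs")
```

The resulting string can be passed to `tfds.load` in place of the hard-coded names shown in the listings.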

el-sq

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/el-sq')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material when the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 226577
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "el",
            "sq"
        ],
        "id": null,
        "_type": "Translation"
    }
}

en-sq

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/en-sq')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material when the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 227516
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "en",
            "sq"
        ],
        "id": null,
        "_type": "Translation"
    }
}

hr-sq

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/hr-sq')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material when the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 205044
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "hr",
            "sq"
        ],
        "id": null,
        "_type": "Translation"
    }
}

mk-sq

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/mk-sq')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material when the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 206601
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "mk",
            "sq"
        ],
        "id": null,
        "_type": "Translation"
    }
}

ro-sq

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/ro-sq')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material when the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 212320
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "ro",
            "sq"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bg-sr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bg-sr')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material when the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 211172
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bg",
            "sr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bs-sr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bs-sr')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material when the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 135945
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bs",
            "sr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

el-sr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/el-sr')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material when the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 224311
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "el",
            "sr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

en-sr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/en-sr')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material when the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 225169
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "en",
            "sr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

hr-sr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/hr-sr')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material when the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 203989
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "hr",
            "sr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

mk-sr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/mk-sr')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material when the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 207295
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "mk",
            "sr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

ro-sr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/ro-sr')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material when the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 210612
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "ro",
            "sr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

sq-sr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/sq-sr')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material when the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 224595
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "sq",
            "sr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bg-tr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bg-tr')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material when the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 206071
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bg",
            "tr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

bs-tr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/bs-tr')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material when the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 133958
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "bs",
            "tr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

el-tr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/el-tr')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material when the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 207029
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "el",
            "tr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

en-tr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/en-tr')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material when the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 207678
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "en",
            "tr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

hr-tr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/hr-tr')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material when the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 199260
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "hr",
            "tr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

mk-tr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/mk-tr')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material when the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 203231
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "mk",
            "tr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

ro-tr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/ro-tr')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material when the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 206104
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "ro",
            "tr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

sq-tr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/sq-tr')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material when the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 207107
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "sq",
            "tr"
        ],
        "id": null,
        "_type": "Translation"
    }
}

sr-tr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:setimes/sr-tr')
  • Description:
SETimes: A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes news and views from Southeast Europe in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:

- stricter extraction process: no HTML residues present
- language identification on every non-English document: non-English online documents contain English material when the article was not translated into that language
- resolving encoding issues in Croatian and Serbian: diacritics were partially lost due to encoding errors, so the text was rediacritized.
  • License: No known license
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 205993
  • Features:
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "sr",
            "tr"
        ],
        "id": null,
        "_type": "Translation"
    }
}
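
The train-split sizes listed above vary noticeably between language pairs. As a rough sketch, the counts for the pairs whose second code is `sq`, `sr`, or `tr` can be collected in a dictionary and compared. The numbers are copied from the listings; the variable names are illustrative only.

```python
# Train-split sizes copied from the listings above.
train_sizes = {
    "bg-sq": 211518, "bs-sq": 137953, "el-sq": 226577, "en-sq": 227516,
    "hr-sq": 205044, "mk-sq": 206601, "ro-sq": 212320,
    "bg-sr": 211172, "bs-sr": 135945, "el-sr": 224311, "en-sr": 225169,
    "hr-sr": 203989, "mk-sr": 207295, "ro-sr": 210612, "sq-sr": 224595,
    "bg-tr": 206071, "bs-tr": 133958, "el-tr": 207029, "en-tr": 207678,
    "hr-tr": 199260, "mk-tr": 203231, "ro-tr": 206104, "sq-tr": 207107,
    "sr-tr": 205993,
}

# Pairs with the most and the fewest training examples.
largest = max(train_sizes, key=train_sizes.get)
smallest = min(train_sizes, key=train_sizes.get)

# Separate the pairs that involve Bosnian ("bs") from the rest.
bosnian = {p: n for p, n in train_sizes.items() if "bs" in p.split("-")}
others = {p: n for p, n in train_sizes.items() if p not in bosnian}

print(largest, train_sizes[largest])
print(smallest, train_sizes[smallest])
```

In these listings every pair that includes Bosnian is smaller than any pair that does not, which is worth keeping in mind when comparing models trained on different configs.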