참고자료:
bg-bs
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/bg-bs')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 136009 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"bg",
"bs"
],
"id": null,
"_type": "Translation"
}
}
bg-el
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/bg-el')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 212437 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"bg",
"el"
],
"id": null,
"_type": "Translation"
}
}
bs-el
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/bs-el')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 137602 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"bs",
"el"
],
"id": null,
"_type": "Translation"
}
}
bg-en
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/bg-en')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 213160 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"bg",
"en"
],
"id": null,
"_type": "Translation"
}
}
bs-en
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/bs-en')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 138387 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"bs",
"en"
],
"id": null,
"_type": "Translation"
}
}
엘엔
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/el-en')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 227168 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"el",
"en"
],
"id": null,
"_type": "Translation"
}
}
bg-hr
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/bg-hr')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 203465 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"bg",
"hr"
],
"id": null,
"_type": "Translation"
}
}
bs-시간
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/bs-hr')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 138402 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"bs",
"hr"
],
"id": null,
"_type": "Translation"
}
}
엘시
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/el-hr')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 205008 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"el",
"hr"
],
"id": null,
"_type": "Translation"
}
}
시간
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/en-hr')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 205910 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"en",
"hr"
],
"id": null,
"_type": "Translation"
}
}
bg-mk
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/bg-mk')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 207169 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"bg",
"mk"
],
"id": null,
"_type": "Translation"
}
}
bs-mk
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/bs-mk')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 132779 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"bs",
"mk"
],
"id": null,
"_type": "Translation"
}
}
엘-mk
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/el-mk')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 207262 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"el",
"mk"
],
"id": null,
"_type": "Translation"
}
}
ko-mk
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/en-mk')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 207777 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"en",
"mk"
],
"id": null,
"_type": "Translation"
}
}
시간-mk
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/hr-mk')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 198876 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"hr",
"mk"
],
"id": null,
"_type": "Translation"
}
}
bg-ro
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/bg-ro')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 210842 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"bg",
"ro"
],
"id": null,
"_type": "Translation"
}
}
bs-ro
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/bs-ro')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 137365 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"bs",
"ro"
],
"id": null,
"_type": "Translation"
}
}
엘로
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/el-ro')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 212359 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"el",
"ro"
],
"id": null,
"_type": "Translation"
}
}
엔로
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/en-ro')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 213047 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"en",
"ro"
],
"id": null,
"_type": "Translation"
}
}
시간로
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/hr-ro')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 203777 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"hr",
"ro"
],
"id": null,
"_type": "Translation"
}
}
MK로
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/mk-ro')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 206168 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"mk",
"ro"
],
"id": null,
"_type": "Translation"
}
}
bg-제곱
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/bg-sq')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 211518 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"bg",
"sq"
],
"id": null,
"_type": "Translation"
}
}
bs-제곱
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/bs-sq')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 137953 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"bs",
"sq"
],
"id": null,
"_type": "Translation"
}
}
엘-스퀘어
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/el-sq')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 226577 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"el",
"sq"
],
"id": null,
"_type": "Translation"
}
}
en-sq
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/en-sq')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 227516 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"en",
"sq"
],
"id": null,
"_type": "Translation"
}
}
시간-제곱
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/hr-sq')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 205044 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"hr",
"sq"
],
"id": null,
"_type": "Translation"
}
}
mk-제곱
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/mk-sq')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 206601 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"mk",
"sq"
],
"id": null,
"_type": "Translation"
}
}
로제곱
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/ro-sq')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 212320 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"ro",
"sq"
],
"id": null,
"_type": "Translation"
}
}
bg-sr
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/bg-sr')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 211172 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"bg",
"sr"
],
"id": null,
"_type": "Translation"
}
}
bs-sr
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/bs-sr')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 135945 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"bs",
"sr"
],
"id": null,
"_type": "Translation"
}
}
엘-시르
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/el-sr')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 224311 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"el",
"sr"
],
"id": null,
"_type": "Translation"
}
}
ko-sr
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/en-sr')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 225169 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"en",
"sr"
],
"id": null,
"_type": "Translation"
}
}
시간-시간
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/hr-sr')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 203989 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"hr",
"sr"
],
"id": null,
"_type": "Translation"
}
}
mk-sr
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/mk-sr')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 207295 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"mk",
"sr"
],
"id": null,
"_type": "Translation"
}
}
ro-sr
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/ro-sr')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 210612 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"ro",
"sr"
],
"id": null,
"_type": "Translation"
}
}
sq-sr
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/sq-sr')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 224595 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"sq",
"sr"
],
"id": null,
"_type": "Translation"
}
}
bg-tr
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/bg-tr')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 206071 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"bg",
"tr"
],
"id": null,
"_type": "Translation"
}
}
bs-tr
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/bs-tr')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 133958 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"bs",
"tr"
],
"id": null,
"_type": "Translation"
}
}
엘-tr
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/el-tr')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 207029 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"el",
"tr"
],
"id": null,
"_type": "Translation"
}
}
en-tr
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/en-tr')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 207678 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"en",
"tr"
],
"id": null,
"_type": "Translation"
}
}
시간-tr
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/hr-tr')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 199260 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"hr",
"tr"
],
"id": null,
"_type": "Translation"
}
}
mk-tr
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/mk-tr')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 203231 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"mk",
"tr"
],
"id": null,
"_type": "Translation"
}
}
로트르
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/ro-tr')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 206104 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"ro",
"tr"
],
"id": null,
"_type": "Translation"
}
}
제곱-tr
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/sq-tr')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 207107 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"sq",
"tr"
],
"id": null,
"_type": "Translation"
}
}
sr-tr
TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.
ds = tfds.load('huggingface:setimes/sr-tr')
- 설명 :
SETimes – A Parallel Corpus of English and South-East European Languages
The corpus is based on the content published on the SETimes.com news portal. The news portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian and Serbian. This version of the corpus tries to solve the issues present in an older version of the corpus (published inside OPUS, described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve existing issues:
- stricter extraction process – no HTML residues present
- language identification on every non-English document – non-English online documents contain English material in case the article was not translated into that language
- resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors – text was rediacritized.
- 라이센스 : 알려진 라이센스 없음
- 버전 : 1.0.0
- 분할 :
나뉘다 | 예 |
---|---|
'train' | 205993 |
- 특징 :
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"translation": {
"languages": [
"sr",
"tr"
],
"id": null,
"_type": "Translation"
}
}