polish
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:multilingual_librispeech/polish')
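The same call pattern applies to every configuration on this page; only the final path component changes. The following is a minimal sketch, assuming `tensorflow_datasets` is installed with support for `huggingface:` datasets. `CONFIGS`, `tfds_name`, and `load_mls` are illustrative names and are not part of TFDS.

```python
# Configurations documented on this page (illustrative constant, not a TFDS API).
CONFIGS = ("polish", "german", "dutch", "french", "spanish", "italian", "portuguese")


def tfds_name(config: str) -> str:
    """Build the TFDS dataset name for one of the documented configurations."""
    if config not in CONFIGS:
        raise ValueError(f"config {config!r} is not documented on this page")
    return f"huggingface:multilingual_librispeech/{config}"


def load_mls(config: str, split: str = "validation"):
    """Load a single split; split names follow the Splits tables on this page."""
    # Imported lazily so the name helper above works without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(tfds_name(config), split=split)
```

For example, `load_mls("polish", split="train.1h")` would request only the small 1-hour training subset instead of every split.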
- Description:
The Multilingual LibriSpeech (MLS) dataset is a large multilingual corpus suitable for speech research. It is derived from read audiobooks from LibriVox and covers 8 languages: English, German, Dutch, Spanish, French, Italian, Portuguese, and Polish.
- License: No known license
- Version: 2.1.0
- Splits:
Split | Examples |
---|---|
'test' | 520 |
'train' | 25043 |
'train.1h' | 238 |
'train.9h' | 2173 |
'validation' | 512 |
- Features:
{
"file": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"audio": {
"sampling_rate": 16000,
"mono": true,
"id": null,
"_type": "Audio"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"speaker_id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"chapter_id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
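The feature specification above can be turned into a quick sanity check on individual records. This is a sketch under stated assumptions: it presumes examples have already been converted to plain Python values (strings decoded, integers unwrapped), it does not inspect the contents of `audio` because the schema only gives its sampling rate and channel layout, and `EXPECTED_TYPES`, `check_example`, and the sample record are hypothetical.

```python
from typing import Any, Dict, List

# Python types matching the "dtype" entries in the feature specification.
EXPECTED_TYPES = {
    "file": str,
    "text": str,
    "speaker_id": int,
    "chapter_id": int,
    "id": str,
}


def check_example(example: Dict[str, Any]) -> List[str]:
    """Return a list of schema problems; an empty list means the record looks fine."""
    problems = []
    for name, expected in EXPECTED_TYPES.items():
        if name not in example:
            problems.append(f"missing field: {name}")
        elif not isinstance(example[name], expected):
            problems.append(f"{name}: expected {expected.__name__}")
    # Only the presence of the audio field is checked, not its payload.
    if "audio" not in example:
        problems.append("missing field: audio")
    return problems


# Hypothetical record, made up purely for illustration.
sample = {
    "file": "example.flac",
    "audio": object(),
    "text": "przykładowy tekst",
    "speaker_id": 1,
    "chapter_id": 2,
    "id": "1_2_000000",
}
print(check_example(sample))
```

The same check applies unchanged to the other configurations, since they share this feature specification.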
german
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:multilingual_librispeech/german')
- Description:
The Multilingual LibriSpeech (MLS) dataset is a large multilingual corpus suitable for speech research. It is derived from read audiobooks from LibriVox and covers 8 languages: English, German, Dutch, Spanish, French, Italian, Portuguese, and Polish.
- License: No known license
- Version: 2.1.0
- Splits:
Split | Examples |
---|---|
'test' | 3394 |
'train' | 469942 |
'train.1h' | 241 |
'train.9h' | 2194 |
'validation' | 3469 |
- Features:
{
"file": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"audio": {
"sampling_rate": 16000,
"mono": true,
"id": null,
"_type": "Audio"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"speaker_id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"chapter_id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
dutch
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:multilingual_librispeech/dutch')
- Description:
The Multilingual LibriSpeech (MLS) dataset is a large multilingual corpus suitable for speech research. It is derived from read audiobooks from LibriVox and covers 8 languages: English, German, Dutch, Spanish, French, Italian, Portuguese, and Polish.
- License: No known license
- Version: 2.1.0
- Splits:
Split | Examples |
---|---|
'test' | 3075 |
'train' | 374287 |
'train.1h' | 234 |
'train.9h' | 2153 |
'validation' | 3095 |
- Features:
{
"file": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"audio": {
"sampling_rate": 16000,
"mono": true,
"id": null,
"_type": "Audio"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"speaker_id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"chapter_id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
french
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:multilingual_librispeech/french')
- Description:
The Multilingual LibriSpeech (MLS) dataset is a large multilingual corpus suitable for speech research. It is derived from read audiobooks from LibriVox and covers 8 languages: English, German, Dutch, Spanish, French, Italian, Portuguese, and Polish.
- License: No known license
- Version: 2.1.0
- Splits:
Split | Examples |
---|---|
'test' | 2426 |
'train' | 258213 |
'train.1h' | 241 |
'train.9h' | 2167 |
'validation' | 2416 |
- Features:
{
"file": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"audio": {
"sampling_rate": 16000,
"mono": true,
"id": null,
"_type": "Audio"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"speaker_id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"chapter_id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
spanish
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:multilingual_librispeech/spanish')
- Description:
The Multilingual LibriSpeech (MLS) dataset is a large multilingual corpus suitable for speech research. It is derived from read audiobooks from LibriVox and covers 8 languages: English, German, Dutch, Spanish, French, Italian, Portuguese, and Polish.
- License: No known license
- Version: 2.1.0
- Splits:
Split | Examples |
---|---|
'test' | 2385 |
'train' | 220701 |
'train.1h' | 233 |
'train.9h' | 2110 |
'validation' | 2408 |
- Features:
{
"file": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"audio": {
"sampling_rate": 16000,
"mono": true,
"id": null,
"_type": "Audio"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"speaker_id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"chapter_id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
italian
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:multilingual_librispeech/italian')
- Description:
The Multilingual LibriSpeech (MLS) dataset is a large multilingual corpus suitable for speech research. It is derived from read audiobooks from LibriVox and covers 8 languages: English, German, Dutch, Spanish, French, Italian, Portuguese, and Polish.
- License: No known license
- Version: 2.1.0
- Splits:
Split | Examples |
---|---|
'test' | 1262 |
'train' | 59623 |
'train.1h' | 240 |
'train.9h' | 2173 |
'validation' | 1248 |
- Features:
{
"file": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"audio": {
"sampling_rate": 16000,
"mono": true,
"id": null,
"_type": "Audio"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"speaker_id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"chapter_id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
portuguese
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:multilingual_librispeech/portuguese')
- Description:
The Multilingual LibriSpeech (MLS) dataset is a large multilingual corpus suitable for speech research. It is derived from read audiobooks from LibriVox and covers 8 languages: English, German, Dutch, Spanish, French, Italian, Portuguese, and Polish.
- License: No known license
- Version: 2.1.0
- Splits:
Split | Examples |
---|---|
'test' | 871 |
'train' | 37533 |
'train.1h' | 236 |
'train.9h' | 2116 |
'validation' | 826 |
- Features:
{
"file": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"audio": {
"sampling_rate": 16000,
"mono": true,
"id": null,
"_type": "Audio"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"speaker_id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"chapter_id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
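The split sizes differ widely between configurations, which matters when budgeting download and training time. As a rough sketch, the counts from the Splits tables on this page can be collected into a dictionary and compared; `SPLIT_COUNTS` and the derived variables are illustrative names, and the numbers are copied from the tables rather than queried from TFDS.

```python
# Example counts copied from the Splits tables on this page (version 2.1.0).
SPLIT_COUNTS = {
    "polish": {"test": 520, "train": 25043, "train.1h": 238, "train.9h": 2173, "validation": 512},
    "german": {"test": 3394, "train": 469942, "train.1h": 241, "train.9h": 2194, "validation": 3469},
    "dutch": {"test": 3075, "train": 374287, "train.1h": 234, "train.9h": 2153, "validation": 3095},
    "french": {"test": 2426, "train": 258213, "train.1h": 241, "train.9h": 2167, "validation": 2416},
    "spanish": {"test": 2385, "train": 220701, "train.1h": 233, "train.9h": 2110, "validation": 2408},
    "italian": {"test": 1262, "train": 59623, "train.1h": 240, "train.9h": 2173, "validation": 1248},
    "portuguese": {"test": 871, "train": 37533, "train.1h": 236, "train.9h": 2116, "validation": 826},
}

# Total number of full-size training examples across the listed configurations.
total_train = sum(counts["train"] for counts in SPLIT_COUNTS.values())

# Configurations with the most and the fewest training examples.
largest = max(SPLIT_COUNTS, key=lambda name: SPLIT_COUNTS[name]["train"])
smallest = min(SPLIT_COUNTS, key=lambda name: SPLIT_COUNTS[name]["train"])

print(total_train, largest, smallest)
```

The `train.1h` and `train.9h` subsets stay in a narrow range for every language (233 to 241 and 2110 to 2194 examples respectively), so they are the more comparable starting point for low-resource experiments across configurations.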