common_voice

  • Description:

Mozilla Common Voice Dataset

FeaturesDict({
    'accent': Text(shape=(), dtype=string),
    'age': Text(shape=(), dtype=string),
    'client_id': Text(shape=(), dtype=string),
    'downvotes': Scalar(shape=(), dtype=int32, description=Number of people who said audio does not match text),
    'gender': ClassLabel(shape=(), dtype=int64, num_classes=3),
    'segment': Text(shape=(), dtype=string),
    'sentence': Text(shape=(), dtype=string),
    'upvotes': Scalar(shape=(), dtype=int32, description=Number of people who said audio matches the text),
    'voice': Audio(shape=(None,), dtype=int64),
})
  • Feature documentation:
Feature     Class         Shape     Dtype    Description
            FeaturesDict
accent      Text                    string   Accent of the speaker, see https://github.com/common-voice/common-voice/blob/main/web/src/stores/demographics.ts
age         Text                    string   Age bucket of the speaker (e.g. teens or forties), see https://github.com/common-voice/common-voice/blob/main/web/src/stores/demographics.ts
client_id   Text                    string   Hashed UUID of a given user
downvotes   Scalar                  int32    Number of people who said audio does not match text
gender      ClassLabel              int64    Gender of the speaker
segment     Text                    string   If sentence belongs to a custom dataset segment, it will be listed here
sentence    Text                    string   Supposed transcription of the audio
upvotes     Scalar                  int32    Number of people who said audio matches the text
voice       Audio         (None,)   int64
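
One way the `upvotes` and `downvotes` fields might be used is to keep only clips that listeners mostly agreed match their sentence. The sketch below is a minimal illustration, not part of the dataset or of any library: the helper names `is_probably_valid` and `filter_valid`, and the `min_margin` threshold, are assumptions, and examples are modelled as plain dicts keyed by the feature names listed above.

```python
from typing import Iterable, Iterator, Mapping


def is_probably_valid(example: Mapping[str, object], min_margin: int = 1) -> bool:
    """Return True if upvotes exceed downvotes by at least `min_margin`.

    `min_margin` is a hypothetical threshold chosen for illustration.
    """
    up = int(example["upvotes"])
    down = int(example["downvotes"])
    return up - down >= min_margin


def filter_valid(
    examples: Iterable[Mapping[str, object]], min_margin: int = 1
) -> Iterator[Mapping[str, object]]:
    """Yield only the examples that pass the vote-margin check."""
    for example in examples:
        if is_probably_valid(example, min_margin):
            yield example


# Made-up records that only mimic the feature names; not real dataset rows.
sample = [
    {"sentence": "hello world", "upvotes": 2, "downvotes": 0},
    {"sentence": "noisy clip", "upvotes": 1, "downvotes": 2},
    {"sentence": "tie", "upvotes": 1, "downvotes": 1},
]

kept = [example["sentence"] for example in filter_valid(sample)]
print(kept)  # ['hello world']
```

A stricter or looser margin may suit different uses; the default of 1 simply requires more upvotes than downvotes.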

common_voice/en (default config)

  • Config description: Language Code: en

  • Download size: 56.45 GiB

  • Dataset size: 2.79 TiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 16,164
'test' 16,164
'train' 564,337
'validation' 1,224,864

common_voice/ab

  • Config description: Language Code: ab

  • Download size: 39.14 MiB

  • Dataset size: 133.24 MiB

  • Auto-cached (documentation): Yes

  • Splits:

Split Examples
'test' 9
'train' 22
'validation' 31
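
Each config is addressed as `common_voice/<language code>`. The sketch below shows how such a name might be built and passed to `tfds.load`, assuming the `tensorflow_datasets` package is installed. The helpers `config_name` and `load_split` are hypothetical names introduced here, and whether the download succeeds depends on your TFDS version and on the source files still being available. A small config such as `ab` (133.24 MiB prepared) is more practical for a first try than `en` (2.79 TiB prepared).

```python
def config_name(language_code: str) -> str:
    """Build the dataset/config name from a language code, e.g. 'ab'."""
    return f"common_voice/{language_code}"


def load_split(language_code: str, split: str = "train"):
    """Load one split of one language config.

    tensorflow_datasets is imported lazily so that building config names
    works even where the package is not installed (an assumption made for
    this sketch, not a TFDS requirement).
    """
    import tensorflow_datasets as tfds

    return tfds.load(config_name(language_code), split=split)


print(config_name("ab"))  # common_voice/ab

# Usage (downloads and prepares data, so it is left commented out):
# ds = load_split("ab", split="train")
# for example in ds.take(1):
#     print(example["sentence"], example["upvotes"], example["downvotes"])
```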

common_voice/ar

  • Config description: Language Code: ar

  • Download size: 1.64 GiB

  • Dataset size: 67.16 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 7,517
'test' 7,622
'train' 14,227
'validation' 43,291

common_voice/as

  • Config description: Language Code: as

  • Download size: 21.20 MiB

  • Dataset size: 1.65 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 124
'test' 110
'train' 270
'validation' 504

common_voice/br

  • Config description: Language Code: br

  • Download size: 443.72 MiB

  • Dataset size: 13.46 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 1,997
'test' 2,087
'train' 2,780
'validation' 8,560

common_voice/ca

  • Config description: Language Code: ca

  • Download size: 19.32 GiB

  • Dataset size: 1.19 TiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 15,724
'test' 15,724
'train' 285,584
'validation' 416,701

common_voice/cnh

  • Config description: Language Code: cnh

  • Download size: 153.86 MiB

  • Dataset size: 5.12 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 756
'test' 752
'train' 807
'validation' 2,432

common_voice/cs

  • Config description: Language Code: cs

  • Download size: 1.18 GiB

  • Dataset size: 56.89 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 4,118
'test' 4,144
'train' 5,655
'validation' 30,431

common_voice/cv

  • Config description: Language Code: cv

  • Download size: 418.98 MiB

  • Dataset size: 8.10 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 818
'test' 788
'train' 931
'validation' 3,496

common_voice/cy

  • Config description: Language Code: cy

  • Download size: 3.20 GiB

  • Dataset size: 128.68 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 4,776
'test' 4,820
'train' 6,839
'validation' 72,984

common_voice/de

  • Config description: Language Code: de

  • Download size: 21.68 GiB

  • Dataset size: 1.29 TiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 15,588
'test' 15,588
'train' 246,525
'validation' 565,186

common_voice/dv

  • Config description: Language Code: dv

  • Download size: 515.45 MiB

  • Dataset size: 31.59 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 2,077
'test' 2,202
'train' 2,680
'validation' 11,866

common_voice/el

  • Config description: Language Code: el

  • Download size: 363.89 MiB

  • Dataset size: 14.62 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 1,401
'test' 1,522
'train' 2,316
'validation' 5,996

common_voice/eo

  • Config description: Language Code: eo

  • Download size: 2.69 GiB

  • Dataset size: 167.14 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 8,987
'test' 8,969
'train' 19,587
'validation' 58,094

common_voice/es

  • Config description: Language Code: es

  • Download size: 15.08 GiB

  • Dataset size: 684.66 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 15,089
'test' 15,089
'train' 161,813
'validation' 236,314

common_voice/et

  • Config description: Language Code: et

  • Download size: 731.63 MiB

  • Dataset size: 37.95 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 2,507
'test' 2,509
'train' 2,966
'validation' 10,683

common_voice/eu

  • Config description: Language Code: eu

  • Download size: 3.41 GiB

  • Dataset size: 127.60 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 5,172
'test' 5,172
'train' 7,505
'validation' 63,009

common_voice/fa

  • Config description: Language Code: fa

  • Download size: 8.27 GiB

  • Dataset size: 328.61 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 5,213
'test' 5,213
'train' 7,593
'validation' 251,659

common_voice/fi

  • Config description: Language Code: fi

  • Download size: 47.57 MiB

  • Dataset size: 3.41 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 415
'test' 428
'train' 460
'validation' 1,305

common_voice/fr

  • Config description: Language Code: fr

  • Download size: 17.82 GiB

  • Dataset size: 1.17 TiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 15,763
'test' 15,763
'train' 298,982
'validation' 461,004

common_voice/fy-NL

  • Config description: Language Code: fy-NL

  • Download size: 1.15 GiB

  • Dataset size: 29.93 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 2,790
'test' 3,020
'train' 3,927
'validation' 10,495

common_voice/ga-IE

  • Config description: Language Code: ga-IE

  • Download size: 149.30 MiB

  • Dataset size: 5.11 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 497
'test' 506
'train' 541
'validation' 3,352

common_voice/hi

  • Config description: Language Code: hi

  • Download size: 20.43 MiB

  • Dataset size: 1.15 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 135
'test' 127
'train' 157
'validation' 419

common_voice/hsb

  • Config description: Language Code: hsb

  • Download size: 75.69 MiB

  • Dataset size: 5.67 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 172
'test' 387
'train' 808
'validation' 1,367

common_voice/hu

  • Config description: Language Code: hu

  • Download size: 231.51 MiB

  • Dataset size: 17.07 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 1,434
'test' 1,649
'train' 3,348
'validation' 6,457

common_voice/ia

  • Config description: Language Code: ia

  • Download size: 216.01 MiB

  • Dataset size: 14.99 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 1,601
'test' 899
'train' 3,477
'validation' 5,978

common_voice/id

  • Config description: Language Code: id

  • Download size: 453.87 MiB

  • Dataset size: 17.20 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 1,835
'test' 1,844
'train' 2,130
'validation' 8,696

common_voice/it

  • Config description: Language Code: it

  • Download size: 5.20 GiB

  • Dataset size: 316.38 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 12,928
'test' 12,928
'train' 58,015
'validation' 102,579

common_voice/ja

  • Config description: Language Code: ja

  • Download size: 145.80 MiB

  • Dataset size: 6.83 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 586
'test' 632
'train' 722
'validation' 3,072

common_voice/ka

  • Config description: Language Code: ka

  • Download size: 99.45 MiB

  • Dataset size: 7.51 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 527
'test' 656
'train' 1,058
'validation' 2,275

common_voice/kab

  • Config description: Language Code: kab

  • Download size: 15.99 GiB

  • Dataset size: 718.51 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 14,622
'test' 14,622
'train' 120,530
'validation' 573,718

common_voice/ky

  • Config description: Language Code: ky

  • Download size: 552.60 MiB

  • Dataset size: 18.70 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 1,511
'test' 1,503
'train' 1,955
'validation' 9,236

common_voice/lg

  • Config description: Language Code: lg

  • Download size: 198.55 MiB

  • Dataset size: 6.65 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 384
'test' 584
'train' 1,250
'validation' 2,220

common_voice/lt

  • Config description: Language Code: lt

  • Download size: 129.03 MiB

  • Dataset size: 4.79 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 244
'test' 466
'train' 931
'validation' 1,644

common_voice/lv

  • Config description: Language Code: lv

  • Download size: 198.66 MiB

  • Dataset size: 13.07 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 2,002
'test' 1,882
'train' 2,552
'validation' 6,444

common_voice/mn

  • Config description: Language Code: mn

  • Download size: 463.84 MiB

  • Dataset size: 22.09 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 1,837
'test' 1,862
'train' 2,183
'validation' 7,487

common_voice/mt

  • Config description: Language Code: mt

  • Download size: 405.42 MiB

  • Dataset size: 15.09 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 1,516
'test' 1,617
'train' 2,036
'validation' 5,747

common_voice/nl

  • Config description: Language Code: nl

  • Download size: 1.62 GiB

  • Dataset size: 90.20 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 4,938
'test' 5,708
'train' 9,460
'validation' 52,488

common_voice/or

  • Config description: Language Code: or

  • Download size: 189.85 MiB

  • Dataset size: 1.97 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 129
'test' 98
'train' 388
'validation' 615

common_voice/pa-IN

  • Config description: Language Code: pa-IN

  • Download size: 66.52 MiB

  • Dataset size: 1.03 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 44
'test' 116
'train' 211
'validation' 371

common_voice/pl

  • Config description: Language Code: pl

  • Download size: 3.29 GiB

  • Dataset size: 141.06 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 5,153
'test' 5,153
'train' 7,468
'validation' 90,791

common_voice/pt

  • Config description: Language Code: pt

  • Download size: 1.59 GiB

  • Dataset size: 75.64 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 4,592
'test' 4,641
'train' 6,514
'validation' 41,584

common_voice/rm-sursilv

  • Config description: Language Code: rm-sursilv

  • Download size: 263.17 MiB

  • Dataset size: 12.31 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 1,205
'test' 1,194
'train' 1,384
'validation' 3,783

common_voice/rm-vallader

  • Config description: Language Code: rm-vallader

  • Download size: 103.11 MiB

  • Dataset size: 4.89 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 357
'test' 378
'train' 574
'validation' 1,316

common_voice/ro

  • Config description: Language Code: ro

  • Download size: 249.84 MiB

  • Dataset size: 14.54 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 858
'test' 1,778
'train' 3,399
'validation' 6,039

common_voice/ru

  • Config description: Language Code: ru

  • Download size: 3.40 GiB

  • Dataset size: 175.04 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 7,963
'test' 8,007
'train' 15,481
'validation' 74,256

common_voice/rw

  • Config description: Language Code: rw

  • Download size: 39.62 GiB

  • Dataset size: 2.18 TiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 15,032
'test' 15,724
'train' 515,197
'validation' 832,929

common_voice/sah

  • Config description: Language Code: sah

  • Download size: 172.85 MiB

  • Dataset size: 9.42 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 405
'test' 757
'train' 1,442
'validation' 2,606

common_voice/sl

  • Config description: Language Code: sl

  • Download size: 212.43 MiB

  • Dataset size: 9.67 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 556
'test' 881
'train' 2,038
'validation' 4,669

common_voice/sv-SE

  • Config description: Language Code: sv-SE

  • Download size: 401.91 MiB

  • Dataset size: 18.27 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 2,019
'test' 2,027
'train' 2,331
'validation' 12,552

common_voice/ta

  • Config description: Language Code: ta

  • Download size: 648.28 MiB

  • Dataset size: 24.06 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 1,779
'test' 1,781
'train' 2,009
'validation' 12,652

common_voice/th

  • Config description: Language Code: th

  • Download size: 325.49 MiB

  • Dataset size: 18.32 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 1,922
'test' 2,188
'train' 2,917
'validation' 7,028

common_voice/tr

  • Config description: Language Code: tr

  • Download size: 592.09 MiB

  • Dataset size: 28.21 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 1,647
'test' 1,647
'train' 1,831
'validation' 18,685

common_voice/tt

  • Config description: Language Code: tt

  • Download size: 741.15 MiB

  • Dataset size: 46.85 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 2,127
'test' 4,485
'train' 11,211
'validation' 25,781

common_voice/uk

  • Config description: Language Code: uk

  • Download size: 1.13 GiB

  • Dataset size: 49.66 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 3,236
'test' 3,235
'train' 4,035
'validation' 22,337

common_voice/vi

  • Config description: Language Code: vi

  • Download size: 49.52 MiB

  • Dataset size: 1.47 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 200
'test' 198
'train' 221
'validation' 619

common_voice/vot

  • Config description: Language Code: vot

  • Download size: 7.43 MiB

  • Dataset size: 11.39 MiB

  • Auto-cached (documentation): Yes

  • Splits:

Split Examples
'train' 3
'validation' 3

common_voice/zh-CN

  • Config description: Language Code: zh-CN

  • Download size: 2.03 GiB

  • Dataset size: 122.54 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 8,743
'test' 8,760
'train' 18,541
'validation' 36,405

common_voice/zh-HK

  • Config description: Language Code: zh-HK

  • Download size: 2.58 GiB

  • Dataset size: 78.80 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 5,172
'test' 5,172
'train' 7,506
'validation' 41,835

common_voice/zh-TW

  • Config description: Language Code: zh-TW

  • Download size: 2.03 GiB

  • Dataset size: 69.06 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'dev' 2,895
'test' 2,895
'train' 3,507
'validation' 61,232