Common Voice 20 is Now Available

Mozilla Founda…
2024-12-11

The Common Voice team is honoured to announce that the 20th version of our multilingual, open speech dataset is now available.

This release adds Aragonese, IsiNdebele (sometimes also known as Southern Ndebele), Southern Sotho and Tupuri to the dataset for the first time. The dedicated language activists, translators and contributors for these new languages have done amazing work creating open speech data for their languages that anyone can build on. These additions bring the total number of languages in the Common Voice Scripted Speech dataset to 133.

This release includes contributions made through December 6th, 2024 and adds 566 new hours of speech and 515 newly validated hours of speech.

This brings the total amount of available speech data in the Common Voice dataset to 33,150 hours. Of these, 22,108 hours have had quality assurance (“validation”) crowdsourced through the community. This dataset is a monument to the power of community.

We’re always excited to hear feedback from contributors, dataset users and language activists. We are especially excited to learn more about what people are researching or building using the dataset. If you want to chat to us about it, you can join our new Discord community or email us at commonvoice@mozilla.com.

Source: Mozilla Foundation Blog
Published on 2024-12-11 21:13