🗣Voice Corpus

Our purpose

Donate and validate our voices under a public domain licence to generate a dataset that Speech to Text technologies can use to train models in different languages, democratizing voice technology.

Who we are

We are a community of voice tech enthusiasts who want to help collect and generate a large dataset of public domain voices that can be freely used to train Speech to Text technologies.

What success looks like

Collect and validate as many voices as possible in our languages. The more validated voices we have, the more advanced the Speech to Text (STT) models we can train.

  • At least 1,000 unique speakers per language.

  • 2,000 hours of validated voice to train a near-human general STT model.

  • 10,000 hours of validated voice for a very high quality, general, large vocabulary, continuous speech recognition model.

Data quantities
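As a rough illustration, the targets listed above can be checked against a language's current numbers. This is only a sketch: the `LanguageStats` record, its field names, and the `targets_met` helper are hypothetical and are not part of the Common Voice site or dataset tooling. Only the three thresholds come from the list above.

```python
from dataclasses import dataclass

# Thresholds taken from the success targets listed above.
MIN_UNIQUE_SPEAKERS = 1_000
HOURS_NEAR_HUMAN = 2_000
HOURS_HIGH_QUALITY = 10_000


@dataclass
class LanguageStats:
    """Hypothetical per-language summary; field names are illustrative."""
    unique_speakers: int
    validated_hours: float


def targets_met(stats: LanguageStats) -> dict:
    """Report which of the three targets a language has reached."""
    return {
        "unique_speakers": stats.unique_speakers >= MIN_UNIQUE_SPEAKERS,
        "near_human_hours": stats.validated_hours >= HOURS_NEAR_HUMAN,
        "high_quality_hours": stats.validated_hours >= HOURS_HIGH_QUALITY,
    }


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    print(targets_met(LanguageStats(unique_speakers=1200, validated_hours=2500.0)))
```

Each target is reported separately because the list above does not say how the three goals should be combined.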

How to join

Anyone can join this community. Join our Discourse forums or our Matrix chat and introduce yourself, then head to the Common Voice site, get familiar with it, and start donating your voice.

🔨 You don’t need any specialized skill to contribute to this community. You only need to be able to speak into a microphone or listen to audio clips.

What we do

⚠️ To have a language enabled on our site, you will need at least 5,000 validated sentences. See the previous section about the text corpus for reference.

Voice donation

We have developed a site that allows you to donate your voice by reading sentences collected by the community.

Feel free to create an account to track your progress and add more information about your voice to your profile. Demographic information helps us balance the dataset, giving machine learning researchers and engineers a way to train models that better represent the speakers of the language.

ℹ️ Please read the community guidelines to learn how to produce better voice donations.

⚠️ Note: Once you have recorded a decent number of clips in your language (around 300), you add more value for less effort by helping bring in new voices from other people and by focusing on voice validation, which will increase the dataset quality.

Voice validation

The same site allows you to review other people’s voices by listening to clips donated by the community. Each recording needs at least two positive validations from different people. Feel free to create an account to track your progress, compare with other contributors, set yourself goals, or earn award badges.
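The "two positive validations from different people" rule can be sketched as follows. This is a minimal illustration, not the site's actual implementation: the `is_validated` function and the `(reviewer_id, approved)` vote format are assumptions made for this example. How the real site treats negative votes is not described here, so the sketch simply ignores them.

```python
from typing import Iterable, Tuple


def is_validated(votes: Iterable[Tuple[str, bool]], required: int = 2) -> bool:
    """Return True once `required` distinct reviewers have approved a clip.

    `votes` is a hypothetical list of (reviewer_id, approved) pairs.
    Repeated approvals from the same reviewer count only once.
    """
    approvers = {reviewer_id for reviewer_id, approved in votes if approved}
    return len(approvers) >= required


if __name__ == "__main__":
    # Two different reviewers approved the clip.
    print(is_validated([("alice", True), ("bob", True)]))
    # The same reviewer approved twice, which counts as one validation.
    print(is_validated([("alice", True), ("alice", True)]))
```

Collecting approvers in a set is what enforces the "different people" part of the rule.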

ℹ️ Please read the community guidelines to learn how to better validate voices.

Community mobilization

You can help the community by organizing activities and encouraging others to do the same. Use the channels we have at our disposal to engage with other contributors in your language, talk about your ideas to grow the community and collect and validate more voices.

ℹ️ Check a few ideas from the Contribute to Common Voice activity.

⭐️ You can re-use any graphical material we have produced to support the project.

Community support

Help other contributors in our Discourse and Matrix channels by answering their questions about how to use the site or by helping document reported issues on GitHub.

Tooling development

The main development of our site is led by our staff team, but anyone can submit pull requests for open issues or minor UI bugs.

ℹ️ Please read the contribution guidelines before submitting any code.

Dataset releases

The complete text and voice dataset for languages where we have data is currently generated by the Common Voice staff team.

Currently, we generate a new version of the datasets twice a year and publish them on our site.

ℹ️ Note that we ask for an email address and send the link to the dataset there (instead of offering a direct download) because we want a way to contact everyone who downloaded the data in case we get deletion requests from contributors.

We understand that some people might want more frequent releases, and we are working on a more continuous release model to accommodate these needs.

Roles

These are some roles you can take as part of this community.

  • Voice donator: Donate your voice.

  • Voice validator: Help review other people’s voices.

  • Support: Join our community channels to support contributors with issues using our site.

  • Mobilizer: Help people in the community to get started and keep contributing.

  • Developer: Help by submitting code and fixes to our site.

Channels

💬 If your language already exists on Common Voice, make sure you check and join the local Discourse category and Matrix room. If they don’t exist yet, please create a new topic on Discourse asking for one to be created.
