🗣 Voice Corpus

Our purpose

Donate and validate our voices under a public domain licence to build a dataset that Speech to Text (STT) technologies can use to train models in many languages, democratizing voice technology.

Who we are

We are a community of voice tech enthusiasts who want to help collect and generate a large dataset of public domain voices that can be freely used to train Speech to Text technologies.

What success looks like

Collect and validate as many voices as possible in our languages. The more validated voices we have, the more advanced the STT models we can train. Our targets are:

  • At least 1,000 unique speakers per language.

  • 2,000 hours of validated voice to train a near-human general STT model.

  • 10,000 hours of validated voice for a very high quality, general, large vocabulary, continuous speech recognition model.
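The targets above can be read as simple thresholds. The following is a minimal sketch of how a community might track its progress against them; `LanguageStats` and `milestones` are hypothetical names made up for illustration and are not part of any Common Voice tool.

```python
from dataclasses import dataclass

# Targets quoted from the list above.
MIN_SPEAKERS = 1_000
HOURS_NEAR_HUMAN = 2_000
HOURS_HIGH_QUALITY = 10_000


@dataclass
class LanguageStats:
    """Hypothetical per-language totals (not a Common Voice type)."""
    unique_speakers: int
    validated_hours: float


def milestones(stats: LanguageStats) -> dict:
    """Report which of the three targets a language has reached."""
    return {
        "enough_speakers": stats.unique_speakers >= MIN_SPEAKERS,
        "near_human_stt": stats.validated_hours >= HOURS_NEAR_HUMAN,
        "high_quality_stt": stats.validated_hours >= HOURS_HIGH_QUALITY,
    }


if __name__ == "__main__":
    # Example figures are invented, not real dataset numbers.
    print(milestones(LanguageStats(unique_speakers=1200, validated_hours=2500.0)))
```

The speaker and hour targets are checked independently, so a language can reach the hour targets while still needing more unique speakers.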

How to join

Anyone can join this community. Join our Discourse forums or our Matrix chat and introduce yourself, then visit the Common Voice site, get familiar with it, and start donating your voice.

🔨 You don’t need any specialized skill to contribute to this community. You only need to be able to speak into a microphone or listen to audio clips.

What we do

⚠️ To have a language enabled on our site, you will need at least 5,000 validated sentences. See the previous section about the text corpus for reference.

Voice donation

We have developed a site that allows you to donate your voice by reading sentences collected by the community.

Feel free to create an account to track your progress and add more information about your voice to your profile. Demographic information helps us balance the dataset, giving machine learning researchers and engineers a way to train models that better represent the speakers of the language.

ℹ️ Please read the community guidelines to learn how to produce better voice donations.

⚠️ Note: Once you have recorded a decent number of clips in your language (around 300), you add more value for less effort by helping to bring in new voices from other people and by focusing on voice validation. This increases the quality of the dataset.

Voice validation

The same site allows you to review other people’s recordings by listening to clips donated by the community. Each recording needs at least two positive validations from different people. Feel free to create an account to track your progress, compare with other contributors, set yourself goals, or earn award badges.
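To illustrate the two-validation rule, here is a minimal sketch that counts positive votes from distinct reviewers. The vote format and the `is_validated` helper are assumptions made for this example; the site’s actual logic may differ, for instance in how it treats negative votes.

```python
# Threshold taken from the text above.
REQUIRED_POSITIVE_VALIDATIONS = 2


def is_validated(votes):
    """Return True once enough different reviewers approved a clip.

    `votes` is an iterable of (reviewer_id, approved) pairs; this
    format is hypothetical and chosen only for illustration.
    """
    # A set keeps each reviewer once, so repeat votes by the same
    # person do not count as separate validations.
    positive_reviewers = {reviewer for reviewer, approved in votes if approved}
    return len(positive_reviewers) >= REQUIRED_POSITIVE_VALIDATIONS


if __name__ == "__main__":
    print(is_validated([("ana", True), ("ben", True)]))
    print(is_validated([("ana", True), ("ana", True)]))
```

Collecting reviewer identifiers in a set is what enforces the "from different people" part of the rule.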

ℹ️ Please read the community guidelines to learn how to validate voices better.

Community mobilization

You can help the community by organizing activities and encouraging others to do the same. Use the channels we have at our disposal to engage with other contributors in your language, talk about your ideas to grow the community and collect and validate more voices.

ℹ️ Check out a few ideas from the Contribute to Common Voice activity.

⭐️ You can re-use any graphical material we have produced to support the project.

Community support

Help other contributors in our Discourse and Matrix channels by answering their questions about how to use the site or by helping document reported issues on GitHub.

Tooling development

The main development of our site is led by our staff team, but anyone can submit pull requests for open issues or minor UI bugs.

ℹ️ Please read the contribution guidelines before submitting any code.

Dataset releases

The complete text and voice dataset for languages where we have data is currently generated by the Common Voice staff team.

We currently generate a new version of the datasets twice a year and publish them on our site.

ℹ️ Note that we ask for an email address so we can send you the link to the dataset (instead of offering a direct download), because we want a way to contact everyone who has downloaded the data in case we receive deletion requests from contributors.

We understand that some people might want more frequent releases, and we are working on a more continuous release model to accommodate these needs.

Roles

These are some roles you can take as part of this community.

  • Voice donator: Donate your voice.

  • Voice validator: Help review other people’s voices.

  • Support: Join our community channels to support contributors with issues using our site.

  • Mobilizer: Help people in the community to get started and keep contributing.

  • Developer: Help by submitting code and fixes for our site.

Channels

💬 If your language already exists on Common Voice, make sure you check and join the local Discourse category and Matrix room. If there isn’t one yet, please create a new topic on Discourse asking for one to be created.
