🗣Voice Corpus
Donate and validate our voices under licence to generate a dataset that Speech to Text technologies can use to train models in different languages, democratizing voice technology.
We are a community of voice tech enthusiasts who want to help collect and generate a large dataset of public domain voices that can be freely used to train Speech to Text (STT) models.
Collect and validate as many voices as possible in our languages. Having more voices validated allows us to then train more advanced STT models.
At least 1,000 unique speakers per language.
2,000 hours of voice validated to train a near-human general STT model.
10,000 hours of voice validated for a very high quality, general, large vocabulary, continuous speech recognition model.
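As a rough illustration of how a language community might track itself against the goals above, here is a minimal sketch. The thresholds come from the list above; the function name `goal_progress` and its inputs are hypothetical and are not part of any Common Voice tooling.

```python
# Targets taken from the goals listed above.
MIN_SPEAKERS = 1_000
HOURS_GENERAL_MODEL = 2_000
HOURS_HIGH_QUALITY_MODEL = 10_000


def goal_progress(unique_speakers: int, validated_hours: float) -> dict:
    """Return the fraction (capped at 1.0) of each goal reached for one language.

    Hypothetical helper: the inputs are whatever counts the community
    has at hand for its language.
    """
    def fraction(value: float, target: float) -> float:
        return min(value / target, 1.0)

    return {
        "speakers": fraction(unique_speakers, MIN_SPEAKERS),
        "general_model": fraction(validated_hours, HOURS_GENERAL_MODEL),
        "high_quality_model": fraction(validated_hours, HOURS_HIGH_QUALITY_MODEL),
    }


if __name__ == "__main__":
    # Example: 500 unique speakers and 1,000 validated hours.
    print(goal_progress(500, 1000))
```

Capping each fraction at 1.0 keeps the report readable once a goal has been exceeded.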
🔨 You don’t need any specialized skill to contribute to this community, you only need to be able to speak into a microphone or listen to audio clips.
⚠️ In order to have a language enabled on our site, you will need at least 5000 validated sentences, see previous section about text corpus for reference.
Voice donation
Feel free to create an account to track your progress and add more information about your voice to your profile. Demographic information helps us balance the dataset, giving machine learning researchers and engineers a way to train models that better represent the speakers of the language.
⚠️ Note: Once you have recorded a decent number of clips in your language (around 300), you add more value for less effort by helping to bring in new voices from other people and by focusing on voice validation, which increases the dataset quality.
Voice validation
Community mobilization
You can help the community by organizing activities and encouraging others to do the same. Use the channels we have at our disposal to engage with other contributors in your language, talk about your ideas to grow the community and collect and validate more voices.
Community support
Tooling development
Dataset releases
The complete text and voice dataset for languages where we have data is currently generated by the Common Voice staff team.
ℹ️ Note that we are asking for an email to send the link to the dataset (instead of direct download) because we want to have a way to contact everyone who downloaded the data in case we get deletion requests from contributors.
We understand that some people might want more frequent releases, and we are working on a more continuous release model to accommodate these needs.
These are some roles you can take as part of this community.
Voice donator: Donate your voice.
Voice validator: Help review other people’s voices.
Support: Join our community channels to support contributors with issues using our site.
Mobilizer: Help people in the community to get started and keep contributing.
Developer: Help by submitting code and fixes to our site.
Anyone can join this community. Join our forum category or our Matrix chat room and introduce yourself, then jump into the site, get familiar with it and start donating your voice.
We have developed a site that allows you to donate your voice by reading sentences collected by the community.
ℹ️ Read the guidelines to know how to produce better voice donations.
The same site allows you to validate voices by listening to recordings donated by the community. Each recording needs at least two positive validations from different people. Feel free to create an account to track your progress, compare with other contributors, set yourself goals or earn award badges.
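The rule that a recording needs at least two positive validations from different people can be sketched as follows. This is only an illustration under stated assumptions, not the site's actual implementation: the name `is_validated` and the `(validator_id, is_positive)` vote format are hypothetical, and negative votes are simply ignored here because the text above does not say how they are handled.

```python
REQUIRED_POSITIVE_VALIDATIONS = 2  # "at least two positive validations"


def is_validated(votes) -> bool:
    """Check whether one clip has enough positive votes from distinct people.

    votes: iterable of (validator_id, is_positive) pairs (assumed format).
    """
    # A set keeps each validator once, so repeated votes by the same
    # person cannot validate a clip on their own.
    positive_validators = {validator_id for validator_id, is_positive in votes if is_positive}
    return len(positive_validators) >= REQUIRED_POSITIVE_VALIDATIONS


if __name__ == "__main__":
    print(is_validated([("alice", True), ("bob", True)]))
    print(is_validated([("alice", True), ("alice", True)]))
```

Counting distinct validator identifiers, rather than raw votes, is what enforces the "from different people" part of the rule.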
ℹ️ Read the guidelines to know how to better validate voices.
ℹ️ Check a few ideas from the .
⭐️ You can re-use any materials we have produced to support the project.
Help other contributors in our community channels by answering their questions about how to use the site or by helping to document reported issues.
Main development is led by our staff team, but anyone can submit pull requests for open issues or minor UI bugs.
ℹ️ Read the contribution guidelines before submitting any code.
Currently, we generate a new version of the datasets twice a year and publish them.
💬 If your language already exists on Common Voice, make sure you join its forum category and Matrix room. If that’s not the case, please create a new topic asking for one to be created.