Text Corpus
Collect or generate text corpora under a public domain licence that people can read aloud to make their voice donations.
We are a community of text collectors and creators, always looking for sources of text that we can extract and process into short, simple sentences for people to read.
Generate as many sentences as possible in our languages. Having more sentences allows contributors to donate more hours of voice data.
5,000 sentences allow 5.5 hrs of voice
9,000 sentences allow 10 hrs of voice
90,000 sentences allow 100 hrs of voice
1,800,000 sentences allow 2,000 hrs of voice
⚠️ You will need at least 5,000 validated sentences to have your language enabled for voice contributions on our voice collection site.
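The figures above work out to roughly four seconds of audio per sentence. The following Python sketch makes that arithmetic explicit. The four-second average is an assumption inferred from the table, not an official constant, and the function names are hypothetical.

```python
import math

# Assumption: about 4 seconds of recorded audio per sentence,
# inferred from the table above (9,000 sentences for 10 hours).
SECONDS_PER_SENTENCE = 4.0

# Minimum number of validated sentences stated in the warning above.
MIN_VALIDATED_SENTENCES = 5_000


def estimated_hours(sentence_count: int,
                    seconds_per_sentence: float = SECONDS_PER_SENTENCE) -> float:
    """Estimate how many hours of voice a set of sentences allows."""
    return sentence_count * seconds_per_sentence / 3600


def sentences_needed(target_hours: float,
                     seconds_per_sentence: float = SECONDS_PER_SENTENCE) -> int:
    """Estimate how many sentences are needed for a target number of hours."""
    return math.ceil(target_hours * 3600 / seconds_per_sentence)


def meets_minimum(validated_sentences: int) -> bool:
    """Check a validated sentence count against the 5,000 sentence minimum."""
    return validated_sentences >= MIN_VALIDATED_SENTENCES


if __name__ == "__main__":
    for count in (9_000, 90_000, 1_800_000):
        print(f"{count:>9,} sentences: about {estimated_hours(count):,.0f} hrs")
    print("Sentences for 10 hrs:", sentences_needed(10))
    print("4,200 validated sentences enough?", meets_minimum(4_200))
```

If recordings in your language tend to be longer or shorter, pass a different `seconds_per_sentence` value to get a more realistic estimate.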
Anyone can join this community. Join our Discourse category or our Matrix chat room, introduce yourself and jump into our sentence tools right away.
Sentence extraction
This is the easiest and fastest way to get more than a million sentences for your language.
⚠️ Important: For legal reasons, Mozilla needs to be the one running the final extraction, so please don’t do any manual processing on the resulting extraction during your tests. We can apply manual clean-up after the final version is generated by Mozilla.
🔨 Skills required to help: Command-line usage, git, and familiarity with regular expressions.
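To give a feel for the kind of regular-expression work involved, here is a minimal Python sketch of rule-based sentence filtering. Everything in it is an illustrative assumption: the word limits, the patterns and the function names are hypothetical, and it does not reproduce the sentence extractor's actual rule format or defaults. It is meant for understanding only, not for processing a final extraction.

```python
import re

# Hypothetical limits and patterns, chosen only for illustration.
MIN_WORDS = 1
MAX_WORDS = 14
DIGITS = re.compile(r"\d")
DISALLOWED_SYMBOLS = re.compile(r"[<>+*#@%^\[\]()/=|{}~_\\]")
ACRONYM = re.compile(r"\b[A-Z]{2,}\b")
# Naive splitter: break after '.', '!' or '?' followed by whitespace.
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> list[str]:
    """Split raw text into candidate sentences."""
    return [s.strip() for s in SENTENCE_SPLIT.split(text) if s.strip()]


def is_acceptable(sentence: str) -> bool:
    """Apply the illustrative rules to one candidate sentence."""
    words = sentence.split()
    if not MIN_WORDS <= len(words) <= MAX_WORDS:
        return False  # too short or too long to read comfortably
    if DIGITS.search(sentence):
        return False  # numbers can be read aloud in several ways
    if DISALLOWED_SYMBOLS.search(sentence):
        return False  # symbols are hard to pronounce consistently
    if ACRONYM.search(sentence):
        return False  # acronyms are ambiguous when read aloud
    return True


def extract(text: str) -> list[str]:
    """Return unique acceptable sentences in their original order."""
    seen: set[str] = set()
    result: list[str] = []
    for sentence in split_sentences(text):
        if is_acceptable(sentence) and sentence not in seen:
            seen.add(sentence)
            result.append(sentence)
    return result


if __name__ == "__main__":
    sample = ("The cat sat on the mat. It was born in 1999. "
              "NASA launched a rocket. The cat sat on the mat.")
    print(extract(sample))
```

Rules like these differ a lot between languages, which is why each language needs its own set.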
Sentence collection
🔨 Skills required to help: Strong grammar knowledge of the target language you are contributing to.
Large corpus validation
If you have found an existing public domain corpus bigger than 100K sentences, we have an independent process to handle it, since we understand that manual validation using the sentence collector is not ideal.
🔨 Skills required to help: Expertise in processing and cleaning up text, linguistics/language expertise to check the quality of the resulting sentences.
Tooling development
Contributors also develop, maintain and update the sentence extractor and collector code.
These are some roles you can take as part of this community.
Text searcher - Find and connect with sources and organizations that have, or are willing to donate, text corpora under a public domain licence.
Text creator - Generate your own sentences and release them into the public domain.
Validator - Help validate and review existing cleaned-up sentences.
Mobilizer - Help people in the community to get started and keep contributing.
Developer - Develop, maintain and update the sentence tooling.
We have developed a sentence extractor that pulls sentences from large sources of public domain text, with a focus on easy-to-read corpora and Wikipedia.
ℹ️ Please read on how to generate specific rules for your language.
We have also created a sentence collector that allows contributors to collect and validate sentences created by the community. You can also use this tool to import and clean up small-to-medium-sized public domain corpora you have found or collected.
ℹ️ Please read before using this tool and check the .
ℹ️ Please create a new topic on Discourse, so we can evaluate whether your corpus fits the licence and size requirements to run this process.
Sentence Extractor: 🐞 - 🔨 Skills needed: Rust
Sentence Collector: 🐞 - 🔨 Skills needed: React, JavaScript, Node.js
Text processor - Cleaning up the raw text corpus to apply .
💬 If your language already exists on Common Voice, make sure you join its Discourse category and Matrix room. If that’s not the case, please create a new topic asking for one to be created.