📝 Text Corpus

Our purpose

Collect or generate text corpora under a public domain licence that people can read aloud when donating their voice.

Who we are

We are a community of text collectors and creators, always looking for sources of text that we can extract and process into short, simple sentences for people to read.

What success looks like

Generate as many sentences as possible in our languages. Having more sentences allows contributors to donate more hours of voice data.

  • 5,000 sentences allow 5.5 hrs of voice

  • 9,000 sentences allow 10 hrs of voice

  • 90,000 sentences allow 100 hrs of voice

  • 1,800,000 sentences allow 2,000 hrs of voice

⚠️ You will need at least 5,000 validated sentences to have your language enabled for voice contributions on our voice collection site.
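
The figures above work out to roughly 900 sentences per hour of voice (9,000 sentences for 10 hrs). As a rough planning aid, the arithmetic can be sketched as below. The ratio is inferred from the list, not an official constant, and the function names are made up for illustration.

```python
# Rough estimate only: the ratio is inferred from the figures listed above
# (9,000 sentences for about 10 hours); it is an assumption, not an official constant.
SENTENCES_PER_HOUR = 900

# Threshold stated above for enabling a language on the voice collection site.
MIN_VALIDATED_SENTENCES = 5_000


def estimated_hours(sentence_count: int) -> float:
    """Approximate hours of voice data a corpus of this size can support."""
    return sentence_count / SENTENCES_PER_HOUR


def can_enable_language(validated_sentences: int) -> bool:
    """True once the validated-sentence threshold is reached."""
    return validated_sentences >= MIN_VALIDATED_SENTENCES


if __name__ == "__main__":
    for count in (5_000, 9_000, 90_000, 1_800_000):
        print(f"{count:>9,} sentences ~ {estimated_hours(count):,.1f} hrs")
```

With this ratio, 5,000 sentences come out slightly above the 5.5 hrs listed, so treat the result as an approximation.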

How to join

Anyone can join this community. Join our Discourse forums or our Matrix chat, introduce yourself, and jump into our sentence tools right away.

What we do

Sentence extraction

We have developed a tool to extract sentences from large sources of public domain text, with a focus on easy-to-read corpora and Wikipedia.

This is the easiest and fastest way to get more than a million sentences for your language.

ℹ️ Please read the tool documentation on how to generate specific rules for your language.

⚠️ Important: For legal reasons, Mozilla needs to be the one running the final extraction, so please don’t manually process the extraction output during your tests. We can apply manual clean-up after Mozilla generates the final version.

🔨 Skills required to help: Command line usage and git, familiarity with regular expressions.
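
To give a feel for what language-specific rules do, here is a minimal Python sketch of rule-based sentence filtering. It is not the extractor's code or its rule format (see the tool documentation for those). The word limits, the disallowed-symbol pattern and all function names are hypothetical, chosen only to illustrate the idea of keeping short and simple sentences.

```python
import re

# Hypothetical rules for illustration only; the real extractor reads its own
# per-language rules, whose format is described in the tool documentation.
MIN_WORDS = 3
MAX_WORDS = 14
# Digits and symbols that are awkward to read aloud (an assumed list).
DISALLOWED = re.compile(r"[0-9<>@#&/\\\[\]{}]")
# Naive sentence boundary: whitespace that follows ., ! or ?
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> list[str]:
    """Split raw text into candidate sentences on simple end punctuation."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def is_readable(sentence: str) -> bool:
    """Keep sentences within the word limits and free of disallowed symbols."""
    words = sentence.split()
    if not MIN_WORDS <= len(words) <= MAX_WORDS:
        return False
    return DISALLOWED.search(sentence) is None


def extract(text: str) -> list[str]:
    """Return only the candidate sentences that pass every rule."""
    return [s for s in split_sentences(text) if is_readable(s)]


if __name__ == "__main__":
    sample = "The cat sat on the mat. It was born in 1999. Hello!"
    print(extract(sample))
```

A sketch like this is only useful for reasoning about which rules your language needs; the final extraction still has to be run by Mozilla with the actual tool.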

Sentence collection

We have also created a sentence collection tool that lets contributors collect and validate sentences created by the community. You can also use this tool to import and clean up small-to-medium-sized public domain corpora you have found or collected.

ℹ️ Please read the collector how-to before using this tool and check the community guidelines on how to validate sentences.

🔨 Skills required to help: Strong grammar knowledge of the target language you are contributing to.

Large corpus validation

If you have found an existing public domain corpus larger than 100,000 sentences, we have a separate process to handle it, since manual validation with the sentence collector is not practical at that size.

ℹ️ Please create a new topic on our Discourse so we can evaluate whether your corpus meets the licence and size requirements for this process.

🔨 Skills required to help: Expertise in processing and cleaning up text, plus linguistics or language expertise to check the quality of the resulting sentences.

Tooling development

Contributors also develop, maintain and update the sentence extractor and collector code.

  • Sentence Extractor: 🐞 Open issues - 🔨 Skills needed: Rust

  • Sentence Collector: 🐞 Open issues - 🔨 Skills needed: React, JavaScript, Node.js

Roles

These are some roles you can take as part of this community.

  • Text searcher - Find and connect with sources and organizations that have, or are willing to donate, text corpora under a public domain licence.

  • Text processor - Clean up the raw text corpus so it meets our sentence requirements.

  • Text creator - Generate your own sentences and release them into the public domain.

  • Validator - Help validate and review existing cleaned-up sentences.

  • Mobilizer - Help people in the community to get started and keep contributing.

  • Developer - Develop, maintain and update the sentence tooling.

Channels

💬 If your language already exists on Common Voice, make sure you check and join the local Discourse category and Matrix room. If there isn’t one yet, please create a new topic on Discourse asking for one to be created.
