mGhana-ST is an open-source African language speech dataset containing more than 7,800 annotated audio samples in Twi, Ewe, and Ga, published on Hugging Face under an MIT license. It is designed for automatic speech recognition (ASR), speech-to-text translation, speaker identification, and low-resource language modeling.

What is UGSpeechData?

UGSpeechData is a multilingual speech corpus created by researchers at the University of Ghana, led by Isaac Wiafe and colleagues at the Department of Computer Science. The corpus contains more than 970,000 audio files across five Ghanaian languages (Akan, Ewe, Dagbani, Dagaare, and Ikposo), totaling more than 5,300 hours of recorded speech, of which approximately 500 hours are transcribed. Audio was collected from indigenous speakers describing culturally relevant images through a custom Android application developed by the University of Ghana team. AdwumaTech maintains a structured subset on Hugging Face, built in collaboration with the university, to support downstream ASR and speech translation development.

How many downloads does mGhana-ST have?

In March 2026, mGhana-ST recorded more than 12,330 downloads on Hugging Face, making it one of the most downloaded African language speech datasets on the platform that month. Averaged across its three languages, that is 4,110 downloads per language, compared with 316 per language for Google's FLEURS and 526 per language for Google's WAXANLP.

What languages does mGhana-ST support?

The current release supports Twi, Ewe, and Ga. Planned expansion includes Dagbani, Dagaare, and additional Ghanaian and West African languages.

Who annotates the mGhana-ST dataset?

Every sample is annotated by software engineers and linguists with native fluency in the target language, capturing tonal semantics, code-switching patterns, and dialectal variation.

What is Nebula?

Nebula is AdwumaTech AI's open-source API marketplace, currently in development. It is designed to provide developer-first access to AdwumaTech's suite of engineering APIs, including African language speech capabilities, through SDKs, documentation, and community-driven contribution.

Can I use mGhana-ST for commercial applications?

Yes. mGhana-ST is released under an MIT license, which permits commercial and research use subject to the terms of that license.

What is the best open-source speech dataset for Twi?

mGhana-ST is one of the most widely used open-source speech datasets for Twi. It contains curated audio samples annotated by software engineers and linguists with native fluency, paired with English translations and non-verbal event tags, making it suitable for production-grade ASR and speech translation research.

mGhana: Open-Source Speech Dataset for Ghanaian Languages

In March 2026, mGhana-ST was one of the most downloaded African language speech datasets on Hugging Face. Here is what it contains, how it was built, and why it matters.

The Data Gap

Frontier models handle English, Mandarin, Spanish, French, and German with remarkable accuracy. Ask those same models to transcribe a medical consultation conducted in Twi, route an emergency call spoken in Ewe, or process a mobile banking transaction initiated by voice in Dagbani, and they fail completely.

This is a data problem. The architectures exist. The compute exists. The labeled training data for Ghanaian languages does not.

Akan serves as a lingua franca across much of southern Ghana. Ewe is spoken across Ghana and Togo. Dagbani anchors communication across the Northern Region. Together, these languages underpin daily commerce, healthcare delivery, government services, and family life for tens of millions of people. The infrastructure to build AI systems that work in these languages has not kept pace with the demand.

That is the gap mGhana was built to close.

What mGhana Contains

mGhana-ST is an open-source speech dataset published on Hugging Face containing more than 7,800 curated audio samples across three Ghanaian languages: Twi, Ewe, and Ga. Each sample is paired with a verified English translation and tagged for non-verbal events including background noise, speaker overlap, and environmental context.
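To make that description concrete, one sample could be modeled roughly as follows. This is only a sketch: the field names (`audio_path`, `language`, `english_translation`, `events`), the language codes, and the tag strings are assumptions chosen for illustration and may not match the actual column names on Hugging Face. The dataset card is the place to check the real schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical codes for the three languages in the current release.
SUPPORTED_LANGUAGES = {"twi", "ewe", "ga"}


@dataclass
class SpeechSample:
    """Illustrative shape of one sample; every field name is an assumption."""

    audio_path: str
    language: str
    english_translation: str
    # Non-verbal event tags, e.g. "background_noise" or "speaker_overlap".
    events: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject languages outside the current three-language release.
        if self.language not in SUPPORTED_LANGUAGES:
            raise ValueError(f"unsupported language: {self.language!r}")


sample = SpeechSample(
    audio_path="clips/example_0001.wav",  # placeholder path
    language="twi",
    english_translation="Good morning.",  # placeholder text
    events=["background_noise"],
)
```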

The dataset is engineered for four downstream tasks: automatic speech recognition (ASR), speech-to-text translation, speaker identification, and low-resource language modeling.

Alongside mGhana-ST, we maintain UGSpeechData on Hugging Face, a structured repository built from the University of Ghana's multilingual speech corpus. The original collection spans five Ghanaian languages (Akan, Ewe, Dagbani, Dagaare, and Ikposo) with more than 970,000 audio files totaling more than 5,300 hours of recorded speech, of which approximately 500 hours are transcribed. We restructure subsets of this data for direct use in our pipeline and for public access.

Both datasets are MIT-licensed.

Adoption

In March 2026, mGhana-ST recorded more than 12,330 downloads on Hugging Face, making it one of the most downloaded African language speech datasets on the platform that month. That is 4,110 downloads per language, compared to 316 per language for Google's FLEURS (a global benchmark spanning 102 languages) and 526 per language for Google's WAXANLP (an African language speech corpus).
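The per-language figure is the monthly total divided by the number of languages in the dataset. A quick check using only the numbers quoted above (the ratios are derived here for illustration and are approximate, since the total is stated as "more than" 12,330):

```python
# Figures quoted in the text for March 2026.
total_downloads = 12_330
languages = 3  # Twi, Ewe, Ga

per_language = total_downloads / languages
print(per_language)  # 4110.0

# Per-language figures quoted in the text for the two comparison datasets.
fleurs_per_language = 316
waxanlp_per_language = 526

ratio_fleurs = per_language / fleurs_per_language
ratio_waxanlp = per_language / waxanlp_per_language
print(round(ratio_fleurs, 1), round(ratio_waxanlp, 1))  # 13.0 7.8
```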

Every sample is annotated by software engineers and linguists with native fluency in the target language. The data captures tonal semantics, natural code-switching, and dialectal variation. That is the standard the market is selecting for.

The constraint on Ghanaian language AI has never been demand. It has been infrastructure.

Why the Annotation Methodology Matters

Building speech data for Twi or Ewe requires a fundamentally different annotation pipeline than building for English. The challenges are linguistic, not logistical.

Tonal semantics

Twi and Ewe are tonal languages where identical phoneme sequences carry different meanings depending on pitch contour. An annotation pipeline designed for non-tonal languages will produce data that trains models to recognize sounds without understanding meaning. Our pipeline captures tonal variation at the segment level, a requirement for any ASR system intended for real-world deployment.

Natural code-switching

Ghanaian speakers routinely shift between their native language and English within a single utterance. A customer service call might begin in Twi, shift to English for a technical term, and return to Twi mid-sentence. Most annotation frameworks treat this as noise. Ours treats it as signal, preserving the switching patterns that any production model will encounter.
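One way to treat code-switching as signal is to keep a language label on each segment of an utterance instead of discarding the English spans. The sketch below does not reproduce AdwumaTech's annotation format: the `Segment` type, the language codes, and the placeholder strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    text: str
    lang: str  # hypothetical codes such as "twi" or "en"


def switch_points(segments: List[Segment]) -> int:
    """Count how often the language changes between adjacent segments."""
    return sum(1 for a, b in zip(segments, segments[1:]) if a.lang != b.lang)


# Placeholder utterance: Twi, then an English technical term, then Twi again.
utterance = [
    Segment("<Twi opening>", "twi"),
    Segment("PIN", "en"),
    Segment("<Twi continuation>", "twi"),
]

print(switch_points(utterance))  # 2
```

Keeping the labels per segment means the switch positions survive into training data rather than being collapsed into a single-language transcript.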

Dialectal variation

Akan is not a single language but a continuum of closely related dialects: Asante Twi, Fante, Akuapem Twi, and others. A dataset that treats "Twi" as monolithic will train models that work in one dialect and fail in the next. Our annotation captures dialectal markers so downstream models can generalize across the family.

Non-verbal event tagging

Real-world audio contains background noise, overlapping speakers, laughter, pauses, and environmental interference. Datasets that strip these out produce models that perform well on clean recordings and collapse in production. Every sample in mGhana-ST is tagged for non-verbal events, ensuring the data trains systems built for actual operating conditions.
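Because the tags are kept rather than stripped, a model can be scored separately on clean and noisy audio. A minimal sketch, assuming each sample is a dict with an `events` list. That key, the `"clean"` bucket name, and the tag strings are illustrative assumptions, not the published schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def bucket_by_event(samples: Iterable[dict]) -> Dict[str, List[dict]]:
    """Group samples by event tag; untagged samples go to a "clean" bucket."""
    buckets: Dict[str, List[dict]] = defaultdict(list)
    for sample in samples:
        tags = sample.get("events") or ["clean"]
        for tag in tags:
            buckets[tag].append(sample)
    return dict(buckets)


# Placeholder samples with hypothetical tag names.
samples = [
    {"id": 1, "events": []},
    {"id": 2, "events": ["background_noise"]},
    {"id": 3, "events": ["background_noise", "speaker_overlap"]},
]

buckets = bucket_by_event(samples)
print({tag: [s["id"] for s in group] for tag, group in buckets.items()})
```

A sample with several tags appears in each matching bucket, so per-condition error rates can be reported side by side.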

Commercial Applications

The commercial demand for Ghanaian language AI is not emerging. It is overdue.

Financial services

Ghana's mobile money ecosystem processed GHS 4.54 trillion in 2025, according to Bank of Ghana data. That is not a typo. Mobile money now handles nearly three times the combined transaction value of cheques, internet banking, instant pay, and ACH transfers in Ghana. MTN MoMo alone commands 73% market share. There are 74.1 million registered mobile money accounts in a country of 34 million people, with 59.7% of adults actively using digital wallets. The dominant transaction channel is USSD, accounting for 55% of all mobile money activity. Every one of those USSD sessions is a voice-adjacent interaction constrained by text menus in English. Voice-enabled banking in Twi, Ewe, and Ga would not merely improve the experience. It would unlock access for the millions of users who navigate these systems despite, not because of, the language they operate in. This is exactly the kind of system AdwumaTech builds: from annotated training datasets through to deployed voice and text interfaces.

Healthcare

Ghana's Community-based Health Planning and Services (CHPS) program operates more than 6,500 compounds across 5,062 functional zones, staffed by 2,523 trained Community Health Officers supported by 19,411 active Community Health Volunteers. These frontline workers deliver primary care, maternal health services, immunization, and health education in local languages at the household level. CHPS has been shown to reduce under-5 mortality, increase skilled birth attendance by 56%, and expand family planning adoption, particularly among the poorest populations. AI-powered clinical decision support, voice-based patient intake, and automated health information lines deployed through this network would reach communities where the nearest hospital is hours away. The missing piece is speech recognition and NLP that works in Twi, Ewe, Dagbani, and Ga. That is our engineering focus.

Government services

Ghana's digitization agenda under the Ghana Digital Acceleration Project targets expanded online access to public services. The National Identification Authority has registered nearly 19 million citizens in the Ghana Card system. The Ghana.GOV platform currently hosts more than 1,500 government services with plans to scale to 16,000, fully integrated with the Ghana Card for identity verification. For a multilingual population, the next layer is voice: interfaces in Twi, Ewe, Ga, Dagbani, and other national languages that make these systems accessible beyond English-literate urban populations. Without voice infrastructure, digital government risks replicating the exclusion it was designed to eliminate. AdwumaTech delivers the data annotation, model training, and custom AI development that make multilingual government services operational.

Continental context

Mobile money transactions across Africa reached $1.43 trillion in 2025, a 27% year-on-year increase. The continent accounts for 66% of global mobile money transaction value and 74% of global transaction volume. West Africa alone processed $498 billion across 517 million accounts. This is not a niche. It is the largest mobile financial ecosystem on earth, and it operates overwhelmingly in languages that current AI systems cannot process. The infrastructure to change that, from dataset creation to production-ready applications, is what AdwumaTech exists to provide.

What Comes Next

The current release covers Twi, Ewe, and Ga. The roadmap includes expansion to Dagbani, Dagaare, and additional Ghanaian languages, with longer-term plans to extend coverage across West Africa.

The next release will incorporate expanded speaker demographics, longer-form conversational data, and domain-specific samples for financial services, healthcare, and government service delivery.

On the infrastructure side, we are building Nebula, an open-source API marketplace for AdwumaTech's engineering APIs. mGhana and UGSpeechData are the data layer. Nebula is the interface layer: developer-first SDKs, documentation, and community-driven contribution that let developers integrate African language speech capabilities into their applications without rebuilding the pipeline.

Access the Data

mGhana-ST and UGSpeechData are available now on Hugging Face, both under the MIT license.

For researchers and developers: Access mGhana-ST on Hugging Face. Load it directly:

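# Requires the Hugging Face `datasets` package (pip install datasets).
from datasets import load_dataset

# Downloads the dataset from the Hugging Face Hub on first use and caches it locally.
dataset = load_dataset("adwumatech-ai/mghana-st")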

For enterprise organizations requiring African language datasets, annotation services, or speech AI training data: contact our team.

For government agencies, financial institutions, healthcare providers, and organizations requiring custom AI applications built on African language speech and text infrastructure: contact our team.

12,330 Downloads in March. Three Languages. One Dataset Built from Accra.