Although AI seems to be everywhere today, it still supports only a tiny fraction of the world's roughly 7,000 languages, leaving millions of speakers without proper access. NVIDIA now wants to close this gap, with a special focus on Europe.
The company has introduced a new collection of open-source tools that allow developers to build advanced speech AI for 25 European languages. While major tongues are included, the real breakthrough is in support for underrepresented ones such as Croatian, Estonian, and Maltese.
The mission is clear: empower developers to create the kind of voice-enabled tools many already use daily—chatbots that understand, customer support systems that respond quickly, and translation services that work almost instantly.
At the centre of this launch is Granary, a massive library of human speech. With roughly a million hours of curated audio, it is designed to help AI grasp the subtleties of recognition and translation across languages.
To unlock Granary’s potential, NVIDIA has also released two specialised AI models:
- Canary-1b-v2, a large model optimised for high-accuracy transcription and translation.
- Parakeet-tdt-0.6b-v3, tuned for real-time speed where low latency is essential.
For those eager to explore the research, the Granary paper will be unveiled at the Interspeech conference in the Netherlands this month. Meanwhile, the dataset and models are already available to developers on Hugging Face.
The real innovation lies in how this dataset was created. Training AI usually requires painstaking manual annotation, which is both costly and slow. To overcome this, NVIDIA’s speech AI team—partnering with Carnegie Mellon University and Fondazione Bruno Kessler—built an automated pipeline. Using the NeMo toolkit, they transformed raw audio into structured, AI-ready data with minimal human labour.
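NVIDIA has not published the internals of that pipeline here, but the general shape of such automated labelling is well known: a seed model transcribes raw audio, and only the transcriptions it is confident about are kept as training data, cutting out most manual annotation. The sketch below is illustrative only; the `Segment` class, `pseudo_label` function, and the confidence threshold are assumptions, not NeMo APIs.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Segment:
    """A slice of raw audio plus the machine-generated label attached to it."""
    audio_id: str
    text: str = ""
    confidence: float = 0.0

def pseudo_label(
    segments: List[Segment],
    transcribe: Callable[[Segment], Segment],
    min_confidence: float = 0.9,
) -> List[Segment]:
    """Label raw segments with a seed ASR model, then keep only the
    transcriptions the model itself is confident about; the rest are
    dropped instead of being sent to human annotators."""
    labelled = [transcribe(s) for s in segments]
    return [s for s in labelled if s.confidence >= min_confidence]

# Toy stand-in for a seed ASR model: clean audio gets high confidence.
def fake_asr(seg: Segment) -> Segment:
    conf = 0.95 if seg.audio_id.startswith("clean") else 0.4
    return Segment(seg.audio_id, text=f"transcript of {seg.audio_id}", confidence=conf)

raw = [Segment("clean_001"), Segment("noisy_002"), Segment("clean_003")]
dataset = pseudo_label(raw, fake_asr)
print([s.audio_id for s in dataset])
```

In a real pipeline the filtered-out segments are often re-labelled by a stronger model or recycled in a later pass, rather than discarded outright.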
This marks a major step forward for digital inclusion. Developers in cities like Riga or Zagreb can now build reliable, voice-driven applications in their own language, faster and with fewer resources. In NVIDIA's tests, Granary reached the same accuracy as other leading datasets with roughly half the training data.
The two models highlight this efficiency. Canary delivers translation and transcription accuracy on par with models three times larger, while operating up to ten times faster. Parakeet can process an entire 24-minute meeting recording seamlessly, automatically recognising the spoken language. Both models also manage punctuation, capitalisation, and word-level timestamps—features crucial for professional tools.
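The article does not describe how Parakeet handles a 24-minute file internally, but a common approach to long-form audio is to split the recording into overlapping windows, transcribe each, and merge the results using word-level timestamps. A minimal sketch of that chunking step; the window and overlap sizes here are illustrative, not Parakeet's actual values:

```python
from typing import List, Tuple

def chunk_audio(
    total_seconds: float,
    window: float = 300.0,   # 5-minute windows (illustrative)
    overlap: float = 15.0,   # overlap so boundary words appear in both chunks
) -> List[Tuple[float, float]]:
    """Split a long recording into overlapping (start, end) windows.
    The overlap lets a merging step reconcile words that were cut at a
    chunk boundary, using each word's timestamp to deduplicate."""
    step = window - overlap
    chunks = []
    start = 0.0
    while start < total_seconds:
        chunks.append((start, min(start + window, total_seconds)))
        start += step
    return chunks

# A 24-minute (1,440-second) meeting recording:
for start, end in chunk_audio(24 * 60):
    print(f"{start:7.1f}s to {end:7.1f}s")
```

Word-level timestamps are what make the merge step (and features like clickable transcripts) possible, which is why the models emitting them natively matters for professional tools.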
By releasing both the data and methodology openly, NVIDIA isn’t just offering another product. It’s igniting a new era of speech AI innovation, aiming for a future where technology understands every language—no matter where in the world it’s spoken.