Open Source Acoustic Models for German Distant Speech Recognition

Open acoustic models and speech data for German speech recognition

In the course of the BMBF project Dialog+, the LT and the Telecooperation groups have developed acoustic models for German distant speech recognition. These have been built with the open source toolkits Sphinx and Kaldi. Unfortunately, the German data resources needed to train such acoustic models are rarely open source or easily accessible. We therefore recorded our own German speech data corpus, which we have now released under an open source license (CC-BY). Pretrained models and the scripts to generate them are also available (see the download links below) and are released under the same permissive CC-BY license.

Update: We continued our efforts to build open source German acoustic models at Universität Hamburg and added speech data from the Spoken Wikipedia Corpus (SWC) project to our training recipes. Note that we have discontinued the pretrained Sphinx models and now offer only pretrained Kaldi models, since Kaldi has become the de facto standard toolkit for (open source) automatic speech recognition.

The recording of our own speech data corpus was supported by the BMBF project Dialog+:

Project homepage: dialogplus.eu
GitHub project page with new and current Kaldi models: kaldi-tuda-de project

**Summary of collected data (March 2015)**
Overall duration per microphone:	about 36 hours (31 hrs train / 2.5 hrs dev / 2.5 hrs test)
Number of microphones:	3 (Microsoft Kinect, Yamaha, Samson)
Number of wave files per microphone:	about 14500
Overall number of participants:	180 (130 male / 50 female)
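
As a rough sanity check on these figures, the following minimal Python sketch derives the average recording length and the relative size of each split. It assumes one utterance per wave file and uses only the rounded numbers quoted above; the function and constant names are illustrative and not part of the released scripts.

```python
# Figures quoted in the summary (per microphone, approximate).
TOTAL_HOURS = 36.0
SPLIT_HOURS = {"train": 31.0, "dev": 2.5, "test": 2.5}
WAVE_FILES = 14500


def mean_utterance_seconds(total_hours: float, n_files: int) -> float:
    """Average recording length in seconds, assuming one utterance per file."""
    return total_hours * 3600.0 / n_files


def split_shares(split_hours: dict[str, float]) -> dict[str, float]:
    """Fraction of the total duration that falls into each split."""
    total = sum(split_hours.values())
    return {name: hours / total for name, hours in split_hours.items()}


if __name__ == "__main__":
    mean_len = mean_utterance_seconds(TOTAL_HOURS, WAVE_FILES)
    print(f"mean utterance length: {mean_len:.1f} s")
    for name, share in split_shares(SPLIT_HOURS).items():
        print(f"{name}: {share:.1%}")
```

With the rounded figures this works out to roughly 8.9 seconds per file and a split of about 86 / 7 / 7 percent for train, dev and test.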

What is the difference to the freely available German Voxforge corpus?

We have recorded all our speech data under controlled conditions: same room, same microphone distances, ...
We recorded with three microphones in parallel. An additional signal was recorded with beamforming and noise reduction enabled (Microsoft Kinect).
The data is curated to reduce speaking errors and artefacts.
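
Because every utterance was captured by several microphones at once, it is often useful to group the parallel recordings of one utterance before training or evaluation. The sketch below shows one way to do this in Python. The naming scheme `<utterance-id>_<microphone>.wav` and the microphone labels used as file name suffixes are assumptions made for illustration; check the layout of the extracted corpus and adapt the parsing accordingly.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_utterance(paths):
    """Map each utterance id to {microphone: path}.

    Assumes the hypothetical naming scheme '<utterance-id>_<microphone>.wav'.
    """
    groups = defaultdict(dict)
    for path in paths:
        p = PurePosixPath(path)
        if p.suffix.lower() != ".wav":
            continue  # skip transcripts and other non-audio files
        utt_id, sep, mic = p.stem.rpartition("_")
        if not sep:
            continue  # file name does not follow the assumed scheme
        groups[utt_id][mic] = path
    return dict(groups)


def complete_groups(groups, microphones):
    """Keep only utterances for which every expected microphone is present."""
    wanted = set(microphones)
    return {utt: mics for utt, mics in groups.items() if wanted <= set(mics)}


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    files = [
        "train/utt1_Kinect.wav",
        "train/utt1_Yamaha.wav",
        "train/utt1_Samson.wav",
        "train/utt1.xml",
        "train/utt2_Kinect.wav",
    ]
    groups = group_by_utterance(files)
    print(complete_groups(groups, ["Kinect", "Yamaha", "Samson"]))
```

Filtering for complete groups keeps comparisons between microphones fair, since each retained utterance is then available from every device.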

Downloads

Complete speech data corpus 2014/2015 (WAV and XML files, tar.gz, ~17 GB, version 2)
Additional text resources for a German language model (gzip-txt, 8 million sentences)
New GitHub project page with current models and Kaldi training recipes: kaldi-tuda-de project

People

Chris Biemann
Dirk Schnelle-Walka (now at S1nn GmbH & Co KG)
Stephan Radeck-Arneth (now at MaibornWolff GmbH)
Benjamin Milde

Language Technology Group (LT)
