The German corpus is located at /projects/speech/corpus/VM1.
This directory contains a subdirectory for each conversation in the corpus. Some of these are in Japanese, English, or a mix of English, German, and Denglisch; we only want data from directories whose names start with k, l, m, n, g, or w (not all of which exist in the version of the corpus we have). Documentation is in the zipped file VM1/CLARINDocu.zip.
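The directory selection described above can be sketched as follows. This is a minimal illustration, not the code in our scripts: the function name `select_conversation_dirs` is hypothetical, and it assumes the prefixes are lowercase and matched case-sensitively.

```python
from pathlib import Path

# Prefixes of the conversation directories we want to keep.
# Assumption: directory names use lowercase prefixes.
WANTED_PREFIXES = ("k", "l", "m", "n", "g", "w")


def select_conversation_dirs(corpus_root):
    """Return the subdirectories of corpus_root whose names start with a wanted prefix.

    Plain files are skipped even if their names match, and missing
    prefixes simply contribute no entries.
    """
    root = Path(corpus_root)
    return sorted(
        p for p in root.iterdir()
        if p.is_dir() and p.name.startswith(WANTED_PREFIXES)
    )


if __name__ == "__main__":
    for conv_dir in select_conversation_dirs("/projects/speech/corpus/VM1"):
        print(conv_dir.name)
```

Because `str.startswith` accepts a tuple, adding or removing a prefix only requires changing `WANTED_PREFIXES`.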
The working directory for the VM1 recipe that we’re building is kaldi-master/egs/vm1. Here you will find our version of run.sh and the various scripts it uses, most of which are in vm1/local; the rest are in pre-existing directories elsewhere in kaldi-master.
The script local/vm1_data_prep.py generates all of the necessary files in the appropriate format for Kaldi to process and places them in data/. Most of the data can be pulled either from the JSON files in the VM1 directory or from the documentation and metadata in VM1/CLARINDocu.zip, which contains a lexicon, speaker information, etc.
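To illustrate what "appropriate format" means here, the sketch below writes three of the standard Kaldi data-directory files (`wav.scp`, `text`, `utt2spk`). It is not the contents of local/vm1_data_prep.py: the function name and the record keys (`utt_id`, `speaker`, `wav`, `transcript`) are hypothetical, and extracting those records from the VM1 JSON files is left out because their layout is not described here.

```python
from pathlib import Path


def write_kaldi_data_dir(utterances, out_dir):
    """Write wav.scp, text and utt2spk for a list of utterance records.

    Each record is a dict with the hypothetical keys
    'utt_id', 'speaker', 'wav' and 'transcript'.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Kaldi expects every file to be sorted by utterance ID.
    rows = sorted(utterances, key=lambda u: u["utt_id"])

    with open(out / "wav.scp", "w", encoding="utf-8") as wav_scp, \
         open(out / "text", "w", encoding="utf-8") as text, \
         open(out / "utt2spk", "w", encoding="utf-8") as utt2spk:
        for u in rows:
            wav_scp.write(f"{u['utt_id']} {u['wav']}\n")        # <utt-id> <audio path>
            text.write(f"{u['utt_id']} {u['transcript']}\n")    # <utt-id> <transcript>
            utt2spk.write(f"{u['utt_id']} {u['speaker']}\n")    # <utt-id> <speaker-id>
```

In a Kaldi recipe, `spk2utt` is normally derived from `utt2spk` with utils/utt2spk_to_spk2utt.pl, and the result can be checked with utils/validate_data_dir.sh. Prefixing each utterance ID with its speaker ID keeps the sort order consistent across files.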