The German corpus is located at /projects/speech/corpus/VM1.
This directory contains numerous subdirectories for the various conversations in the corpus. Note that some of these are in Japanese, English, or a mix of English, German, and Denglisch; we only want data from directories whose names start with k, l, m, n, g, or w (not all of which exist in the version of the corpus we have). There is documentation in the zipped file VM1/CLARINDocu.zip.
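The directory filter described above can be sketched as follows. This is only a minimal illustration, not the code used in the recipe: the function names and the sample directory names in the usage note are hypothetical, and it assumes the prefix letters are matched case-sensitively.

```python
import os

# Prefixes of the conversation directories we want to keep (from the notes above).
ALLOWED_PREFIXES = ("k", "l", "m", "n", "g", "w")


def select_conversation_dirs(names):
    """Return, sorted, the names that start with one of the allowed letters."""
    return sorted(n for n in names if n.startswith(ALLOWED_PREFIXES))


def list_wanted_dirs(corpus_root):
    """List the wanted subdirectories of corpus_root (e.g. the VM1 directory)."""
    subdirs = [entry.name for entry in os.scandir(corpus_root) if entry.is_dir()]
    return select_conversation_dirs(subdirs)
```

For example, select_conversation_dirs(["k001", "j501", "e042", "w010"]) keeps only "k001" and "w010"; these names are made up for illustration and may not match the real directory names.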
The working directory for the VM1 recipe that we’re building is kaldi-master/egs/vm1. Here you will find our version of run.sh and the various scripts it uses, most of which are in vm1/local or in a pre-existing directory elsewhere in kaldi-master.
The script local/vm1_data_prep.py generates all of the necessary files in the appropriate format for Kaldi to process and places them in data/. Most of the data can be pulled either from the JSON files in the VM1 directory or from the documentation and metadata in VM1/CLARINDocu.zip, which contains a lexicon, speaker information, etc.
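As a rough illustration of the kind of output involved, here is a sketch of how the utt2spk and spk2utt files that Kaldi expects in a data directory could be produced from an utterance-to-speaker mapping. It is not taken from local/vm1_data_prep.py: the function names and the utterance and speaker IDs are hypothetical, and how the mapping is actually extracted from the JSON files is left out. The sketch assumes the usual Kaldi conventions that these files are sorted and that each utterance ID starts with its speaker ID.

```python
from collections import defaultdict


def utt2spk_lines(utt2spk):
    """Format 'utterance speaker' lines, sorted by utterance ID."""
    return [f"{utt} {spk}" for utt, spk in sorted(utt2spk.items())]


def spk2utt_lines(utt2spk):
    """Invert the mapping into 'speaker utt1 utt2 ...' lines, sorted by speaker."""
    by_speaker = defaultdict(list)
    for utt, spk in utt2spk.items():
        by_speaker[spk].append(utt)
    return [f"{spk} {' '.join(sorted(utts))}" for spk, utts in sorted(by_speaker.items())]
```

Writing each list to data/ with one line per entry would give files in the layout Kaldi reads; Kaldi also ships utils/utt2spk_to_spk2utt.pl, which does the inversion on the command line.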
These are some comments on data/local/dict/lexicon.txt.
1. There are entries, like the one below, whose phonetic transcription looks too short.
Abendstunden Q ‘a
2. Consider re-coding the stress marks with numbers, e.g. Q a1 p g @ l a2 U f @ n. I’m assuming the double quote is secondary stress.
3. Possibly some vowel sequences should be diphthongs, e.g. Q a1 p g @ l aU2 f @ n. Look for documentation or papers about it.
4. Is there documentation on the Q phone? Is it a glottal stop?
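Comments 1 and 2 could be explored with something like the sketch below. It rests on several assumptions that still need checking against the documentation: that the stress mark is written as a prefix on the vowel token, that the single quote is primary stress and the double quote secondary, and that curly-quote variants in the file mean the same as the ASCII ones. The function names and the min_ratio threshold are hypothetical.

```python
# Assumed stress markers: ASCII quotes plus their curly variants.
PRIMARY = ("'", "\u2018", "\u2019")
SECONDARY = ('"', "\u201c", "\u201d")


def recode_stress(phones):
    """Move a leading stress mark to a numeric suffix: 'a -> a1, "a -> a2."""
    out = []
    for phone in phones.split():
        if phone.startswith(PRIMARY):
            out.append(phone[1:] + "1")
        elif phone.startswith(SECONDARY):
            out.append(phone[1:] + "2")
        else:
            out.append(phone)
    return " ".join(out)


def looks_too_short(word, phones, min_ratio=0.5):
    """Flag entries with far fewer phones than letters (crude heuristic)."""
    return len(phones.split()) < min_ratio * len(word)
```

With these assumptions, the Abendstunden entry above is flagged by looks_too_short (2 phones for 12 letters), and its transcription recodes to Q a1. Merging vowel sequences into diphthongs (comment 3) is deliberately left out, since it depends on what the documentation says.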