German data update

The German corpus is located at /projects/speech/corpus/VM1. This directory contains numerous subdirectories for the various conversations in the corpus. Note that some of these are in Japanese, English, or a mix of English, German, and Denglisch. We only want data from directories whose names start with k, l, m, n, g, or w (not all of which exist in the version of the corpus we have). Documentation is in the zipped file VM1/CLARINDocu.zip.
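As a rough sketch of the directory filter described above, something like the following could pick out the conversation directories we want. The function name select_conversation_dirs is made up for illustration, and the sketch assumes the prefix letters are lowercase and that the conversations sit directly under the corpus root.

```python
from pathlib import Path

# Prefix letters taken from the note above; lowercase is an assumption.
WANTED_PREFIXES = ("k", "l", "m", "n", "g", "w")


def select_conversation_dirs(corpus_root):
    """Return the sorted subdirectories of corpus_root whose names start
    with one of the wanted prefix letters. Plain files are skipped."""
    root = Path(corpus_root)
    return sorted(
        p for p in root.iterdir()
        if p.is_dir() and p.name.startswith(WANTED_PREFIXES)
    )
```

Calling select_conversation_dirs("/projects/speech/corpus/VM1") would then return only the directories to process. Prefixes that have no matching directories in our version of the corpus simply contribute nothing.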

The working directory for the VM1 recipe that we’re building is kaldi-master/egs/vm1. Here you will find our version of run.sh and the scripts it uses, most of which are in vm1/local or in pre-existing directories elsewhere in kaldi-master.

The script local/vm1_data_prep.py generates all of the necessary files in the appropriate format for Kaldi to process and places them in data/. Most of the data can be pulled either from the JSON files in the VM1 directory or from the documentation and metadata in VM1/CLARINDocu.zip, which contains a lexicon, speaker information, etc.
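For readers unfamiliar with what Kaldi expects, here is a minimal sketch of writing the three core files of a Kaldi data directory (wav.scp, text, utt2spk). It is not the code in local/vm1_data_prep.py. The function name write_kaldi_data_dir and the record layout (utterance ID, speaker ID, audio path, transcript) are assumptions for illustration, and the sketch assumes one audio file per utterance, so no segments file is needed.

```python
import os


def write_kaldi_data_dir(records, out_dir):
    """Write wav.scp, text and utt2spk for a Kaldi data directory.

    records: iterable of (utt_id, speaker_id, wav_path, transcript)
    tuples (a hypothetical layout, not the VM1 JSON schema).
    """
    os.makedirs(out_dir, exist_ok=True)
    # Kaldi expects these files sorted by utterance ID.
    records = sorted(records, key=lambda r: r[0])
    paths = {n: os.path.join(out_dir, n) for n in ("wav.scp", "text", "utt2spk")}
    with open(paths["wav.scp"], "w", encoding="utf-8") as wav_scp, \
         open(paths["text"], "w", encoding="utf-8") as text, \
         open(paths["utt2spk"], "w", encoding="utf-8") as utt2spk:
        for utt_id, speaker_id, wav_path, transcript in records:
            wav_scp.write(f"{utt_id} {wav_path}\n")
            text.write(f"{utt_id} {transcript}\n")
            utt2spk.write(f"{utt_id} {speaker_id}\n")
```

The remaining spk2utt file can be derived with Kaldi's utils/utt2spk_to_spk2utt.pl, and utils/validate_data_dir.sh will check the result. Prefixing each utterance ID with its speaker ID keeps the sort orders of the files consistent.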

One thought on “German data update”

  1. Mats Rooth

    These are some comments on data/local/dict/lexicon.txt.
    1. Some entries, like the one below, have phonetic transcriptions that look too short.
    Abendstunden Q 'a
    2. Consider re-coding the stress marks with numbers, e.g. Q a1 p g @ l a2 U f @ n. I’m assuming the double quote is secondary stress.
    3. Possibly some vowel sequences should be diphthongs, e.g. Q a1 p g @ l aU2 f @ n. Look for documentation or papers about it.
    4. Is there documentation on the Q phone? Is it a glottal stop?
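Points 1 and 2 above could be checked and applied with a small script along these lines. This is only a sketch under assumptions: that each lexicon line is a word followed by space-separated phones, that stress is marked by an ASCII apostrophe (primary) or double quote (secondary) prefixed to the vowel, and that the double quote really is secondary stress. The function names and the 0.5 phones-per-letter threshold are made up for illustration.

```python
# Assumed mapping: apostrophe = primary stress, double quote = secondary.
STRESS_DIGITS = {"'": "1", '"': "2"}


def recode_stress(phones):
    """Move a leading stress mark to a trailing digit, e.g. 'a -> a1."""
    recoded = []
    for phone in phones:
        if len(phone) > 1 and phone[0] in STRESS_DIGITS:
            recoded.append(phone[1:] + STRESS_DIGITS[phone[0]])
        else:
            recoded.append(phone)
    return recoded


def recode_lexicon_line(line):
    """Recode one non-empty 'word phone phone ...' lexicon line."""
    word, *phones = line.split()
    return " ".join([word] + recode_stress(phones))


def looks_truncated(word, phones, min_ratio=0.5):
    """Flag entries with suspiciously few phones for the word length.
    The threshold is an arbitrary starting point, not a tuned value."""
    return len(phones) < min_ratio * len(word)
```

With these assumptions, the entry for Abendstunden quoted above would be flagged by looks_truncated, since it has 2 phones for a 12-letter word. Merging vowel sequences into diphthongs (point 3) is left out because it depends on the phone set documentation.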
