German data update

The German corpus is located at /projects/speech/corpus/VM1. This directory contains numerous subdirectories for the various conversations in the corpus. Note that some of these conversations are in Japanese, English, or a mix of English, German, and Denglisch; we only want data from directories whose names start with k, l, m, n, g, or w (not all of which exist in the version of the corpus we have). Documentation is in the zip archive VM1/CLARINDocu.zip.
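As a rough illustration of the directory filter described above, here is a minimal sketch. The function name select_conversation_dirs is hypothetical, and the sketch assumes the prefix match is case-sensitive and applies to the directory name itself; neither detail is confirmed by the corpus documentation.

```python
from pathlib import Path

# Prefixes named in the text above. Case-sensitive matching is an assumption.
WANTED_PREFIXES = ("k", "l", "m", "n", "g", "w")


def select_conversation_dirs(corpus_root):
    """Return the sorted subdirectories of corpus_root whose names
    start with one of the wanted prefixes (hypothetical helper)."""
    root = Path(corpus_root)
    return sorted(
        p for p in root.iterdir()
        if p.is_dir() and p.name.startswith(WANTED_PREFIXES)
    )
```

Calling select_conversation_dirs("/projects/speech/corpus/VM1") would then return only the directories we want to process; prefixes with no matching directory simply contribute nothing.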

The working directory for the VM1 recipe that we’re building is kaldi-master/egs/vm1. Here you will find our version of run.sh and the various scripts it uses, most of which are in vm1/local or in an existing directory elsewhere in kaldi-master.

The script local/vm1_data_prep.py generates all of the necessary files in the appropriate format for Kaldi to process and places them in data/. Most of the data can be pulled either from the JSON files in the VM1 directory or from the documentation and metadata in VM1/CLARINDocu.zip, which contains a lexicon, speaker information, etc.
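To make the target format concrete, here is a minimal sketch of writing three of the standard Kaldi data files (wav.scp, text, utt2spk). It is not the actual vm1_data_prep.py: the function name write_kaldi_data_dir and the record keys utt_id, speaker, wav_path, and transcript are hypothetical, and the sketch assumes one audio file per utterance, so that the utterance ID can double as the recording ID in wav.scp.

```python
from pathlib import Path


def write_kaldi_data_dir(records, out_dir):
    """Write wav.scp, text, and utt2spk for a list of utterance records.

    Each record is a dict with the hypothetical keys
    'utt_id', 'speaker', 'wav_path', and 'transcript'.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Kaldi expects these files to be sorted by utterance ID.
    rows = sorted(records, key=lambda r: r["utt_id"])

    with open(out / "wav.scp", "w", encoding="utf-8") as wav_scp, \
         open(out / "text", "w", encoding="utf-8") as text, \
         open(out / "utt2spk", "w", encoding="utf-8") as utt2spk:
        for r in rows:
            wav_scp.write(f"{r['utt_id']} {r['wav_path']}\n")
            text.write(f"{r['utt_id']} {r['transcript']}\n")
            utt2spk.write(f"{r['utt_id']} {r['speaker']}\n")
```

The remaining files, such as spk2utt, can be derived from utt2spk with the utilities that ship with Kaldi.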

One thought on “German data update”

Mats Rooth September 28, 2016 at 18:42

These are some comments on data/local/dict/lexicon.txt.
1. There are entries like the one below that look too short in the phonetics.
Abendstunden Q 'a
2. Consider re-coding the stress marks with numbers, e.g. Q a1 p g @ l a2 U f @ n. I’m assuming the double quote is secondary stress.
3. Possibly some vowel sequences should be diphthongs, e.g. Q a1 p g @ l aU2 f @ n. Look for documentation or papers about it.
4. Is there documentation on the Q phone? Is it a glottal stop?
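The re-coding suggested in point 2 could be sketched roughly as follows. This assumes that the lexicon marks stress with a quote character prefixed to the stressed phone, that a single quote means primary stress and a double quote means secondary stress, and that typographic quotes may appear in place of straight ones. The function name recode_stress is hypothetical, and none of these assumptions has been checked against the corpus documentation.

```python
# Assumed stress marks: single quote = primary, double quote = secondary.
# Typographic variants are included in case the quotes were converted.
PRIMARY = ("'", "\u2018", "\u2019")
SECONDARY = ('"', "\u201c", "\u201d")


def recode_stress(phones):
    """Move a leading stress mark to a trailing digit, e.g. 'a -> a1."""
    out = []
    for p in phones:
        if len(p) > 1 and p.startswith(PRIMARY):
            out.append(p[1:] + "1")
        elif len(p) > 1 and p.startswith(SECONDARY):
            out.append(p[1:] + "2")
        else:
            out.append(p)  # unstressed phone, left unchanged
    return out
```

Merging vowel sequences into diphthongs, as in point 3, would need a separate pass once the phone inventory is confirmed.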


Finite State Methods

LING 4485/6485 Fall 2016
