German data update

The German corpus is located at /projects/speech/corpus/VM1. This directory contains numerous subdirectories for the various conversations in the corpus. Note that some of these are in Japanese, English, or a mix of English, German, and Denglisch – we only want data from directories starting with k, l, m, n, g, w (not all of which exist in the version of the corpus we have). There is documentation in the zipped file VM1/CLARINDocu.zip.
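If it helps, a glob along these lines lists just the conversation directories we care about (assuming the directory names are lower case, which is worth double-checking):

    # list only the conversation directories whose names start with k, l, m, n, g, or w
    ls -d /projects/speech/corpus/VM1/[klmngw]*/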

The working directory for the VM1 recipe that we’re building is kaldi-master/egs/vm1. Here you will find our version of run.sh and the various scripts it uses, most of which are in vm1/local or in pre-existing directories elsewhere in kaldi-master.

The script local/vm1_data_prep.py generates all of the necessary files in the appropriate format for Kaldi to process and places them in data/. Most of the data can be pulled either from the JSON files in the VM1 directory or from the documentation and metadata in VM1/CLARINDocu.zip, which contains a lexicon, speaker information, etc.
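For reference, a Kaldi data directory needs at least the files sketched below. The data/train name and the utility calls are my assumptions about how we will organize things; the formats themselves are standard Kaldi.

    # Expected formats (one entry per line; the IDs are whatever vm1_data_prep.py assigns):
    #   data/train/wav.scp  : <utt-id> <path to the audio, or a command that produces it>
    #   data/train/text     : <utt-id> <transcript>
    #   data/train/utt2spk  : <utt-id> <speaker-id>
    #
    # Once those exist, the standard utilities generate spk2utt and catch formatting problems:
    utils/utt2spk_to_spk2utt.pl data/train/utt2spk > data/train/spk2utt
    utils/validate_data_dir.sh --no-feats data/train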

Korean data update

The LDC Korean data is all on the server with files extracted, except for the corpus from 1996. Paths have been added to the earlier post, e.g. /projects/ldc/ldc-standard-license/2003/LDC2003S0. At the outset we are concentrating on the data from 2003.

utils/prepare_lang.sh

Before I was able to get this command (Step 5 from Mats’ run.sh) to run, we needed some files that we weren’t aware of. In total, you will need the following files before this command will work. All paths are relative to the directory containing run.sh.

  • data/local/dict/lexicon.txt
  • data/local/dict/nonsilence_phones.txt
  • data/local/dict/optional_silence.txt
  • data/local/dict/silence_phones.txt
  • path.sh

prepare_lang.sh will complain if any one of these is missing. The complaint for path.sh is a little less clear, since not having that file seems to show up as other, less obvious errors.
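Once the files above exist, the call itself has the usual Kaldi shape; the arguments are the dict directory, the word to substitute for out-of-vocabulary items, a scratch directory, and the output lang directory. The <UNK> token below is only a placeholder:

    utils/prepare_lang.sh data/local/dict "<UNK>" data/local/lang data/lang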

lexicon.txt contains one lexical entry per line, consisting of a word, a space, and then the phones in that word, separated by spaces.

nonsilence_phones.txt contains one phone symbol per line.

optional_silence.txt contains the symbol for an optional silence. This is just sil, but the file still needs to exist. Make sure that there is a newline at the end.

silence_phones.txt can be identical to optional_silence.txt.
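To make the formats concrete, here is a made-up miniature version of the four files (the German words and SAMPA-style phones are invented for illustration; the real entries come from the VM1 lexicon). I believe the OOV word passed to prepare_lang.sh also needs an entry in lexicon.txt; mapping it to sil appears to be enough.

    data/local/dict/lexicon.txt
        <UNK> sil
        ja j a:
        nein n aI n

    data/local/dict/nonsilence_phones.txt
        a:
        aI
        j
        n

    data/local/dict/silence_phones.txt
        sil

    data/local/dict/optional_silence.txt
        sil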

path.sh can be copied from RM, though you may need to edit the KALDI_ROOT variable, since it is set as a relative path.
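A sketch of what ours might look like (exact contents vary with the Kaldi version, so treat this as a template rather than the definitive file). RM lives at egs/rm/s5 and sets KALDI_ROOT=`pwd`/../../..; since egs/vm1 sits one level shallower, the relative path changes:

    export KALDI_ROOT=`pwd`/../..
    export PATH=$PWD/utils/:$KALDI_ROOT/tools/openfst/bin:$KALDI_ROOT/src/bin:$KALDI_ROOT/src/fstbin:$KALDI_ROOT/src/gmmbin:$KALDI_ROOT/src/featbin:$PWD:$PATH
    export LC_ALL=C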

The German versions of all of these can be seen in kaldi-master/egs/vm1.

Kaldi-alignments-matlab

https://github.com/MatsRooth/Kaldi-alignments-matlab is a Matlab program that displays Kaldi alignments like the ones we looked at in class yesterday. Install it from GitHub. You also need to install the voicebox toolbox and put it on the Matlab path. Open display_switch.m in the editor, run it from the GUI (with the default argument that is supplied) and accept the suggestion to switch the Matlab directory. Data from the Librispeech corpus are read from sub-directories that are included with the distribution.
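The GitHub part is just a clone (voicebox is installed separately and added to the Matlab path in the usual way):

    git clone https://github.com/MatsRooth/Kaldi-alignments-matlab.git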

The above worked for me on a Mac different from the one where I developed it. Try it; it’s a good way of checking qualitatively how well the alignment is working. We want to include code in the project directories on Kay that generates data files like the ones in the matlab-mat and matlab-dat sub-directories.

Language data roundup

We are almost set with initial language data for German, Korean, and Spanish.

  1. For German, it’s Verbmobil I. The data on Kay include audio, orthographic transcripts, and phonetic transcripts. It seems to be all we need. See /projects/speech/corpus/VM1.
  2. For Korean, Bruce has installed
    LDC2003S07 Korean Telephone Conversations Complete Set on Kay.
    It needs to be verified, but apparently it includes speech, orthographic transcripts, and a corresponding lexicon.
    Correction: the combined corpus is not there yet, but the individual parts are. See these locations on Kay (lexicon, speech, transcripts):

    /projects/ldc/ldc-standard-license/2003/LDC2003L02
    /projects/ldc/ldc-standard-license/2003/LDC2003S03
    /projects/ldc/ldc-standard-license/2003/LDC2003T08

  3. For Spanish, we are buying the CALLHOME lexicon, speech, and transcripts. Fisher Spanish is already installed on Kay. In the egs directory, there is already a script for Fisher Spanish + CALLHOME. Rather than pretending it’s not done already, let’s focus on building on it to add more data, improve the morphology, and so on.

Naomi Enzinna has joined the Spanish group. Prof. Jiwon Yun from Stony Brook will collaborate on Korean. She already has a good initial solution for grapheme-to-phoneme conversion, which could make it possible to include the Korean Broadcast News speech and transcripts, for which we still need a phonetic lexicon.