This is one of the alignments that the Korean team reported on Wednesday, Oct. 26. Spot checks suggest the process worked. Go Korea!
Verbmobil I alignment
This is one of the VM1 alignments that the German team reported on Friday, Oct. 21. Keep it up!
German data update
The German corpus is located at /projects/speech/corpus/VM1.
This directory contains a subdirectory for each conversation in the corpus. Note that some of these are in Japanese, English, or a mix of English, German, and Denglisch; we only want data from directories whose names start with k, l, m, n, g, or w (not all of which exist in the version of the corpus we have). There is documentation in the zipped file VM1/CLARINDocu.zip.
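The directory filter described above could be sketched as follows. This is only an illustration: the function name select_conversation_dirs is hypothetical, and it assumes the prefixes are lowercase and that matching on the first letter of the directory name is enough.

```python
from pathlib import Path

# Prefixes named in the post; directories with any other prefix are skipped.
WANTED_PREFIXES = ("k", "l", "m", "n", "g", "w")


def select_conversation_dirs(corpus_root):
    """Return the sorted names of subdirectories that start with a wanted prefix.

    Hypothetical helper: assumes lowercase prefixes and ignores plain files.
    """
    root = Path(corpus_root)
    return sorted(
        entry.name
        for entry in root.iterdir()
        if entry.is_dir() and entry.name.startswith(WANTED_PREFIXES)
    )
```

Calling select_conversation_dirs("/projects/speech/corpus/VM1") would then list only the conversation directories we intend to use.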
The working directory for the VM1 recipe that we’re building is kaldi-master/egs/vm1. Here you will find our version of run.sh and the various scripts it uses, most of which are in vm1/local or in a pre-existing directory somewhere in kaldi-master.
The script local/vm1_data_prep.py generates all of the necessary files in the appropriate format for Kaldi to process and places them in data/. Most of the data can be pulled either from the JSON files in the VM1 directory or from the documentation and metadata in VM1/CLARINDocu.zip, which contains a lexicon, speaker information, etc.
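For orientation, here is a minimal sketch of writing the three core files of a Kaldi data directory (wav.scp, text, utt2spk). It is not the code in local/vm1_data_prep.py. The record layout (utterance id, speaker id, wav path, transcript) and the function name are assumptions, and it assumes one audio file per utterance, so no segments file is needed.

```python
import os


def write_kaldi_data_dir(records, out_dir):
    """Write wav.scp, text and utt2spk for a list of utterance records.

    records: iterable of (utt_id, spk_id, wav_path, transcript) tuples.
    This tuple layout is hypothetical, chosen only for the sketch.
    """
    os.makedirs(out_dir, exist_ok=True)
    # Kaldi expects each file to be sorted by its first field.
    rows = sorted(records, key=lambda r: r[0])
    tables = {
        "wav.scp": [(utt, wav) for utt, spk, wav, words in rows],
        "text": [(utt, words) for utt, spk, wav, words in rows],
        "utt2spk": [(utt, spk) for utt, spk, wav, words in rows],
    }
    for name, pairs in tables.items():
        with open(os.path.join(out_dir, name), "w", encoding="utf-8") as f:
            for key, value in pairs:
                f.write(f"{key} {value}\n")
```

Prefixing each utterance id with its speaker id, as in the usual Kaldi convention, keeps the utterance and speaker sort orders consistent.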
Spanish data update
Corpora for the Spanish Fisher-Callhome recipe are online, for instance /projects/ldc/ldc-standard-license/1996/LDC96T17/callhome_spanish_trans_970711. Paths have been added to the earlier post.
Korean data update
All of the LDC Korean data, except for the corpus from 1996, is on the server with files extracted. Paths have been added to the earlier post, e.g. /projects/ldc/ldc-standard-license/2003/LDC2003S0. At the outset we are concentrating on the data from 2003.
utils/prepare_lang.sh
Before I was able to get this command (Step 5 from Mats’ run.sh) to work, we needed some files that we weren’t aware of. In total, you will need the following files before this command will work. All paths are relative to run.sh.
- data/local/dict/lexicon.txt
- data/local/dict/nonsilence_phones.txt
- data/local/dict/optional_silence.txt
- data/local/dict/silence_phones.txt
- path.sh
prepare_lang.sh will complain if any one of these is missing. The complaint for path.sh is a little less clear, since not having this file seems to result in other errors.
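Since the error for a missing path.sh is hard to read, a quick check before calling prepare_lang.sh can help. A minimal sketch, assuming it is run with the recipe directory (the one containing run.sh) as its argument; the function name missing_inputs is made up for this example.

```python
import os

# The five files listed above, relative to the directory containing run.sh.
REQUIRED_FILES = [
    "data/local/dict/lexicon.txt",
    "data/local/dict/nonsilence_phones.txt",
    "data/local/dict/optional_silence.txt",
    "data/local/dict/silence_phones.txt",
    "path.sh",
]


def missing_inputs(recipe_dir):
    """Return the required files that do not exist under recipe_dir."""
    return [
        rel for rel in REQUIRED_FILES
        if not os.path.isfile(os.path.join(recipe_dir, rel))
    ]
```

An empty list from missing_inputs(".") means all five files are in place.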
lexicon.txt contains one lexical entry per line, consisting of a word, a space, and then the phones in that word, separated by spaces.
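To make the line format concrete, here is a small sketch that parses lexicon lines into a word and a list of phones. The function names are hypothetical, and the example entry in the usage note is invented rather than taken from the VM1 lexicon.

```python
def parse_lexicon_line(line):
    """Split one lexicon line into (word, [phone, ...])."""
    fields = line.split()
    if len(fields) < 2:
        raise ValueError(f"expected a word and at least one phone: {line!r}")
    return fields[0], fields[1:]


def read_lexicon(lines):
    """Parse an iterable of lexicon lines, skipping blank ones."""
    return [parse_lexicon_line(line) for line in lines if line.strip()]
```

For example, parse_lexicon_line("Haus h aU s") gives the word "Haus" with the phones h, aU, and s.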
nonsilence_phones.txt contains one phone symbol per line.
optional_silence.txt contains the symbol for an optional silence. This is just sil, but the file still needs to exist. Make sure that there is a newline at the end.
silence_phones.txt can be identical to optional_silence.txt.
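The three phone files can be derived from the lexicon. The following sketch assumes the lexicon is already available as (word, phones) pairs and that sil is the only silence symbol; write_phone_files is a hypothetical name, not one of the scripts in the recipe.

```python
import os


def write_phone_files(entries, dict_dir, silence="sil"):
    """Write nonsilence_phones.txt, optional_silence.txt and silence_phones.txt.

    entries: iterable of (word, [phone, ...]) pairs from the lexicon.
    Returns the sorted list of nonsilence phones that was written.
    """
    os.makedirs(dict_dir, exist_ok=True)
    # Every phone used in the lexicon, except the silence symbol itself.
    phones = sorted({p for _, word_phones in entries for p in word_phones} - {silence})
    with open(os.path.join(dict_dir, "nonsilence_phones.txt"), "w", encoding="utf-8") as f:
        for phone in phones:
            f.write(phone + "\n")
    # Both silence files hold the single symbol, each ending in a newline.
    for name in ("optional_silence.txt", "silence_phones.txt"):
        with open(os.path.join(dict_dir, name), "w", encoding="utf-8") as f:
            f.write(silence + "\n")
    return phones
```

Writing each symbol followed by "\n" guarantees the trailing newline mentioned above.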
path.sh can be copied from RM, though you may need to edit the KALDI_ROOT variable, since this is a relative path.
The German versions of all of these can be seen in kaldi-master/egs/vm1.
Kaldi-alignments-matlab
https://github.com/MatsRooth/Kaldi-alignments-matlab is a Matlab program that displays Kaldi alignments like the ones we looked at in class yesterday. Install it from GitHub. You also need to install the voicebox toolbox and put it on the Matlab path. Open display_switch.m in the editor, run it from the GUI (with the default argument that is supplied) and accept the suggestion to switch the Matlab directory. Data from the Librispeech corpus are read from sub-directories that are included with the distribution.
The above worked for me on a Mac different from the one where I developed it. Try it; it’s a good way of checking qualitatively how well the alignment is working. We want to include code in the project directories on Kay that generates data files like the ones in the matlab-mat and matlab-dat sub-directories.
LDC licenses
Please print the LDC standard license from here.
At the bottom, write your name, netid, and signature. Then scan it and upload it to Confluence. Finally, ask Stacy to add you to the en-cl-ldc group.
Language data roundup
We are almost set with initial language data for German, Korean, and Spanish.
- For German, it’s Verbmobil I. The data on Kay include audio, orthographic transcripts, and phonetic transcripts. It seems to be all we need. See /projects/speech/corpus/VM1.
- For Korean, Bruce has installed LDC2003S07 Korean Telephone Conversations Complete Set on Kay. It needs to be verified, but apparently it includes speech, orthographic transcripts, and the corresponding lexicon.
Correction: the combined corpus is not there yet, but the individual parts are. See these locations on kay (lexicon, speech, transcript).
/projects/ldc/ldc-standard-license/2003/LDC2003L02
/projects/ldc/ldc-standard-license/2003/LDC2003S03
/projects/ldc/ldc-standard-license/2003/LDC2003T08
- For Spanish, we are buying the CALLHOME lexicon, speech, and transcript. Fisher Spanish is already installed on Kay. In the egs directory, there is already a script for Fisher Spanish+Callhome. Rather than pretending it’s not done already, let’s focus on building on it to add more data, improve the morphology, and so on.
Naomi Enzinna has joined the Spanish group. Prof. Jiwon Yun from Stony Brook will collaborate on Korean. She already has a good initial solution for grapheme-to-phoneme conversion, which could make it possible to include the Korean Broadcast News/Transcripts, where we need the phonetic lexicon.
German — Verbmobil I
We’ve obtained German data from this 1990s speech translation project, courtesy of the BAS CLARIN repository. Orthographic and phonetic transcripts are included. See /projects/speech/corpus/VM1.