Author Archives: Jacob Collard

German data update

The German corpus is located at /projects/speech/corpus/VM1. This directory contains numerous subdirectories for the various conversations in the corpus. Note that some of these are in Japanese, English, or a mix of English, German, and Denglisch – we only want data from directories starting with k, l, m, n, g, w (not all of which exist in the version of the corpus we have). There is documentation in the zipped file VM1/CLARINDocu.zip.
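The directory filter described above can be sketched in a few lines of Python. This is only an illustration, not the code in local/vm1_data_prep.py: the function names are hypothetical, and it assumes the prefixes are lowercase and that matching on the first letter of the directory name is enough.

```python
from pathlib import Path

# Prefixes named in the text; assumed to be lowercase in the corpus.
WANTED_PREFIXES = ("k", "l", "m", "n", "g", "w")


def select_conversation_dirs(names, prefixes=WANTED_PREFIXES):
    """Return the sorted names that start with one of the wanted prefixes."""
    return sorted(n for n in names if n.startswith(prefixes))


def list_corpus_dirs(corpus_root="/projects/speech/corpus/VM1"):
    """List matching subdirectories of the corpus root (hypothetical helper)."""
    root = Path(corpus_root)
    return select_conversation_dirs(p.name for p in root.iterdir() if p.is_dir())
```

Calling list_corpus_dirs() on the machine that holds the corpus should return only the conversation directories we want; prefixes that do not exist in our version of the corpus simply contribute nothing.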

The working directory for the VM1 recipe that we’re building is in kaldi-master/egs/vm1. Here you will find our version of run.sh and the various scripts it uses, most of which are in vm1/local or in a pre-existing directory in kaldi-master somewhere.

The script local/vm1_data_prep.py generates all of the necessary files in appropriate format for Kaldi to process and places them in data/. Most of the data can be pulled either from the JSON files in the VM1 directory, or from the documentation and metadata in VM1/CLARINDocu.zip, which contains a lexicon, speaker information, etc.

utils/prepare_lang.sh

Before I was able to get this command (Step 5 from Mats’ run.sh) to work, we needed some files that we weren’t aware of. In total, you will need the following files before this command will work. All paths are relative to run.sh.

  • data/local/dict/lexicon.txt
  • data/local/dict/nonsilence_phones.txt
  • data/local/dict/optional_silence.txt
  • data/local/dict/silence_phones.txt
  • path.sh

prepare_lang.sh will complain if any one of these is missing. The complaint for path.sh is a little less clear, since not having this file seems to result in other, unrelated-looking errors.
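Since the error for a missing path.sh is hard to read, it can help to check for the files up front. The following is a minimal sketch, assuming it is run from the directory that contains run.sh; the function name is hypothetical and it only tests that the files exist, not that their contents are valid.

```python
from pathlib import Path

# The files listed above, relative to run.sh.
REQUIRED = [
    "data/local/dict/lexicon.txt",
    "data/local/dict/nonsilence_phones.txt",
    "data/local/dict/optional_silence.txt",
    "data/local/dict/silence_phones.txt",
    "path.sh",
]


def missing_inputs(recipe_dir="."):
    """Return the required files that do not exist under recipe_dir."""
    base = Path(recipe_dir)
    return [rel for rel in REQUIRED if not (base / rel).is_file()]


if __name__ == "__main__":
    for rel in missing_inputs():
        print("missing:", rel)
```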

lexicon.txt contains a lexical entry on each line which consists of a word, a space, and then the phones in that word, separated by spaces.

nonsilence_phones.txt contains one phone symbol per line.

optional_silence.txt contains the symbol for an optional silence. This is just sil, but the file still needs to exist. Make sure that there is a newline at the end.

silence_phones.txt can be identical to optional_silence.txt.
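The four dictionary files described above can be generated from a single in-memory lexicon. This is a sketch under assumptions, not the code in local/vm1_data_prep.py: the function names are hypothetical, each word is given exactly one pronunciation, and the phone symbols in the usage example are invented for illustration.

```python
from pathlib import Path


def nonsilence_phones(lexicon, silence="sil"):
    """Collect every phone used in the lexicon, except the silence symbol."""
    phones = {p for pron in lexicon.values() for p in pron}
    phones.discard(silence)
    return sorted(phones)


def write_dict_files(lexicon, out_dir, silence="sil"):
    """Write lexicon.txt, nonsilence_phones.txt and both silence files.

    lexicon maps each word to a list of phone symbols.
    Every file ends with a newline.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # One entry per line: the word, a space, then the space-separated phones.
    lex_lines = [f"{word} {' '.join(pron)}" for word, pron in sorted(lexicon.items())]
    (out / "lexicon.txt").write_text("\n".join(lex_lines) + "\n", encoding="utf-8")
    # One phone symbol per line.
    phone_lines = nonsilence_phones(lexicon, silence)
    (out / "nonsilence_phones.txt").write_text("\n".join(phone_lines) + "\n", encoding="utf-8")
    # Both silence files hold just the silence symbol.
    (out / "optional_silence.txt").write_text(silence + "\n", encoding="utf-8")
    (out / "silence_phones.txt").write_text(silence + "\n", encoding="utf-8")
```

For example, write_dict_files({"ja": ["j", "a:"], "nein": ["n", "aI", "n"]}, "data/local/dict") would create the four files in the location prepare_lang.sh expects.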

path.sh can be copied from RM, though you may need to edit the KALDI_ROOT variable, since this is a relative path.

The German versions of all of these can be seen in kaldi-master/egs/vm1.

Fstclosure

The fstclosure command implements Kleene closure (the Kleene star). That is, it converts a set of strings into the set of strings consisting of zero or more repetitions of strings in the input set. The command can also be used to emulate the “+” operator (one or more repetitions) with the --closure_plus flag.

For example, given a simple automaton representing the regular expression a(b|c):

========================
Initial automaton
fstprint --osymbols=words.txt --isymbols=words.txt L.fst
0 1 a a
1 2 b b
1 3 c c
2 4 <eps> <eps>
3 4 <eps> <eps>
4
========================

The language a(b|c)

Running fstclosure produces (a(b|c))*:

========================
Run fstclosure for Kleene Star
fstclosure L.fst Lstar.fst
fstprint --osymbols=words.txt --isymbols=words.txt Lstar.fst
5 0 <eps> <eps>
5
0 1 a a
1 2 b b
1 3 c c
2 4 <eps> <eps>
3 4 <eps> <eps>
4 0 <eps> <eps>
4
========================

The language (a(b|c))*

Running fstclosure --closure_plus instead produces (a(b|c))+:

========================
Run fstclosure for Kleene plus
fstclosure --closure_plus L.fst Lplus.fst
fstprint --osymbols=words.txt --isymbols=words.txt Lplus.fst
0 1 a a
1 2 b b
1 3 c c
2 4 <eps> <eps>
3 4 <eps> <eps>
4 0 <eps> <eps>
4
========================

The language (a(b|c))+
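The two constructions in the listings above can be reproduced in plain Python, without OpenFst. This is a sketch of the idea rather than OpenFst's implementation: the arc-tuple representation and the function names are assumptions made for illustration. Each final state gets an epsilon arc back to the old start state, and the star case additionally adds a new start state that is itself final, so that the empty string is accepted.

```python
EPS = "<eps>"


def closure(arcs, start, finals, plus=False):
    """Return (arcs, start, finals) for the closure of an automaton.

    arcs is a list of (src, dst, ilabel, olabel) tuples.
    """
    new_arcs = list(arcs)
    # Loop back: every final state gets an epsilon arc to the old start.
    for f in sorted(finals):
        new_arcs.append((f, start, EPS, EPS))
    new_finals = set(finals)
    new_start = start
    if not plus:
        # Star also accepts the empty string: add a fresh final start state.
        states = {start} | set(finals)
        for src, dst, _, _ in arcs:
            states.update((src, dst))
        new_start = max(states) + 1
        new_arcs.append((new_start, start, EPS, EPS))
        new_finals.add(new_start)
    return new_arcs, new_start, new_finals


def accepts(arcs, start, finals, symbols):
    """Check whether the automaton accepts the given input symbol sequence."""
    def eps_closure(states):
        seen, stack = set(states), list(states)
        while stack:
            s = stack.pop()
            for src, dst, ilabel, _ in arcs:
                if src == s and ilabel == EPS and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
        return seen

    current = eps_closure({start})
    for sym in symbols:
        step = {dst for src, dst, ilabel, _ in arcs if src in current and ilabel == sym}
        current = eps_closure(step)
    return bool(current & set(finals))


# The automaton for a(b|c) from the first listing.
L_ARCS = [(0, 1, "a", "a"), (1, 2, "b", "b"), (1, 3, "c", "c"),
          (2, 4, EPS, EPS), (3, 4, EPS, EPS)]
L_FST = (L_ARCS, 0, {4})
```

With these definitions, closure(*L_FST) yields start state 5 and the extra arcs 4 → 0 and 5 → 0, matching the Lstar.fst listing, while closure(*L_FST, plus=True) keeps start state 0 and adds only the 4 → 0 arc, matching Lplus.fst.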

Installing FOMA

Foma is a finite-state toolkit that is mostly compatible with XFST. In addition to an interface for building and displaying finite-state machines, Foma includes a C API for developers. The documentation for Foma is available online here.

Windows

A Windows binary is available here. Download and unzip the file, and place the executables foma.exe, flookup.exe, and cgflookup.exe somewhere in your PATH. You can inspect and modify your PATH variable from My Computer > Properties > Advanced > Environment Variables > System Variables. The executables can then be invoked from any directory via the command line.

Note: the system command in Foma will not work unless you have Cygwin installed.

Mac

An OSX binary is available here. Download and unpack the file using tar -xzvf (the f option must come last, directly before the archive name). Then, put the binaries foma and flookup in a directory in your PATH. An easy way to do this is to just copy them to /usr/bin/, via the command sudo cp ./foma /usr/bin/.; sudo cp ./flookup /usr/bin/.

Linux

Recent versions of Ubuntu, starting with 16.04 Xenial Xerus, include Foma in the repositories. It can easily be installed via sudo apt install foma-bin. For the C API, you will also need libfoma-dev.

If you are not on Ubuntu 16.04 or later (or another Linux distro which includes Foma), binaries are available online. Choose either the 64 or 32-bit release. Download and unpack using tar -xzvf (the f option must come last, directly before the archive name). Then, move the binaries foma, flookup, and cgflookup to a directory in your PATH (e.g., /usr/bin/).

Source

The source code is available here. Download and unpack, then run make; sudo make install. This will place the resulting binary in /usr/local/. If you edit the parser or the lexer, you may additionally need flex and bison, which are very popular lexer/parser tools for C. If you’re on Windows, this may be a much more complicated process, and you should use the binaries if possible.
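Whichever route you take, you can check that the binaries actually ended up on your PATH. The sketch below uses Python's standard shutil.which; the function name and the default list of tools are assumptions (add cgflookup if your download includes it). It only checks that the executables can be found, not that they run correctly.

```python
import shutil


def missing_tools(tools=("foma", "flookup"), which=shutil.which):
    """Return the tools that cannot be found on the current PATH."""
    return [t for t in tools if which(t) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("not on PATH:", ", ".join(missing))
    else:
        print("all Foma tools found")
```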