German data update

The German corpus is located at /projects/speech/corpus/VM1. This directory contains numerous subdirectories for the various conversations in the corpus. Note that some of these are in Japanese, English, or a mix of English, German, and Denglisch – we only want data from directories starting with k, l, m, n, g, w (not all of which exist in the version of the corpus we have). There is documentation in the zipped file VM1/CLARINDocu.zip.
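As a rough illustration of that filter, here is a minimal Python sketch that keeps only the sub-directories whose names start with one of the six letters above. The function names and the sample directory names are made up for the example, and the match is case-sensitive, which is an assumption about how the corpus directories are named.

```python
import os

# Directory-name prefixes we want, as listed above.
WANTED_PREFIXES = ("k", "l", "m", "n", "g", "w")


def select_conversation_dirs(names, prefixes=WANTED_PREFIXES):
    """Return the names that start with one of the wanted prefixes, sorted."""
    return sorted(name for name in names if name.startswith(prefixes))


def list_corpus_dirs(corpus_root):
    """List the wanted sub-directories of corpus_root (e.g. the VM1 directory)."""
    names = [entry.name for entry in os.scandir(corpus_root) if entry.is_dir()]
    return select_conversation_dirs(names)


# Hypothetical directory names, only to show the filter at work.
sample_names = ["k001", "j002", "e003", "m123", "w9", "docs"]
selected = select_conversation_dirs(sample_names)
print(selected)
```

Prefixes that do not occur in our copy of the corpus simply match nothing, so the same list can be used unchanged.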

The working directory for the VM1 recipe that we’re building is in kaldi-master/egs/vm1. Here you will find our version of run.sh and the various scripts it uses, most of which are in vm1/local or in a pre-existing directory in kaldi-master somewhere.

The script local/vm1_data_prep.py generates all of the necessary files in appropriate format for Kaldi to process and places them in data/. Most of the data can be pulled either from the JSON files in the VM1 directory, or from the documentation and metadata in VM1/CLARINDocu.zip, which contains a lexicon, speaker information, etc.
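For orientation, the sketch below writes the four core files of a Kaldi data directory (wav.scp, text, utt2spk, spk2utt) from a list of utterance records. It is not the code in local/vm1_data_prep.py: the record keys (utt_id, speaker, wav, text) and the sample values are hypothetical, and reading the VM1 JSON files and the lexicon is left out.

```python
import os
import tempfile
from collections import defaultdict


def write_kaldi_data_dir(utterances, out_dir):
    """Write wav.scp, text, utt2spk and spk2utt for the given records.

    Each record is a dict with the hypothetical keys
    'utt_id', 'speaker', 'wav' and 'text'.
    """
    os.makedirs(out_dir, exist_ok=True)
    # Kaldi's validation scripts expect these files sorted by ID.
    utts = sorted(utterances, key=lambda u: u["utt_id"])
    spk2utt = defaultdict(list)

    def open_out(name):
        return open(os.path.join(out_dir, name), "w", encoding="utf-8")

    with open_out("wav.scp") as wav_scp, open_out("text") as text, \
            open_out("utt2spk") as utt2spk:
        for u in utts:
            wav_scp.write(f"{u['utt_id']} {u['wav']}\n")
            text.write(f"{u['utt_id']} {u['text']}\n")
            utt2spk.write(f"{u['utt_id']} {u['speaker']}\n")
            spk2utt[u["speaker"]].append(u["utt_id"])

    with open_out("spk2utt") as out:
        for speaker in sorted(spk2utt):
            out.write(f"{speaker} {' '.join(spk2utt[speaker])}\n")


# Hypothetical records, only to show the resulting file layout.
records = [
    {"utt_id": "spkB_utt1", "speaker": "spkB", "wav": "/hypothetical/b1.wav", "text": "guten tag"},
    {"utt_id": "spkA_utt2", "speaker": "spkA", "wav": "/hypothetical/a2.wav", "text": "danke"},
    {"utt_id": "spkA_utt1", "speaker": "spkA", "wav": "/hypothetical/a1.wav", "text": "hallo"},
]
with tempfile.TemporaryDirectory() as tmp:
    write_kaldi_data_dir(records, tmp)
    with open(os.path.join(tmp, "utt2spk"), encoding="utf-8") as f:
        utt2spk_content = f.read()
    with open(os.path.join(tmp, "spk2utt"), encoding="utf-8") as f:
        spk2utt_content = f.read()
print(utt2spk_content)
print(spk2utt_content)
```

Prefixing each utterance ID with its speaker ID, as in the sample records, keeps the utterance order and the speaker order consistent after sorting.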

Korean data update

The LDC Korean data is all on the server, except for the corpus from 1996, with files extracted. Paths have been added to the earlier post, e.g. /projects/ldc/ldc-standard-license/2003/LDC2003S0. At the outset we are concentrating on the data from 2003.

Kaldi-alignments-matlab

https://github.com/MatsRooth/Kaldi-alignments-matlab is a Matlab program that displays Kaldi alignments like the ones we looked at in class yesterday. Install it from GitHub. You also need to install the voicebox toolbox and put it on the Matlab path. Open display_switch.m in the editor, run it from the GUI (with the default argument that is supplied) and accept the suggestion to switch the Matlab directory. Data from the Librispeech corpus are read from sub-directories that are included with the distribution.

The above worked for me on a Mac different from the one where I developed it. Try it; it’s a good way of checking qualitatively how well the alignment is working. We want to include code in the project directories on Kay that generates data files like the ones in the matlab-mat and matlab-dat sub-directories.

fsttopsort

fsttopsort topologically sorts its input if acyclic, modifying it. Otherwise, the input is unchanged. When sorted, all transitions are from lower to higher state IDs. (documentation source)

Cyclic Example:

fstprint --isymbols=words.txt --osymbols=words.txt L1.fst

0 1 <eps> <eps>
1 2 a a
1 3 f f
2 0 <eps> <eps>
3 2 d d

[Image: cyclicFST]

When we apply fsttopsort, we get the warning “Input FST is cyclic”:

fsttopsort L1.fst L1sorted.fst
WARNING: fsttopsort: Input FST is cyclic
fstprint --isymbols=words.txt --osymbols=words.txt L1sorted.fst

0 1 <eps> <eps>
1 2 a a
1 3 f f
2 0 <eps> <eps>
3 2 d d

[Image: cyclicFSTSorted]

As we expected, the output looks unchanged.

Acyclic Example:

fstprint --isymbols=words.txt --osymbols=words.txt L2.fst

1
0 2 <eps> <eps>
0 3 a a
2 5 f f
3 4 c c
4 6 b b
4 1 b b
5 3 d d
6 1 a a

[Image: acyclicFST]

Running the operation:

fsttopsort L2.fst L2sorted.fst
fstprint --isymbols=words.txt --osymbols=words.txt L2sorted.fst

6
0 1 <eps> <eps>
0 3 a a
1 2 f f
2 3 d d
3 4 c c
4 5 b b
4 6 b b
5 6 a a

[Image: acyclicFSTSorted]

We can see that the state IDs are now topologically sorted.
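To make the renumbering concrete, here is a minimal Python sketch of a topological sort over an arc list, applied to the arcs of L1.fst and L2.fst from the printouts above. It uses Kahn's algorithm, which is not necessarily how OpenFst implements fsttopsort, and it ignores start and final states. The function name topsort_arcs is made up for the example.

```python
from collections import deque


def topsort_arcs(arcs):
    """Renumber states so that every arc goes from a lower to a higher ID.

    arcs is a list of (src, dst, ilabel, olabel) tuples.
    Returns the renumbered arcs, or None if the graph is cyclic.
    """
    states = sorted({s for arc in arcs for s in arc[:2]})
    indegree = {s: 0 for s in states}
    successors = {s: [] for s in states}
    for src, dst, *_ in arcs:
        successors[src].append(dst)
        indegree[dst] += 1

    # Kahn's algorithm: repeatedly take a state with no unprocessed predecessors.
    ready = deque(s for s in states if indegree[s] == 0)
    order = []
    while ready:
        state = ready.popleft()
        order.append(state)
        for nxt in successors[state]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(states):
        return None  # some states sit on a cycle, so no ordering exists

    new_id = {old: new for new, old in enumerate(order)}
    return sorted((new_id[s], new_id[d], i, o) for s, d, i, o in arcs)


# Arcs of L1.fst (cyclic) and L2.fst (acyclic), copied from the printouts above.
l1_arcs = [(0, 1, "<eps>", "<eps>"), (1, 2, "a", "a"), (1, 3, "f", "f"),
           (2, 0, "<eps>", "<eps>"), (3, 2, "d", "d")]
l2_arcs = [(0, 2, "<eps>", "<eps>"), (0, 3, "a", "a"), (2, 5, "f", "f"),
           (3, 4, "c", "c"), (4, 6, "b", "b"), (4, 1, "b", "b"),
           (5, 3, "d", "d"), (6, 1, "a", "a")]

l1_sorted = topsort_arcs(l1_arcs)
l2_sorted = topsort_arcs(l2_arcs)
print(l1_sorted)
for arc in l2_sorted:
    print(*arc)
```

L2.fst has only one valid ordering, because its states form a single chain 0, 2, 5, 3, 4, 6, 1. Any correct topological sort therefore produces the same renumbered arcs as the fsttopsort output shown above.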

(I produced all the images by running ‘fstdraw [fst filename] | dot -Tpng >[png filename]’)

fstequal

The fstequal command exits with a return code indicating whether or not the two compiled fsts passed in are equivalent. The return code will be 0 if they are equal, and nonzero if they are not.

The easiest way to see the return code is to type `echo $?` immediately after executing fstequal. This command shows the return code of the last executed command, and thus will show you a 0 if the two fsts are equal. For example, let’s consider a perfectly straight fst:

jms852@kay:~$ fstprint straight.fst
0 0 0 0
0 1 1 0
1 2 0 0
2 3 0 0
3 3 0 0
3 3 1 0
jms852@kay:~$ fstequal straight.fst straight.fst
jms852@kay:~$ echo $?
0

Then let’s add another fst that forks off instead of continuing straight, and compare that one:

jms852@kay:~$ fstprint fork.fst
0 0 0 0
0 1 1 0
1 2 0 0
2 3 0 0
2 4 1 0
3 Infinity
4 Infinity
jms852@kay:~$ fstequal straight.fst fork.fst
jms852@kay:~$ echo $?
2

Two fsts can be equivalent even if they are not the exact same file, or if their edges are labelled differently. For example, if I create straight2.fst:

jms852@kay:~$ cat straight.txt
0 0 0 0
0 2 1 0
2 1 0 0
1 3 0 0
3 3 0 0
3 3 1 0

You can see that it is still straight, but the nodes have different labels.

However, fstequal still reports that they are equal:

jms852@kay:~$ fstequal straight.fst straight2.fst
jms852@kay:~$ echo $?
0
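The same return-code check can be scripted instead of typing `echo $?` by hand. The sketch below is a minimal Python wrapper, assuming fstequal is on the PATH. The helper names exit_code and fsts_equal are made up for the example, and the demo at the end runs the Python interpreter itself so that it works without OpenFst installed.

```python
import subprocess
import sys


def exit_code(argv):
    """Run a command and return its exit status (what `echo $?` would print)."""
    return subprocess.run(argv).returncode


def fsts_equal(fst_a, fst_b, tool="fstequal"):
    """Return True if `fstequal fst_a fst_b` exits with status 0."""
    return exit_code([tool, fst_a, fst_b]) == 0


# Demo without OpenFst: child processes that exit with status 0 and 2.
zero_demo = exit_code([sys.executable, "-c", "pass"])
nonzero_demo = exit_code([sys.executable, "-c", "import sys; sys.exit(2)"])
print(zero_demo, nonzero_demo)
```

With the files from this post, fsts_equal("straight.fst", "fork.fst") would correspond to the transcript above that ends in 2. Note that this wrapper treats every nonzero status as "not equal", so a missing file is not distinguished from a genuine mismatch.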

fstprune

According to the documentation:

This operation deletes states and arcs in the input FST that do not belong to a successful path whose weight is no more (w.r.t. the natural semiring order) than the threshold t ⊗-times the weight of the shortest path in the input FST.

Weights need to be commutative and have the path property. Both destructive and constructive implementations are available.

Example:

[Image: unprunedfst]

The fst:

0 0 0 0 0.699999988
0 1 0 0 0.299899995
0 2 0 0 9.99999975e-05
1
2 Infinity

Running

fstprune --weight=3 unpruned.fst pruned.fst

generates the new fst:

0 0 0 0 0.699999988
0 1 0 0 0.299899995
1

State 2 has been removed, along with the transition to it.
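The pruning rule from the documentation can be sketched in a few lines of Python. This is a minimal sketch, not OpenFst's implementation, and it makes several assumptions. It assumes the tropical semiring, where ⊗ is ordinary addition and smaller weights are better. It assumes non-negative weights, so that Dijkstra's algorithm applies. It drops the arc labels and keeps only (source, destination, weight). The function names are made up for the example.

```python
import heapq
import math


def shortest_distances(adjacency, initial):
    """Dijkstra from several sources; initial maps state -> starting cost."""
    dist = dict(initial)
    heap = [(cost, state) for state, cost in initial.items()]
    heapq.heapify(heap)
    while heap:
        cost, state = heapq.heappop(heap)
        if cost > dist[state]:
            continue  # stale heap entry
        for nxt, weight in adjacency.get(state, ()):
            if cost + weight < dist.get(nxt, math.inf):
                dist[nxt] = cost + weight
                heapq.heappush(heap, (cost + weight, nxt))
    return dist


def prune_tropical(arcs, start, finals, threshold):
    """Keep arcs lying on a successful path of cost <= best cost + threshold.

    arcs is a list of (src, dst, weight); finals maps final state -> final weight.
    """
    forward, backward = {}, {}
    for src, dst, weight in arcs:
        forward.setdefault(src, []).append((dst, weight))
        backward.setdefault(dst, []).append((src, weight))

    from_start = shortest_distances(forward, {start: 0.0})
    to_final = shortest_distances(backward, finals)  # includes final weights

    best = to_final.get(start, math.inf)  # cost of the shortest successful path
    if math.isinf(best):
        return []  # no successful path at all
    limit = best + threshold
    return [
        (src, dst, weight)
        for src, dst, weight in arcs
        if from_start.get(src, math.inf) + weight + to_final.get(dst, math.inf) <= limit
    ]


# The unpruned FST from the example above: state 1 is final, state 2 is not.
unpruned = [(0, 0, 0.699999988), (0, 1, 0.299899995), (0, 2, 9.99999975e-05)]
pruned = prune_tropical(unpruned, start=0, finals={1: 0.0}, threshold=3.0)
for arc in pruned:
    print(*arc)
```

Under these assumptions the arc into state 2 is dropped because state 2 cannot reach a final state, so every path through it has infinite cost. The other two arcs lie on paths well within the threshold of 3.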

The demo can be found at /projects/speech/sys/kaldi-master/egs/rm/s5-avt26/demo/