This is one of the alignments that the Korean team reported on Wednesday, Oct. 26. Spot checks suggest the process worked. Go, Korea!
Verbmobil I alignment
This is one of the VM1 alignments that the German team reported on Friday, Oct. 21. Keep it up!
German data update
The German corpus is located at /projects/speech/corpus/VM1.
This directory contains numerous subdirectories for the various conversations in the corpus. Note that some of these are in Japanese, English, or a mix of English, German, and Denglisch. We only want data from directories starting with k, l, m, n, g, w (not all of which exist in the version of the corpus we have). There is documentation in the zipped file VM1/CLARINDocu.zip.
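As a rough illustration of the directory filter described above, here is a minimal Python sketch. It is not taken from the recipe's scripts; the function names and the sample directory names in the usage note are hypothetical, and it assumes the prefixes are lowercase as written above.

```python
import os

# Prefixes of the conversation directories we want, as listed above.
WANTED_PREFIXES = ("k", "l", "m", "n", "g", "w")


def select_conversation_dirs(names, prefixes=WANTED_PREFIXES):
    """Keep only the names that start with one of the wanted prefixes."""
    return sorted(name for name in names if name.startswith(prefixes))


def list_conversation_dirs(corpus_root):
    """List the wanted subdirectories of corpus_root (hypothetical helper)."""
    names = [entry.name for entry in os.scandir(corpus_root) if entry.is_dir()]
    return select_conversation_dirs(names)
```

With made-up names, select_conversation_dirs(["k001", "j002", "w010"]) keeps "k001" and "w010" and drops "j002". On the server, list_conversation_dirs would be pointed at the corpus path given above.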
The working directory for the VM1 recipe that we’re building is in kaldi-master/egs/vm1. Here you will find our version of run.sh and the various scripts it uses, most of which are in vm1/local or in a pre-existing directory somewhere in kaldi-master.
The script local/vm1_data_prep.py generates all of the necessary files in the appropriate format for Kaldi to process and places them in data/. Most of the data can be pulled either from the JSON files in the VM1 directory or from the documentation and metadata in VM1/CLARINDocu.zip, which contains a lexicon, speaker information, etc.
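For readers new to Kaldi, the sketch below shows the general shape of three files a Kaldi data directory usually contains (wav.scp, text, utt2spk). It is not the contents of local/vm1_data_prep.py: the function name, the record layout, and the IDs in the usage note are assumptions made for illustration only.

```python
import os


def write_kaldi_data_dir(records, out_dir):
    """Write wav.scp, text and utt2spk for the given records.

    records is an iterable of (utt_id, spk_id, wav_path, transcript) tuples;
    this layout is a hypothetical one chosen for the sketch.
    """
    os.makedirs(out_dir, exist_ok=True)
    rows = sorted(records)  # Kaldi expects these files sorted by utterance ID
    with open(os.path.join(out_dir, "wav.scp"), "w", encoding="utf-8") as wav, \
         open(os.path.join(out_dir, "text"), "w", encoding="utf-8") as txt, \
         open(os.path.join(out_dir, "utt2spk"), "w", encoding="utf-8") as u2s:
        for utt_id, spk_id, wav_path, transcript in rows:
            wav.write(f"{utt_id} {wav_path}\n")    # utterance ID -> audio file
            txt.write(f"{utt_id} {transcript}\n")  # utterance ID -> transcript
            u2s.write(f"{utt_id} {spk_id}\n")      # utterance ID -> speaker ID
```

A call such as write_kaldi_data_dir([("spkA-utt1", "spkA", "/tmp/utt1.wav", "guten tag")], "data/train") would write one line per file. Prefixing utterance IDs with the speaker ID is the usual Kaldi convention, because it keeps the sort order of utterances and speakers consistent.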
Spanish data update
Corpora for the Spanish Fisher-Callhome recipe are online, for instance /projects/ldc/ldc-standard-license/1996/LDC96T17/callhome_spanish_trans_970711. Paths have been added to the earlier post.
Korean data update
The LDC Korean data, with files extracted, is all on the server, except for the corpus from 1996. Paths have been added to the earlier post, e.g. /projects/ldc/ldc-standard-license/2003/LDC2003S0. At the outset we are concentrating on the data from 2003.
Kaldi-alignments-matlab
https://github.com/MatsRooth/Kaldi-alignments-matlab is a Matlab program that displays Kaldi alignments like the ones we looked at in class yesterday. Install it from GitHub. You also need to install the voicebox toolbox and put it on the Matlab path. Open display_switch.m in the editor, run it from the GUI (with the default argument that is supplied) and accept the suggestion to switch the Matlab directory. Data from the Librispeech corpus are read from sub-directories that are included with the distribution.
The above worked for me on a Mac different from the one where I developed it. Try it; it’s a good way of checking qualitatively how well the alignment is working. We want to include code in the project directories on Kay that generates data files like the ones in the matlab-mat and matlab-dat sub-directories.
LDC licenses
Please print the LDC standard license from here.
At the bottom, write your name, netid and signature. Then scan it and upload it to Confluence. Finally, ask Stacy to add you to the en-cl-ldc group.
fsttopsort
fsttopsort topologically sorts its input if acyclic, modifying it. Otherwise, the input is unchanged. When sorted, all transitions are from lower to higher state IDs. (documentation source)
Cyclic Example:
fstprint --isymbols=words.txt --osymbols=words.txt L1.fst
0 1 <eps> <eps>
1 2 a a
1 3 f f
2 0 <eps> <eps>
3 2 d d
When we apply fsttopsort, we get a warning saying “input FST is cyclic”.
fsttopsort L1.fst L1sorted.fst
WARNING: fsttopsort: Input FST is cyclic
fstprint --isymbols=words.txt --osymbols=words.txt L1sorted.fst
0 1 <eps> <eps>
1 2 a a
1 3 f f
2 0 <eps> <eps>
3 2 d d
As we expected, the output looks unchanged.
Acyclic Example:
fstprint --isymbols=words.txt --osymbols=words.txt L2.fst
1
0 2 <eps> <eps>
0 3 a a
2 5 f f
3 4 c c
4 6 b b
4 1 b b
5 3 d d
6 1 a a
When we run the operation,
fsttopsort L2.fst L2sorted.fst
fstprint --isymbols=words.txt --osymbols=words.txt L2sorted.fst
6
0 1 <eps> <eps>
0 3 a a
1 2 f f
2 3 d d
3 4 c c
4 5 b b
4 6 b b
5 6 a a
we can see that the state IDs are now topologically sorted.
(I produced all the images by running ‘fstdraw [fst filename] | dot -Tpng >[png filename]’)
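To make the renumbering in the two examples above concrete, here is a small Python sketch of a depth-first topological sort over the same arc lists. It is only an illustration of the idea, not OpenFst's code; the function names and the tuple representation of arcs are my own assumptions.

```python
def topsort_states(arcs, start=0):
    """Return {old_state: new_state} in topological order, or None if cyclic.

    arcs is a list of (source, destination, label) tuples.
    """
    successors, states = {}, {start}
    for src, dst, _label in arcs:
        successors.setdefault(src, []).append(dst)
        states.update((src, dst))

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the DFS stack, finished
    color = {s: WHITE for s in states}
    finished = []

    def visit(state):
        color[state] = GREY
        for nxt in successors.get(state, []):
            if color[nxt] == GREY:  # back edge: the graph has a cycle
                return False
            if color[nxt] == WHITE and not visit(nxt):
                return False
        color[state] = BLACK
        finished.append(state)
        return True

    for root in [start] + sorted(states - {start}):
        if color[root] == WHITE and not visit(root):
            return None
    # Reverse finishing order is a topological order.
    return {old: new for new, old in enumerate(reversed(finished))}


def renumber(arcs, mapping):
    """Apply a state mapping to every arc and sort the result."""
    return sorted((mapping[s], mapping[d], label) for s, d, label in arcs)


L1 = [(0, 1, "<eps>"), (1, 2, "a"), (1, 3, "f"), (2, 0, "<eps>"), (3, 2, "d")]
L2 = [(0, 2, "<eps>"), (0, 3, "a"), (2, 5, "f"), (3, 4, "c"),
      (4, 6, "b"), (4, 1, "b"), (5, 3, "d"), (6, 1, "a")]

print(topsort_states(L1))                # None, because L1 is cyclic
print(renumber(L2, topsort_states(L2)))  # same arcs as L2sorted.fst above
```

For L2 the mapping sends the old final state 1 to the new state 6, which agrees with the fstprint output of L2sorted.fst.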
fstequal
The fstequal command exits with a return code indicating whether or not the two compiled fsts passed in are equivalent. The return code will be 0 if they are equal, and nonzero if they are not.
The easiest way to see the return code is to type `echo $?` immediately after executing fstequal. This command shows the return code of the last executed command, and thus will show you a 0 if the two fsts are equal. For example, let’s consider a perfectly straight fst:
jms852@kay:~$ fstprint straight.fst
0 0 0 0
0 1 1 0
1 2 0 0
2 3 0 0
3 3 0 0
3 3 1 0
jms852@kay:~$ fstequal straight.fst straight.fst
jms852@kay:~$ echo $?
0
Then let’s add another fst that forks off instead of continuing straight, and compare that one:
jms852@kay:~$ fstprint fork.fst
0 0 0 0
0 1 1 0
1 2 0 0
2 3 0 0
2 4 1 0
3 Infinity
4 Infinity
jms852@kay:~$ fstequal straight.fst fork.fst
jms852@kay:~$ echo $?
2
Two fsts can be equivalent even if they are not the exact same file, or if their states are numbered differently. For example, if I create straight2.fst:
jms852@kay:~$ cat straight.txt
0 0 0 0
0 2 1 0
2 1 0 0
1 3 0 0
3 3 0 0
3 3 1 0
You can see that it is still straight, but the states are numbered differently.
However, fstequal still reports that they are equal:
jms852@kay:~$ fstequal straight.fst straight2.fst
jms852@kay:~$ echo $?
0
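Since fstequal reports its result only through the return code, a script has to read that code rather than any printed output. The following Python sketch shows one way to do so with the standard subprocess module. The helper names are hypothetical, and fsts_equal assumes that fstequal is on the PATH.

```python
import subprocess
import sys


def exit_code(command):
    """Run a command given as a list of strings and return its exit status."""
    return subprocess.run(command).returncode


def fsts_equal(fst_a, fst_b):
    """True if fstequal exits with 0 (assumes fstequal is on the PATH)."""
    return exit_code(["fstequal", fst_a, fst_b]) == 0


if __name__ == "__main__":
    # Stand-in command so the sketch runs without OpenFst installed:
    # a child Python process that exits with status 2.
    print(exit_code([sys.executable, "-c", "import sys; sys.exit(2)"]))
```

On Kay, fsts_equal("straight.fst", "fork.fst") should come out False, matching the nonzero return code shown in the transcript above.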
fstprune
According to the documentation:
This operation deletes states and arcs in the input FST that do not belong to a successful path whose weight is no more (w.r.t. the natural semiring order) than the threshold t ⊗-times the weight of the shortest path in the input FST.
Weights need to be commutative and have the path property. Both destructive and constructive implementations are available.
Example:
The fst:
0 0 0 0 0.699999988
0 1 0 0 0.299899995
0 2 0 0 9.99999975e-05
1
2 Infinity
Running
fstprune --weight=3 unpruned.fst pruned.fst
generates the new fst
0 0 0 0 0.699999988
0 1 0 0 0.299899995
1
State 2 has been removed, along with the transition to it.
The demo can be found at /projects/speech/sys/kaldi-master/egs/rm/s5-avt26/demo/