Category Archives: Kaldi

utils/prepare_lang.sh

Before I could get this command to run (Step 5 from Mats’ run.sh), we needed some files that we weren’t aware of. In total, you need the following files before this command will work. All paths are relative to run.sh.

  • data/local/dict/lexicon.txt
  • data/local/dict/nonsilence_phones.txt
  • data/local/dict/optional_silence.txt
  • data/local/dict/silence_phones.txt
  • path.sh

prepare_lang.sh will complain if any one of these is missing. The complaint about path.sh is less clear, since a missing path.sh seems to surface as other errors.

lexicon.txt contains one lexical entry per line: a word, a space, and then the phones in that word, separated by spaces.

nonsilence_phones.txt contains one phone symbol per line.

optional_silence.txt contains the symbol for an optional silence. This is just sil, but the file still needs to exist. Make sure that there is a newline at the end.

silence_phones.txt can be identical to optional_silence.txt.

path.sh can be copied from the RM (Resource Management) recipe, though you may need to edit the KALDI_ROOT variable, since it is set to a relative path.

The German versions of all of these can be seen in kaldi-master/egs/vm1.
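As a rough sketch of the layout described above, the following creates a toy dictionary directory. The two words and their phones are made-up placeholders, WORK_DIR is a hypothetical variable I introduce here, and a set this small may not satisfy every check in your Kaldi version; it only illustrates the file formats.

```shell
# Toy data/local/dict layout. Words and phones are invented placeholders.
# WORK_DIR (hypothetical) should point at the directory holding run.sh;
# it defaults to a scratch directory so the sketch is safe to try.
work_dir="${WORK_DIR:-$(mktemp -d)}"
dict_dir="$work_dir/data/local/dict"
mkdir -p "$dict_dir"

# lexicon.txt: a word, a space, then its phones separated by spaces.
cat > "$dict_dir/lexicon.txt" <<'EOF'
hello hh ah l ow
world w er l d
EOF

# nonsilence_phones.txt: one phone symbol per line, collected from the lexicon.
cut -d' ' -f2- "$dict_dir/lexicon.txt" | tr ' ' '\n' | sort -u \
  > "$dict_dir/nonsilence_phones.txt"

# optional_silence.txt: just "sil"; echo adds the trailing newline.
echo sil > "$dict_dir/optional_silence.txt"

# silence_phones.txt can be identical to optional_silence.txt.
cp "$dict_dir/optional_silence.txt" "$dict_dir/silence_phones.txt"

# Report anything that is missing or empty.
for f in lexicon.txt nonsilence_phones.txt optional_silence.txt silence_phones.txt; do
  [ -s "$dict_dir/$f" ] || echo "missing or empty: $dict_dir/$f" >&2
done
echo "wrote dictionary files to $dict_dir"
```

path.sh is left out on purpose, since it has to be copied and adapted from an existing recipe.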

fstrandgen

Based on the documentation:

This operation randomly generates a set of successful paths in the input FST. The operation relies on an ArcSelector object for randomly selecting an outgoing transition at a given state in the input FST. The default arc selector, UniformArcSelector, randomly selects a transition using the uniform distribution. LogProbArcSelector randomly selects a transition w.r.t. the weights treated as negative log probabilities after normalizing for the total weight leaving the state. In all cases, finality is treated as a transition to a super-final state.
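To make the LogProbArcSelector description concrete, here is a small sketch of the arithmetic only (it does not call OpenFst). The helper name arc_probs and the three weights are made up for illustration. Each weight w is read as a negative log probability, so the unnormalized probability is exp(-w), and dividing by the total leaving the state gives the selection probability. A final weight can be passed as one more argument, since finality is treated as a transition.

```shell
# arc_probs W1 W2 ...: hypothetical helper, not part of OpenFst.
# Treats each weight as -log(p), normalizes over all weights leaving the
# state, and prints one selection probability per line.
arc_probs() {
  printf '%s\n' "$@" | awk '
    { p[NR] = exp(-$1); total += p[NR] }
    END { for (i = 1; i <= NR; i++) printf "%.4f\n", p[i] / total }'
}

# Three arcs leaving one state, with invented weights 1.0, 2.0 and 3.0.
arc_probs 1.0 2.0 3.0
```

With these weights the first arc is chosen roughly two times out of three, whereas the uniform selector would pick each arc one time in three.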

_____________________________________

Example:

fstrandgen G.fst rand1.fst

fstprint --acceptor --isymbols=words.txt rand1.fst

0 1 WILL
1 2 WABASH
2 3 SEVENTEENTH
3 4 OCTOBER
4 5
5
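If you only want the generated word sequence, a small awk filter over the fstprint output is enough. This is a sketch under the assumption that the output looks like the acceptor listing above (source state, destination state, label); the function name fst_path_words is my own.

```shell
# fst_path_words: hypothetical helper that joins the labels (third column)
# of fstprint --acceptor output into one line. Lines with fewer than three
# fields (final states, unlabeled arcs) and <eps> labels are skipped.
fst_path_words() {
  awk 'NF >= 3 && $3 != "<eps>" { out = out (out == "" ? "" : " ") $3 }
       END { print out }'
}

# Demo on the listing shown above.
fst_path_words <<'EOF'
0 1 WILL
1 2 WABASH
2 3 SEVENTEENTH
3 4 OCTOBER
4 5
5
EOF
```

In practice you would pipe the fstprint command above straight into fst_path_words. Note that an unlabeled arc that carries a weight would have three fields and be mistaken for a label, so check the output on weighted FSTs.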

fstdraw --acceptor --isymbols=words.txt rand1.fst | dot -Tx11

rand2.fst

My demo is under /projects/speech/sys/kaldi-master/egs/rm/s5-sb2295/demo/fstrandgen and you can run it with source demo.sh

Kaldi – show-alignments

Once you have your alignments you might need to retrieve data from them.

Kaldi’s show-alignments generates an alignment file that is “readable for humans”. Here’s how to invoke it:

show-alignments "$basePath/data/lang/phones.txt" final.mdl ark:ali.1 > ali.1.txt
show-alignments "$basePath/data/lang/phones.txt" final.mdl ark:ali.2 > ali.2.txt
show-alignments "$basePath/data/lang/phones.txt" final.mdl ark:ali.3 > ali.3.txt
show-alignments "$basePath/data/lang/phones.txt" final.mdl ark:ali.4 > ali.4.txt

  1. The file phones.txt is in data/lang/;
  2. The file final.mdl is in exp/mono_ali/;
  3. The files ali.1, ali.2, ali.3, ali.4 are in exp/mono_ali/. They have to be unzipped (gunzip) before being used.
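With more than a handful of alignment files, a loop saves typing. The sketch below only prints the commands so you can inspect them first. The function name show_alignments_cmds is mine, /path/to/recipe is a placeholder for your $basePath, and I am assuming the compressed files are named ali.N.gz.

```shell
# show_alignments_cmds BASE_PATH N_JOBS: hypothetical helper that prints,
# for each job, the gunzip command and the show-alignments command from
# above. It runs nothing itself. Assumes compressed files named ali.N.gz.
show_alignments_cmds() (
  base_path=$1
  n_jobs=$2
  i=1
  while [ "$i" -le "$n_jobs" ]; do
    echo "gunzip ali.$i.gz"
    echo "show-alignments \"$base_path/data/lang/phones.txt\" final.mdl ark:ali.$i > ali.$i.txt"
    i=$((i + 1))
  done
)

# Print the commands for four alignment jobs.
show_alignments_cmds /path/to/recipe 4
```

Once the printed commands look right, run them from exp/mono_ali/, for example by piping the output to sh. Keep in mind that gunzip replaces each .gz file with its uncompressed version.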

Before running show-alignments, the alignment files look like this:

f01br16b22k1-s003 3 12 18 17 1826 1825 1825 1825 1828 1830 1829 1829 1829 1829 1256 1258 1257 1257 1260 1259 1259 1259 1259 1259 1259 362 361 361 361 364 363 363 363 366 2750 2749 2749 2749 2749 2749 2749 2749 2749 2752 2751 2751 2754 2753 2753 2753 1928 1927 1927 1930 1929 1929 1932 1931 1931 1931 1931 980 982 984 1826 1825 1825 1825 1828 1827 1827 1827 1830 1829 2678 2677 2677 2677 2680 2679 2682 2681 2486 2485 2485 2485 2485 2485 2485 2488 2487 2487 2490 2489 2489 2489 2504 2503 2503 2503 2503 2506 2508 2738 2737 2737 2737 2737 2740 2739 2739 2739 2742 974 973 973 973 973 976 978 1814 1816 1818 2504 2503 2503 2506 2505 2508 2507 2507 2507 4 14 15 15 15 15 15 15 12 10 10 10 10 10 10 10 10 10 10 10 10 10 10 18 17 17 17 17 17 17 17 17 17

The above is utterance number 3 (s003) for female speaker 1 (f01br16b22k1) in the West Point Brazilian Portuguese LDC Corpus.

After you run show-alignments, you will see:

f01br16b22k1-s003  [ 3 12 18 17 ] [ 1826 1825 1825 1825 1828 1830 1829 1829 1829 1829 ] [ 1256 1258 1257 1257 1260 1259 1259 1259 1259 1259 1259 ] [ 362 361 361 361 364 363 363 363 366 ] [ 2750 2749 2749 2749 2749 2749 2749 2749 2749 2752 2751 2751 2754 2753 2753 2753 ] [ 1928 1927 1927 1930 1929 1929 1932 1931 1931 1931 1931 ] [ 980 982 984 ] [ 1826 1825 1825 1825 1828 1827 1827 1827 1830 1829 ] [ 2678 2677 2677 2677 2680 2679 2682 2681 ] [ 2486 2485 2485 2485 2485 2485 2485 2488 2487 2487 2490 2489 2489 2489 ] [ 2504 2503 2503 2503 2503 2506 2508 ] [ 2738 2737 2737 2737 2737 2740 2739 2739 2739 2742 ] [ 974 973 973 973 973 976 978 ] [ 1814 1816 1818 ] [ 2504 2503 2503 2506 2505 2508 2507 2507 2507 ] [ 4 14 15 15 15 15 15 15 12 10 10 10 10 10 10 10 10 10 10 10 10 10 10 18 17 17 17 17 17 17 17 17 17 ]

f01br16b22k1-s003  SIL            m_B                                                   ew1_E                                                      a_B                                     v_I                                                                                 o1_E                                                       eh1_S           m_B                                                   ujn1_I                                      t_I                                                                       u_E                                    v_B                                                   eh1_I                           lj_I               u_E                                              SIL 

The bracketed groups above are the transition IDs (TIDs), one per frame, for each phone. The phone label is the text below, always aligned with the left-most TID of its group. Transition IDs are integers that encode the PDF (probability density function), the phone identity, and whether the transition is a self-loop or a forward transition.

The text part above corresponds to phones and the position they occupy in a word: _B is the beginning of a word, _E is the end, _I is a word-internal phone, and _S is a singleton (a word with one sound only).

The file above can be tweaked using sed, awk, or a similar tool of your choice, to look like this:

m ew1   a v o1  eh1     m ujn1 t u      v eh1 lj u

These are the phones of utterance s003 for female speaker 1 above, grouped by word (literal translation: “My grandfather is very old”; don’t blame the speaker, they were reading prompts for the corpus).
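One possible way to do that tweak with awk is sketched below. The function name phones_by_word and the " | " word separator are my own choices (the example above uses wider gaps instead). It skips the transition-ID lines, drops SIL, strips the _B/_I/_E/_S markers, and starts a new group after every _E or _S phone.

```shell
# phones_by_word: hypothetical helper that reads show-alignments output on
# stdin and prints the phones of each utterance grouped by word.
phones_by_word() {
  awk '
    /\[/ || NF < 2 { next }              # skip transition-ID and blank lines
    {
      out = ""
      for (i = 2; i <= NF; i++) {        # field 1 is the utterance ID
        if ($i == "SIL") continue
        phone = $i; pos = ""
        if (match(phone, /_[BEIS]$/)) {  # split off the position marker
          pos = substr(phone, RSTART + 1, 1)
          phone = substr(phone, 1, RSTART - 1)
        }
        # _E and _S close a word, so a separator follows them
        out = out phone ((pos == "E" || pos == "S") ? " | " : " ")
      }
      sub(/[ |]+$/, "", out)             # trim the trailing separator
      print out
    }'
}

# Demo on a shortened TID line plus the phone line of utterance s003.
phones_by_word <<'EOF'
f01br16b22k1-s003  [ 3 12 18 17 ] [ 1826 1825 ]
f01br16b22k1-s003  SIL  m_B  ew1_E  a_B  v_I  o1_E  eh1_S  m_B  ujn1_I  t_I  u_E  v_B  eh1_I  lj_I  u_E  SIL
EOF
# prints: m ew1 | a v o1 | eh1 | m ujn1 t u | v eh1 lj u
```

To process a whole file, redirect it into the function, for example phones_by_word < ali.1.txt.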

Fstclosure

The fstclosure command implements Kleene closure (Kleene star). That is, it converts a set of strings into the set of strings consisting of zero or more repetitions of strings in the input set. The command can also be used to emulate the “+” operator (one or more repetitions) with the --closure_plus flag.

For example, given a simple automaton representing the regular expression a(b|c):

========================
Initial automaton
fstprint --osymbols=words.txt --isymbols=words.txt L.fst
0 1 a a
1 2 b b
1 3 c c
2 4 <eps> <eps>
3 4 <eps> <eps>
4

The language a(b|c)
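The post does not show how L.fst was built. One way to recreate it, as a sketch, is to write the printed arcs to a text file and compile them with fstcompile. The symbol IDs in words.txt are my assumption (only the convention that &lt;eps&gt; is 0 is standard), and FST_DIR is a hypothetical variable that defaults to a scratch directory.

```shell
# Recreate the a(b|c) automaton from its text form.
# FST_DIR is a hypothetical override; by default a scratch directory is used.
fst_dir="${FST_DIR:-$(mktemp -d)}"

# Symbol table: the IDs are assumptions, apart from <eps> being 0.
cat > "$fst_dir/words.txt" <<'EOF'
<eps> 0
a 1
b 2
c 3
EOF

# Arcs are "source destination input output"; a lone number is a final state.
cat > "$fst_dir/L.txt" <<'EOF'
0 1 a a
1 2 b b
1 3 c c
2 4 <eps> <eps>
3 4 <eps> <eps>
4
EOF

# Compile only if OpenFst is on the PATH.
if command -v fstcompile >/dev/null 2>&1; then
  fstcompile --isymbols="$fst_dir/words.txt" --osymbols="$fst_dir/words.txt" \
    "$fst_dir/L.txt" "$fst_dir/L.fst"
  echo "compiled $fst_dir/L.fst"
else
  echo "fstcompile not found; wrote text files to $fst_dir only" >&2
fi
```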

Running fstclosure produces (a(b|c))*:

========================
Run fstclosure for Kleene Star
fstclosure L.fst Lstar.fst
fstprint --osymbols=words.txt --isymbols=words.txt Lstar.fst
5 0 <eps> <eps>
5
0 1 a a
1 2 b b
1 3 c c
2 4 <eps> <eps>
3 4 <eps> <eps>
4 0 <eps> <eps>
4
========================

The language (a(b|c))*

Running fstclosure --closure_plus instead produces (a(b|c))+:

========================
Run fstclosure for Kleene plus
fstclosure --closure_plus L.fst Lplus.fst
fstprint --osymbols=words.txt --isymbols=words.txt Lplus.fst
0 1 a a
1 2 b b
1 3 c c
2 4 <eps> <eps>
3 4 <eps> <eps>
4 0 <eps> <eps>
4

The language (a(b|c))+
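The only difference between the two languages is the empty string: the star version accepts it (the new initial state 5 is final), the plus version does not. As a quick sanity check that does not need OpenFst, the same two regular expressions can be tried with grep -E. The helper names in_star and in_plus are mine, and strings are written as concatenated symbols without spaces.

```shell
# Hypothetical helpers: succeed if the whole string is in the language.
in_star() { printf '%s\n' "$1" | grep -Eqx '(a(b|c))*'; }
in_plus() { printf '%s\n' "$1" | grep -Eqx '(a(b|c))+'; }

# Try a few strings against both languages.
for s in "" ab ac abac a abc; do
  if in_star "$s"; then star=yes; else star=no; fi
  if in_plus "$s"; then plus=yes; else plus=no; fi
  printf '"%s" star=%s plus=%s\n' "$s" "$star" "$plus"
done
```

Only the empty string gets different answers from the two helpers; "a" and "abc" are rejected by both because every repetition must be a complete "ab" or "ac".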

Resource management recipe

The initial parts of the Kaldi Resource Management recipe have been run on the server. See /projects/speech/sys/kaldi-trunk/egs/rm/s5. The Makefile has targets for building the demo, and some additional ones for examining things. For instance, this shows the representation of the lexicon as an FST:


make show_fst_lex
/projects/speech/sys/kaldi-trunk/tools/openfst/bin/fstprint --isymbols=data/lang/phones.txt --osymbols=data/lang/words.txt data/lang/L.fst | head -50
...
1    157    ax_B    AROUND
1    160    ax_B    ARRIVAL
1    164    ax_B    ARRIVE
1    167    ax_B    ARRIVED
1    171    ax_B    ARRIVING
1    176    eh_B    ARROW
1    178    ax_B    AS
1    179    ax_B    ASTORIA
1    185    ey_B    ASUW
1    194    ey_B    ASW