Once you have your alignments you might need to retrieve data from them.
Kaldi’s show-alignments generates an alignment file that is “readable for humans”. Here’s how to invoke it:
show-alignments $basePath”/data/lang/phones.txt” final.mdl ark:ali.1 > ali.1.txt
show-alignments $basePath”/data/lang/phones.txt” final.mdl ark:ali.2 > ali.2.txt
show-alignments $basePath”/data/lang/phones.txt” final.mdl ark:ali.3 > ali.3.txt
show-alignments $basePath”/data/lang/phones.txt” final.mdl ark:ali.4 > ali.4.txt
- The file phones.txt is in data/lang/;
- The file final.mdl is in exp/mono_ali/;
- The files ali.1, ali.2, ali.3, ali.4 are in exp/mono_ali/. They have to be unziped (gunzip) before being used.
Before running show-alignments, the alignment files look like this:
f01br16b22k1-s003 3 12 18 17 1826 1825 1825 1825 1828 1830 1829 1829 1829 1829 1256 1258 1257 1257 1260 1259 1259 1259 1259 1259 1259 362 361 361 361 364 363 363 363 366 2750 2749 2749 2749 2749 2749 2749 2749 2749 2752 2751 2751 2754 2753 2753 2753 1928 1927 1927 1930 1929 1929 1932 1931 1931 1931 1931 980 982 984 1826 1825 1825 1825 1828 1827 1827 1827 1830 1829 2678 2677 2677 2677 2680 2679 2682 2681 2486 2485 2485 2485 2485 2485 2485 2488 2487 2487 2490 2489 2489 2489 2504 2503 2503 2503 2503 2506 2508 2738 2737 2737 2737 2737 2740 2739 2739 2739 2742 974 973 973 973 973 976 978 1814 1816 1818 2504 2503 2503 2506 2505 2508 2507 2507 2507 4 14 15 15 15 15 15 15 12 10 10 10 10 10 10 10 10 10 10 10 10 10 10 18 17 17 17 17 17 17 17 17 17
The above is utterance number 3 (s003) for female speaker 1 (f01br16b22k1) in the West Point Brazilian Portuguese LDC Corpus.
After your run show-alignments, you will see:
f01br16b22k1-s003 [ 3 12 18 17 ] [ 1826 1825 1825 1825 1828 1830 1829 1829 1829 1829 ] [ 1256 1258 1257 1257 1260 1259 1259 1259 1259 1259 1259 ] [ 362 361 361 361 364 363 363 363 366 ] [ 2750 2749 2749 2749 2749 2749 2749 2749 2749 2752 2751 2751 2754 2753 2753 2753 ] [ 1928 1927 1927 1930 1929 1929 1932 1931 1931 1931 1931 ] [ 980 982 984 ] [ 1826 1825 1825 1825 1828 1827 1827 1827 1830 1829 ] [ 2678 2677 2677 2677 2680 2679 2682 2681 ] [ 2486 2485 2485 2485 2485 2485 2485 2488 2487 2487 2490 2489 2489 2489 ] [ 2504 2503 2503 2503 2503 2506 2508 ] [ 2738 2737 2737 2737 2737 2740 2739 2739 2739 2742 ] [ 974 973 973 973 973 976 978 ] [ 1814 1816 1818 ] [ 2504 2503 2503 2506 2505 2508 2507 2507 2507 ] [ 4 14 15 15 15 15 15 15 12 10 10 10 10 10 10 10 10 10 10 10 10 10 10 18 17 17 17 17 17 17 17 17 17 ]
f01br16b22k1-s003 SIL m_B ew1_E a_B v_I o1_E eh1_S m_B ujn1_I t_I u_E v_B eh1_I lj_I u_E SIL
The above are the transition IDs (per frame) of each phone, and the phone is the text below – always aligned with the left-most TID – transition IDs are integers that encode the PDFs (probability density function), the phone identity, and information about self-loops or forward transitions.
The text part above corresponds to phones and the position they occupy in a word: _B is the beginning of a word, _E is the end, _I is word-internal phone, and _S is a singleton (word with one sound only).
The file above can be tweaked using your preference of sed awk, or similar, to look like this:
m ew1 a v o1 eh1 m ujn1 t u v eh1 lj u
These are the aligned words of utterance s003 for female speaker 1 above (literal translation: “My grandfather is very old” – don’t blame the speaker, they were reading prompts for the corpus)