Author Archives: Mats Rooth

Korean data

This is the Korean speech and associated data available from LDC. Some corpora duplicate material found in others. I included the finite-state morphology and the morphologically annotated text.

  1. LDC2006S42 Korean Broadcast News Speech
    /projects/ldc/ldc-standard-license/2006/LDC2006S42
  2. LDC2006T14 Korean Broadcast News Transcripts
    /projects/ldc/ldc-standard-license/2006/LDC2006T14
  3. LDC2006S36 West Point Korean Speech
    /projects/ldc/ldc-standard-license/2006/LDC2006S36
  4. LDC2004L01 Klex: Finite-State Lexical Transducer for Korean
    /projects/ldc/ldc-standard-license/2004/LDC2004L01
  5. LDC2004T03 Morphologically Annotated Korean Text
    /projects/ldc/ldc-standard-license/2004/LDC2004T03
  6. LDC2003S07 Korean Telephone Conversations Complete Set
    /projects/ldc/ldc-standard-license/2003/LDC2003S07
  7. LDC2003L02 Korean Telephone Conversations Lexicon
    /projects/ldc/ldc-standard-license/2003/LDC2003L02
  8. LDC2003S03 Korean Telephone Conversations Speech
    /projects/ldc/ldc-standard-license/2003/LDC2003S03
  9. LDC2003T08 Korean Telephone Conversations Transcripts
    /projects/ldc/ldc-standard-license/2003/LDC2003T08
  10. LDC96S54 CALLFRIEND Korean

Spanish data

This is the Spanish speech and associated data available from LDC. There appear to be duplications, so we should work back from the later publications.

  1. LDC2014T23 Fisher and CALLHOME Spanish–English Speech Translation (not on server)
  2. LDC2010T04 Fisher Spanish – Transcripts
    /projects/ldc/ldc-standard-license/2010/LDC2010T04
  3. LDC2010S01 Fisher Spanish Speech
    /projects/ldc/ldc-standard-license/2010/LDC2010S01
  4. LDC2006S37 West Point Heroico Spanish Speech (not on server)
  5. 1997 HUB5 Spanish Transcripts (not on server)
  6. LDC2002S25 1997 HUB5 Spanish Evaluation
  7. LDC2001T61 CALLHOME Spanish Dialogue Act Annotation
  8. LDC98S74 1997 Spanish Broadcast News Speech (HUB4-NE)
  9. 1997 Spanish Broadcast News Transcripts (HUB4-NE)
  10. LDC98T29 HUB5 Spanish Telephone Speech Corpus
  11. LDC98T27 HUB5 Spanish Transcripts
  12. LDC96S57 CALLFRIEND Spanish-Caribbean Dialect
  13. LDC96S58 CALLFRIEND Spanish-Non-Caribbean Dialect
  14. LDC96L16 CALLHOME Spanish Lexicon
    /projects/ldc/ldc-standard-license/1996/LDC96L16
  15. LDC96S35 CALLHOME Spanish Speech
    /projects/ldc/ldc-standard-license/1996/LDC96S35
  16. LDC96T17 CALLHOME Spanish Transcripts
    /projects/ldc/ldc-standard-license/1996/LDC96T17
  17. LDC96S57 CALLFRIEND Spanish-Caribbean Dialect
  18. LDC96S58 CALLFRIEND Spanish-Non-Caribbean Dialect
  19. LDC95S28 LATINO-40 Spanish Read News
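
The "not on server" notes above can be rechecked mechanically. The following is a minimal sketch, assuming every corpus sits at ROOT/YEAR/ID as in the paths listed; the helper names ldc_year and ldc_path and the LDC_ROOT variable are made up for illustration, and the ID list is only a sample.

```shell
#!/bin/sh
# Root directory taken from the paths listed above; override by setting LDC_ROOT.
LDC_ROOT=${LDC_ROOT:-/projects/ldc/ldc-standard-license}

# Map a catalog ID to its publication year, following the pattern of the
# listed paths (LDC2010T04 -> 2010, LDC96L16 -> 1996). Hypothetical helper.
ldc_year() {
  digits=$(printf '%s\n' "$1" | sed 's/^LDC\([0-9]*\).*$/\1/')
  case ${#digits} in
    4) printf '%s\n' "$digits" ;;
    2) printf '19%s\n' "$digits" ;;
    *) return 1 ;;
  esac
}

# Build the directory where a corpus is expected to live. Hypothetical helper.
ldc_path() {
  year=$(ldc_year "$1") || return 1
  printf '%s/%s/%s\n' "$LDC_ROOT" "$year" "$1"
}

# Report, for a sample of IDs from the list, whether the directory exists.
for id in LDC2014T23 LDC2010T04 LDC2010S01 LDC2006S37 LDC96L16 LDC96S35 LDC96T17; do
  dir=$(ldc_path "$id")
  if [ -d "$dir" ]; then
    echo "$id  on server      $dir"
  else
    echo "$id  not on server  (looked for $dir)"
  fi
done
```

Entries listed without a catalog ID cannot be checked this way.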

Fstprint and fstcompile

The command-line program fstprint prints a machine in text format. Without the arguments --isymbols and --osymbols, labels are printed as numbers; with them, labels are printed as strings. Conversely, fstcompile compiles a text-format file into a binary machine. If the source file writes labels as strings, the symbol arguments must be supplied.
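
To see the round trip on something small, here is a minimal sketch that writes a two-arc machine in text format, compiles it, and prints it back with and without symbols. The file names (isyms.txt, osyms.txt, toy.fstx, toy.fst) and the symbols are made up for illustration, and the script assumes the OpenFst tools are on PATH; if they are not, it only writes the text files.

```shell
#!/bin/sh
# Toy round trip: text -> binary -> text. All file names here are hypothetical.
work=$(mktemp -d)
cd "$work" || exit 1

# Symbol tables: one "symbol integer" pair per line; 0 is conventionally epsilon.
cat > isyms.txt <<'EOF'
<eps> 0
a 1
b 2
EOF
cat > osyms.txt <<'EOF'
<eps> 0
X 1
Y 2
EOF

# Text-format machine: "src dst ilabel olabel [weight]" per arc,
# followed by a line naming the final state.
printf '0\t1\ta\tX\t0.5\n1\t2\tb\tY\n2\n' > toy.fstx

if command -v fstcompile >/dev/null 2>&1 && command -v fstprint >/dev/null 2>&1; then
  # Labels in toy.fstx are strings, so the symbol tables are needed to compile.
  fstcompile --isymbols=isyms.txt --osymbols=osyms.txt toy.fstx toy.fst
  echo "with symbols:"
  fstprint --isymbols=isyms.txt --osymbols=osyms.txt toy.fst
  echo "without symbols:"
  fstprint toy.fst
else
  echo "OpenFst tools not found on PATH; wrote $work/toy.fstx only"
fi
```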

The demo uses the lexicon L.fst from the Resource Management recipe. Here is the
result of running /home/mr249/s5-mr249/demo/fstprint_fstcompile/demo.sh.

=========================
LANG=../../data/lang
echo $LANG
../../data/lang
=========================
Run fstprint without symbols.
fstprint $LANG/L.fst | head
0	1	0	0	0.693147182
0	1	1	0	0.693147182
1	2	5	1	0.693147182
1	1	5	1	0.693147182
1	2	29	2	0.693147182
1	1	29	2	0.693147182
1	3	74	3
1	15	74	4
1	21	10	5
1	26	26	6
=========================
Run fstprint with symbols.
fstprint --osymbols=$LANG/words.txt --isymbols=$LANG/phones.txt $LANG/L.fst | head
0	1			0.693147182
0	1	sil		0.693147182
1	2	sil_S	!SIL	0.693147182
1	1	sil_S	!SIL	0.693147182
1	2	ax_S	A	0.693147182
1	1	ax_S	A	0.693147182
1	3	ey_B	A42128
1	15	ey_B	AAW
1	21	ae_B	ABERDEEN
1	26	ax_B	ABOARD
=========================
The same, putting the result in L.fstx
fstprint --osymbols=$LANG/words.txt --isymbols=$LANG/phones.txt $LANG/L.fst > L.fstx
head L.fstx
0	1			0.693147182
0	1	sil		0.693147182
1	2	sil_S	!SIL	0.693147182
1	1	sil_S	!SIL	0.693147182
1	2	ax_S	A	0.693147182
1	1	ax_S	A	0.693147182
1	3	ey_B	A42128
1	15	ey_B	AAW
1	21	ae_B	ABERDEEN
1	26	ax_B	ABOARD
=========================
Compile L.fstx as L2.fst
fstcompile --osymbols=$LANG/words.txt --isymbols=$LANG/phones.txt L.fstx L2.fst
Print L2.fst with symbols.
fstprint --osymbols=$LANG/words.txt --isymbols=$LANG/phones.txt L2.fst | head
0	1			0.693147182
0	1	sil		0.693147182
1	2	sil_S	!SIL	0.693147182
1	1	sil_S	!SIL	0.693147182
1	2	ax_S	A	0.693147182
1	1	ax_S	A	0.693147182
1	3	ey_B	A42128
1	4	ey_B	AAW
1	5	ae_B	ABERDEEN
1	6	ax_B	ABOARD
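
In the last listing the destination states of the final three arcs read 4, 5, 6, where the dump of the original L.fst had 15, 21, 26. This is consistent with fstcompile renumbering states in order of first appearance, which I understand to be its default unless --keep_state_numbering is given; the demo itself does not verify that. One quick sanity check that ignores state numbers is to compare only the label columns of the two text dumps. The sketch below does this on rows copied from the listings above; a.txt and b.txt are hypothetical file names.

```shell
#!/bin/sh
# Compare two fstprint dumps while ignoring state numbers (columns 1 and 2).
# a.txt and b.txt are hypothetical; the rows are copied from the listings above.
tmp=$(mktemp -d)
printf '1\t3\tey_B\tA42128\n1\t15\tey_B\tAAW\n1\t21\tae_B\tABERDEEN\n1\t26\tax_B\tABOARD\n' > "$tmp/a.txt"
printf '1\t3\tey_B\tA42128\n1\t4\tey_B\tAAW\n1\t5\tae_B\tABERDEEN\n1\t6\tax_B\tABOARD\n' > "$tmp/b.txt"

# Keep the label columns (and the weight, when present); drop the state columns.
labels() { cut -f3- "$1"; }

labels "$tmp/a.txt" > "$tmp/a.labels"
labels "$tmp/b.txt" > "$tmp/b.labels"

if cmp -s "$tmp/a.txt" "$tmp/b.txt"; then same_raw=yes; else same_raw=no; fi
if cmp -s "$tmp/a.labels" "$tmp/b.labels"; then same_labels=yes; else same_labels=no; fi

echo "raw dumps identical: $same_raw"
echo "label columns identical: $same_labels"
```

This is only a line-by-line heuristic: it depends on the arcs being printed in the same order, and it says nothing about whether the renumbered states are wired together the same way.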

Resource management recipe

The initial parts of the Kaldi Resource Management recipe have been run on the server. See /projects/speech/sys/kaldi-trunk/egs/rm/s5. The Makefile has targets for building the demo, and some additional ones for examining things. For instance, this shows the representation of the lexicon as an FST:


make show_fst_lex
/projects/speech/sys/kaldi-trunk/tools/openfst/bin/fstprint --isymbols=data/lang/phones.txt --osymbols=data/lang/words.txt data/lang/L.fst | head -50
...
1    157    ax_B    AROUND
1    160    ax_B    ARRIVAL
1    164    ax_B    ARRIVE
1    167    ax_B    ARRIVED
1    171    ax_B    ARRIVING
1    176    eh_B    ARROW
1    178    ax_B    AS
1    179    ax_B    ASTORIA
1    185    ey_B    ASUW
1    194    ey_B    ASW
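
In the rows shown, the word label sits on the arc leaving state 1 that consumes the word's first phone. Under that reading, a small filter can pull out the first arc for a given word. This is a minimal sketch on rows copied from the output above; the sample file and the first_arc helper are made up for illustration.

```shell
#!/bin/sh
# Look up the first arc for a word in an fstprint dump of the lexicon.
# The sample file is hypothetical; its rows are copied from the output above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1    157    ax_B    AROUND
1    160    ax_B    ARRIVAL
1    164    ax_B    ARRIVE
1    176    eh_B    ARROW
1    178    ax_B    AS
EOF

# Print "word first-phone next-state" for rows whose output label is exactly the word.
first_arc() {
  awk -v w="$1" '$4 == w { print $4, $3, $2 }' "$2"
}

first_arc ARRIVE "$sample"   # prints: ARRIVE ax_B 164
first_arc AS "$sample"       # prints: AS ax_B 178
```

On the server the same filter could read the fstprint output directly instead of a sample file. Rows with an empty label column would be misparsed by awk's default whitespace splitting, so the filter is only meant for word-bearing arcs like the ones shown.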