LDC member years for Cornell

These are Cornell’s membership years in LDC. For subscription years, a corpus costs nothing additional and is most likely already in the lab. For standard years, there may or may not be a cost, depending on how many corpora have already been obtained. For other years, we get the data at half price, which still tends to be a lot. So it is good to start with recent years, which are all subscription years.

Membership Years

    1995 (Not-for-Profit, Standard)
    1998 (Not-for-Profit, Standard)
    1999 (Not-for-Profit, Standard)
    2000 (Not-for-Profit, Standard)
    2002 (Not-for-Profit, Standard)
    2003 (Not-for-Profit, Standard)
    2004 (Not-for-Profit, Standard)
    2005 (Not-for-Profit, Subscription)
    2006 (Not-for-Profit, Standard)
    2007 (Not-for-Profit, Standard)
    2008 (Not-for-Profit, Subscription)
    2009 (Not-for-Profit, Subscription)
    2010 (Not-for-Profit, Subscription)
    2011 (Not-for-Profit, Subscription)
    2012 (Not-for-Profit, Subscription)
    2013 (Not-for-Profit, Subscription)
    2014 (Not-for-Profit, Subscription)
    2015 (Not-for-Profit, Subscription)
    2016 (Not-for-Profit, Subscription)

Korean data

These are Korean speech and associated data available from LDC. There are duplications. I included the finite state morphology and morphologically annotated text.

  1. LDC2006S42 Korean Broadcast News Speech
    /projects/ldc/ldc-standard-license/2006/LDC2006S42
  2. LDC2006T14 Korean Broadcast News Transcripts
    /projects/ldc/ldc-standard-license/2006/LDC2006T14
  3. LDC2006S36 West Point Korean Speech
    /projects/ldc/ldc-standard-license/2006/LDC2006S36
  4. LDC2004L01 Klex: Finite-State Lexical Transducer for Korean
    /projects/ldc/ldc-standard-license/2004/LDC2004L01
  5. LDC2004T03 Morphologically Annotated Korean Text
    /projects/ldc/ldc-standard-license/2004/LDC2004T03
  6. LDC2003S07 Korean Telephone Conversations Complete Set
    /projects/ldc/ldc-standard-license/2003/LDC2003S07
  7. LDC2003L02 Korean Telephone Conversations Lexicon
    /projects/ldc/ldc-standard-license/2003/LDC2003L02
  8. LDC2003S03 Korean Telephone Conversations Speech
    /projects/ldc/ldc-standard-license/2003/LDC2003S03
  9. LDC2003T08 Korean Telephone Conversations Transcripts
    /projects/ldc/ldc-standard-license/2003/LDC2003T08
  10. LDC96S54 CALLFRIEND Korean

Spanish data

These are Spanish speech and associated data available from LDC. There appear to be duplications, so we should work back from the later publications.

  1. LDC2014T23 Fisher and CALLHOME Spanish–English Speech Translation (not on server)
  2. LDC2010T04 Fisher Spanish – Transcripts
    /projects/ldc/ldc-standard-license/2010/LDC2010T04
  3. LDC2010S01 Fisher Spanish Speech
    /projects/ldc/ldc-standard-license/2010/LDC2010S01
  4. LDC2006S37 West Point Heroico Spanish Speech (not on server)
  5. 1997 HUB5 Spanish Transcripts (not on server)
  6. LDC2002S25 1997 HUB5 Spanish Evaluation
  7. LDC2001T61 CALLHOME Spanish Dialogue Act Annotation
  8. LDC98S74 1997 Spanish Broadcast News Speech (HUB4-NE)
  9. 1997 Spanish Broadcast News Transcripts (HUB4-NE)
  10. LDC98T29 HUB5 Spanish Telephone Speech Corpus
  11. LDC98T27 HUB5 Spanish Transcripts
  12. LDC96S57 CALLFRIEND Spanish-Caribbean Dialect
  13. LDC96S58 CALLFRIEND Spanish-Non-Caribbean Dialect
  14. LDC96L16 CALLHOME Spanish Lexicon
    /projects/ldc/ldc-standard-license/1996/LDC96L16
  15. LDC96S35 CALLHOME Spanish Speech
    /projects/ldc/ldc-standard-license/1996/LDC96S35
  16. LDC96T17 CALLHOME Spanish Transcripts
    /projects/ldc/ldc-standard-license/1996/LDC96T17
  17. LDC96S57 CALLFRIEND Spanish-Caribbean Dialect
  18. LDC96S58 CALLFRIEND Spanish-Non-Caribbean Dialect
  19. LDC95S28 LATINO-40 Spanish Read News
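
The server paths listed above all follow the pattern root/year/catalog ID, so the expected location of a corpus can be derived from its catalog ID. The sketch below assumes that this pattern holds for every corpus and that two-digit years in old-style IDs are 19xx; the function name `ldc_path` is made up for illustration. A derived path does not mean the corpus is actually on the server.

```python
import re

# Root directory taken from the paths listed above.
LDC_ROOT = "/projects/ldc/ldc-standard-license"


def ldc_path(catalog_id):
    """Guess the server directory for an LDC catalog ID such as LDC2006S42."""
    match = re.fullmatch(r"LDC(\d{4}|\d{2})([A-Z])(\d+)", catalog_id)
    if match is None:
        raise ValueError(f"not an LDC catalog ID: {catalog_id!r}")
    year = match.group(1)
    if len(year) == 2:
        # Assumption: old-style two-digit years (e.g. LDC96L16) are 19xx.
        year = "19" + year
    return f"{LDC_ROOT}/{year}/{catalog_id}"


print(ldc_path("LDC2006S42"))
print(ldc_path("LDC96L16"))
```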

fsttopsort

fsttopsort topologically sorts its input if acyclic, modifying it. Otherwise, the input is unchanged. When sorted, all transitions are from lower to higher state IDs. (documentation source)

Cyclic Example:

fstprint --isymbols=words.txt --osymbols=words.txt L1.fst

0 1 <eps> <eps>
1 2 a a
1 3 f f
2 0 <eps> <eps>
3 2 d d

[Figure: cyclicFST]

When we apply fsttopsort, we get a warning saying “Input FST is cyclic”.

fsttopsort L1.fst L1sorted.fst
WARNING: fsttopsort: Input FST is cyclic
fstprint --isymbols=words.txt --osymbols=words.txt L1sorted.fst

0 1 <eps> <eps>
1 2 a a
1 3 f f
2 0 <eps> <eps>
3 2 d d

[Figure: cyclicFSTSorted]

As we expected, the output looks unchanged.

Acyclic Example:

fstprint --isymbols=words.txt --osymbols=words.txt L2.fst

1
0 2 <eps> <eps>
0 3 a a
2 5 f f
3 4 c c
4 6 b b
4 1 b b
5 3 d d
6 1 a a

[Figure: acyclicFST]

When we run the operation,

fsttopsort L2.fst L2sorted.fst
fstprint --isymbols=words.txt --osymbols=words.txt L2sorted.fst

6
0 1 <eps> <eps>
0 3 a a
1 2 f f
2 3 d d
3 4 c c
4 5 b b
4 6 b b
5 6 a a

[Figure: acyclicFSTSorted]

we can see that the state IDs are topologically sorted.
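
To make the renumbering concrete, here is a minimal pure-Python sketch of a depth-first topological sort over arc lists like the ones printed above. It is an illustration, not OpenFst's implementation: the function name `topsort` and the tuple layout are assumptions, and it assumes every state is reachable from the start state.

```python
def topsort(arcs, start=0):
    """Renumber states so every arc goes from a lower to a higher ID.

    arcs: list of (src, dst, ilabel, olabel) tuples.
    Returns (arcs, True) when sorted, or the input unchanged with False
    when a cycle is found, mirroring the behaviour described above.
    """
    states = sorted({s for arc in arcs for s in arc[:2]} | {start})
    out = {s: [] for s in states}
    for src, dst, _, _ in arcs:
        out[src].append(dst)

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the DFS stack, finished
    color = {s: WHITE for s in states}
    finished = []

    def visit(state):
        color[state] = GREY
        for nxt in out[state]:
            if color[nxt] == GREY:  # back edge: the FST is cyclic
                return False
            if color[nxt] == WHITE and not visit(nxt):
                return False
        color[state] = BLACK
        finished.append(state)
        return True

    if not visit(start):
        return list(arcs), False

    # Reverse finishing order is a topological order.
    new_id = {old: new for new, old in enumerate(reversed(finished))}
    renumbered = sorted((new_id[s], new_id[d], i, o) for s, d, i, o in arcs)
    return renumbered, True


cyclic = [(0, 1, "<eps>", "<eps>"), (1, 2, "a", "a"), (1, 3, "f", "f"),
          (2, 0, "<eps>", "<eps>"), (3, 2, "d", "d")]
acyclic = [(0, 2, "<eps>", "<eps>"), (0, 3, "a", "a"), (2, 5, "f", "f"),
           (3, 4, "c", "c"), (4, 6, "b", "b"), (4, 1, "b", "b"),
           (5, 3, "d", "d"), (6, 1, "a", "a")]

cyclic_result, cyclic_sorted = topsort(cyclic)
acyclic_result, acyclic_sorted = topsort(acyclic)
for arc in acyclic_result:
    print(*arc)
```

On the acyclic example this reproduces the arc listing of L2sorted.fst shown above, and on the cyclic example it returns the arcs unchanged.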

(I produced all the images by running ‘fstdraw [fst filename] | dot -Tpng >[png filename]’)

fstequal

The fstequal command exits with a return code indicating whether or not the two compiled fsts passed in are equivalent. The return code will be 0 if they are equal, and nonzero if they are not.

The easiest way to see the return code is to type `echo $?` immediately after executing fstequal. This command shows the return code of the last executed command, and thus will show you a 0 if the two fsts are equal. For example, let’s consider a perfectly straight fst:

jms852@kay:~$ fstprint straight.fst
0 0 0 0
0 1 1 0
1 2 0 0
2 3 0 0
3 3 0 0
3 3 1 0
jms852@kay:~$ fstequal straight.fst straight.fst
jms852@kay:~$ echo $?
0

Then let’s add another fst that forks off instead of continuing straight, and compare that one:

jms852@kay:~$ fstprint fork.fst
0 0 0 0
0 1 1 0
1 2 0 0
2 3 0 0
2 4 1 0
3 Infinity
4 Infinity
jms852@kay:~$ fstequal straight.fst fork.fst
jms852@kay:~$ echo $?
2

Two fsts can be equivalent even if they are not the exact same file, or if their nodes are labelled differently in the source text. For example, suppose I create straight2.fst from:

jms852@kay:~$ cat straight2.txt
0 0 0 0
0 2 1 0
2 1 0 0
1 3 0 0
3 3 0 0
3 3 1 0

You can see that it is still straight, but the nodes have different labels.

However, we see that fstequal still says that they are equal:

jms852@kay:~$ fstequal straight.fst straight2.fst
jms852@kay:~$ echo $?
0
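
In a script, it is usually more convenient to branch on the return code than to echo it. The sketch below wraps the check in Python; `return_code` and `fsts_equal` are hypothetical helper names, and `fsts_equal` assumes the `fstequal` binary is on the PATH. The demonstration at the bottom uses stand-in commands that simply exit with 0 and 2, so it runs even without OpenFst installed.

```python
import subprocess
import sys


def return_code(argv):
    """Run a command and return its exit status (what `echo $?` would show)."""
    return subprocess.run(argv).returncode


def fsts_equal(path_a, path_b):
    """True if fstequal exits with 0 for the two compiled FST files."""
    # Assumption: the OpenFst `fstequal` binary is installed and on PATH.
    return return_code(["fstequal", path_a, path_b]) == 0


# Stand-ins for an "equal" and a "not equal" run of fstequal.
equal_status = return_code([sys.executable, "-c", "raise SystemExit(0)"])
differ_status = return_code([sys.executable, "-c", "raise SystemExit(2)"])
print(equal_status, differ_status)
```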

fstprune

According to the documentation:

This operation deletes states and arcs in the input FST that do not belong to a successful path whose weight is no more (w.r.t. the natural semiring order) than the threshold t ⊗-times the weight of the shortest path in the input FST.

Weights need to be commutative and have the path property. Both destructive and constructive implementations are available.

Example:

[Figure: unprunedfst]

The fst:

0 0 0 0 0.699999988
0 1 0 0 0.299899995
0 2 0 0 9.99999975e-05
1
2 Infinity

Running

fstprune --weight=3 unpruned.fst pruned.fst

generates the new fst:

0 0 0 0 0.699999988
0 1 0 0 0.299899995
1

State 2 has been removed, along with the transition to that state.

The demo can be found at /projects/speech/sys/kaldi-master/egs/rm/s5-avt26/demo/
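
The pruning rule quoted above can be sketched in pure Python for the tropical semiring, where ⊗ is ordinary addition and a smaller weight is better. This is an illustration under that assumption, not OpenFst's implementation; the names `shortest` and `prune` and the data layout are made up for the example. An arc is kept only if the best successful path through it weighs no more than the shortest path plus the threshold.

```python
import math


def shortest(n_states, arcs, sources):
    """Shortest distance to every state, starting from `sources`.

    sources: {state: initial weight}. Simple Bellman-Ford style relaxation;
    assumes there are no negative-weight cycles.
    """
    dist = [math.inf] * n_states
    for state, weight in sources.items():
        dist[state] = weight
    for _ in range(n_states):
        changed = False
        for src, dst, weight in arcs:
            if dist[src] + weight < dist[dst]:
                dist[dst] = dist[src] + weight
                changed = True
        if not changed:
            break
    return dist


def prune(n_states, arcs, finals, threshold, start=0):
    """Keep arcs lying on a successful path within `threshold` of the best.

    arcs: list of (src, dst, weight); finals: {state: final weight}.
    """
    from_start = shortest(n_states, arcs, {start: 0.0})
    reversed_arcs = [(dst, src, w) for src, dst, w in arcs]
    to_final = shortest(n_states, reversed_arcs, finals)
    limit = to_final[start] + threshold  # shortest path (x) threshold
    if math.isinf(limit):
        return []  # no successful path at all
    return [(s, d, w) for s, d, w in arcs
            if from_start[s] + w + to_final[d] <= limit]


# The unpruned FST from above: state 1 is final, state 2 is not.
arcs = [(0, 0, 0.7), (0, 1, 0.2999), (0, 2, 0.0001)]
pruned = prune(3, arcs, finals={1: 0.0}, threshold=3)
print(pruned)
```

With the threshold of 3, the arc into state 2 is dropped because no successful path passes through it, which matches the pruned listing above.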

fstrandgen

Based on the documentation:

This operation randomly generates a set of successful paths in the input FST. The operation relies on an ArcSelector object for randomly selecting an outgoing transition at a given state in the input FST. The default arc selector, UniformArcSelector, randomly selects a transition using the uniform distribution. LogProbArcSelector randomly selects a transition w.r.t. the weights treated as negative log probabilities after normalizing for the total weight leaving the state. In all cases, finality is treated as a transition to a super-final state.

_____________________________________

Example:

fstrandgen G.fst rand1.fst

fstprint --acceptor --isymbols=words.txt rand1.fst

0 1 WILL
1 2 WABASH
2 3 SEVENTEENTH
3 4 OCTOBER
4 5
5

fstdraw --acceptor --isymbols=words.txt rand1.fst | dot -Tx11

[Figure: rand2.fst]

My demo is under /projects/speech/sys/kaldi-master/egs/rm/s5-sb2295/demo/fstrandgen and you can run it with source demo.sh
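
The uniform selection described in the documentation can be sketched in a few lines of Python. At each state the walker picks uniformly among the outgoing arcs, and stopping at a final state counts as one more option, which mirrors the transition to a super-final state. The function name `rand_path`, the data layout, and the toy grammar are assumptions for illustration; the toy grammar is not the G.fst used above.

```python
import random


def rand_path(arcs_by_state, finals, start=0, rng=None, max_length=1000):
    """Return the labels of one randomly generated successful path.

    arcs_by_state: {state: [(label, next_state), ...]}
    finals: set of final states.
    Returns None if a dead end is hit or max_length is exceeded.
    """
    rng = rng or random.Random()
    state, labels = start, []
    for _ in range(max_length):
        choices = list(arcs_by_state.get(state, []))
        if state in finals:
            choices.append(None)  # stopping here is one more uniform option
        if not choices:
            return None  # non-final state with no outgoing arcs
        pick = rng.choice(choices)
        if pick is None:
            return labels
        label, state = pick
        labels.append(label)
    return None


# Hypothetical toy grammar with two successful paths.
toy = {0: [("WILL", 1)], 1: [("WABASH", 2), ("OCTOBER", 2)]}
path = rand_path(toy, finals={2}, rng=random.Random(0))
print(path)
```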

fstintersect

fstintersect computes the intersection of two FSAs. Intersection means the same as in set theory: the intersection of A and B accepts exactly the strings that both FSA A and FSA B accept.

For example,

fstprint a.fst

[Figure: a]

0   1   1   1
1   2   2   2
1   3   3   3
2   4   0   0
3   4   0   0
4

fstprint b.fst

[Figure: b]

0   1   1   1
1   2   2   2
2

fstintersect a.fst b.fst | fstprint

[Figure: final]

0   1   1   1
1   2   2   2
2   3   0   0
3
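
The result above can be reproduced with a small product construction: each state of the intersection is a pair of states, one from each input, and an arc exists when both inputs can read the same label. The sketch below is a simplified illustration and not OpenFst's algorithm. It treats label 0 as epsilon and lets either side take an epsilon arc on its own, without the epsilon filtering that OpenFst applies, and it does not trim dead states. The function name `intersect` and the data layout are assumptions.

```python
from collections import deque

EPS = 0  # label 0 is epsilon, as in the listings above


def intersect(arcs_a, finals_a, arcs_b, finals_b, start_a=0, start_b=0):
    """Product construction for two acceptors.

    arcs_*: list of (src, dst, label); finals_*: set of final states.
    Returns (arcs, finals) with states numbered in order of discovery.
    """
    def outgoing(arcs):
        table = {}
        for src, dst, label in arcs:
            table.setdefault(src, []).append((label, dst))
        return table

    out_a, out_b = outgoing(arcs_a), outgoing(arcs_b)
    ids = {(start_a, start_b): 0}
    queue = deque(ids)
    arcs, finals = [], set()
    while queue:
        pair = queue.popleft()
        a, b = pair
        if a in finals_a and b in finals_b:
            finals.add(ids[pair])
        moves = []
        for label, dst_a in out_a.get(a, []):
            if label == EPS:
                moves.append((EPS, (dst_a, b)))  # A moves alone
            else:
                moves.extend((label, (dst_a, dst_b))
                             for other, dst_b in out_b.get(b, [])
                             if other == label)
        moves.extend((EPS, (a, dst_b))  # B moves alone
                     for other, dst_b in out_b.get(b, []) if other == EPS)
        for label, nxt in moves:
            if nxt not in ids:
                ids[nxt] = len(ids)
                queue.append(nxt)
            arcs.append((ids[pair], ids[nxt], label))
    return arcs, finals


a_arcs = [(0, 1, 1), (1, 2, 2), (1, 3, 3), (2, 4, 0), (3, 4, 0)]
b_arcs = [(0, 1, 1), (1, 2, 2)]
result_arcs, result_finals = intersect(a_arcs, {4}, b_arcs, {2})
for arc in result_arcs:
    print(*arc)
print(*sorted(result_finals))
```

For a.fst and b.fst this yields the same three arcs and final state as the fstintersect output above.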