
Language data roundup

We are almost set with initial language data for German, Korean, and Spanish.

  1. For German, it’s Verbmobil I. The data on Kay include audio, orthographic transcripts, and phonetic transcripts, which seems to be all we need. See /projects/speech/corpus/VM1.
  2. For Korean, Bruce has installed
    LDC2003S07 Korean Telephone Conversations Complete Set on Kay.
    It needs to be verified, but apparently it includes speech, orthographic transcript, and corresponding lexicon.
    Correction: the combined corpus is not there yet, but the individual parts are. See these locations on kay (lexicon, speech, transcript).

    /projects/ldc/ldc-standard-license/2003/LDC2003L02
    /projects/ldc/ldc-standard-license/2003/LDC2003S03
    /projects/ldc/ldc-standard-license/2003/LDC2003T08

  3. For Spanish, we are buying the CALLHOME lexicon, speech, and transcript. Fisher Spanish is already installed on kay. In the egs directory, there is already a script for Fisher Spanish+Callhome. Rather than pretending it’s not done already, let’s focus on building on it: adding more data, improving the morphology, and so on.

Naomi Enzinna has joined the Spanish group. Prof. Jiwon Yun from Stony Brook will collaborate on Korean. She already has a good initial solution for grapheme-to-phoneme conversion, which could make it possible to include the Korean Broadcast News Speech and Transcripts, for which we still need a phonetic lexicon.

LDC member years for Cornell

These are Cornell’s membership years in LDC. For subscription years, the corpus costs nothing additional, and is most likely already in the lab. For standard years, there may or may not be a cost, depending on how many corpora have already been obtained. For other years, we get the data for 1/2 price, which still tends to be a lot. So it is good to start with recent years.

Membership Years

    1995 (Not-for-Profit, Standard)
    1998 (Not-for-Profit, Standard)
    1999 (Not-for-Profit, Standard)
    2000 (Not-for-Profit, Standard)
    2002 (Not-for-Profit, Standard)
    2003 (Not-for-Profit, Standard)
    2004 (Not-for-Profit, Standard)
    2005 (Not-for-Profit, Subscription)
    2006 (Not-for-Profit, Standard)
    2007 (Not-for-Profit, Standard)
    2008 (Not-for-Profit, Subscription)
    2009 (Not-for-Profit, Subscription)
    2010 (Not-for-Profit, Subscription)
    2011 (Not-for-Profit, Subscription)
    2012 (Not-for-Profit, Subscription)
    2013 (Not-for-Profit, Subscription)
    2014 (Not-for-Profit, Subscription)
    2015 (Not-for-Profit, Subscription)
    2016 (Not-for-Profit, Subscription)
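
The pricing rule above can be sketched as a small lookup. This is only an illustration under two assumptions of mine: that the membership year that matters is the publication year encoded in the catalog ID (LDC2003S07 is 2003, LDC96S54 is 1996), and that two-digit years always mean 19xx. The names `catalog_year` and `cost_category` are made up for this sketch.

```python
import re

# Membership years from the list above; the value is the membership type.
MEMBERSHIP = {
    1995: "Standard", 1998: "Standard", 1999: "Standard", 2000: "Standard",
    2002: "Standard", 2003: "Standard", 2004: "Standard", 2005: "Subscription",
    2006: "Standard", 2007: "Standard",
    **{year: "Subscription" for year in range(2008, 2017)},
}

def catalog_year(catalog_id):
    """Read the year out of an ID such as LDC2003S07 or LDC96S54 (assumed format)."""
    match = re.fullmatch(r"LDC(\d{4}|\d{2})[A-Z]\d+", catalog_id)
    if match is None:
        raise ValueError(f"unrecognized catalog ID: {catalog_id}")
    digits = match.group(1)
    # Assumption: two-digit years are from the 1990s.
    return int(digits) if len(digits) == 4 else 1900 + int(digits)

def cost_category(catalog_id):
    """Map a catalog ID to the rough cost category described in the text."""
    kind = MEMBERSHIP.get(catalog_year(catalog_id))
    if kind == "Subscription":
        return "no additional cost"
    if kind == "Standard":
        return "possible cost, depending on corpora already obtained"
    return "half price"

for cid in ["LDC2010S01", "LDC2003S07", "LDC96S54"]:
    print(cid, "->", cost_category(cid))
```

Under these assumptions, Fisher Spanish Speech (2010) falls in a subscription year, the Korean Telephone Conversations set (2003) in a standard year, and CALLFRIEND Korean (1996) in a non-member year.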

Korean data

These are Korean speech and associated data available from LDC. There are duplications. I included the finite state morphology and morphologically annotated text.

  1. LDC2006S42 Korean Broadcast News Speech
    /projects/ldc/ldc-standard-license/2006/LDC2006S42
  2. LDC2006T14 Korean Broadcast News Transcripts
    /projects/ldc/ldc-standard-license/2006/LDC2006T14
  3. LDC2006S36 West Point Korean Speech
    /projects/ldc/ldc-standard-license/2006/LDC2006S36
  4. LDC2004L01 Klex: Finite-State Lexical Transducer for Korean
    /projects/ldc/ldc-standard-license/2004/LDC2004L01
  5. LDC2004T03 Morphologically Annotated Korean Text
    /projects/ldc/ldc-standard-license/2004/LDC2004T03
  6. LDC2003S07 Korean Telephone Conversations Complete Set
    /projects/ldc/ldc-standard-license/2003/LDC2003S07
  7. LDC2003L02 Korean Telephone Conversations Lexicon
    /projects/ldc/ldc-standard-license/2003/LDC2003L02
  8. LDC2003S03 Korean Telephone Conversations Speech
    /projects/ldc/ldc-standard-license/2003/LDC2003S03
  9. LDC2003T08 Korean Telephone Conversations Transcripts
    /projects/ldc/ldc-standard-license/2003/LDC2003T08
  10. LDC96S54 CALLFRIEND Korean
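
To verify which of these are actually on the server, something like the following could be run on kay. It is a minimal sketch that assumes every corpus sits at `<root>/<year>/<catalog ID>`, which is only inferred from the paths listed above and may not hold for all corpora; `expected_path` and `missing_corpora` are hypothetical helper names.

```python
import os
import re

# Root inferred from the paths listed above.
LDC_ROOT = "/projects/ldc/ldc-standard-license"

def expected_path(catalog_id, root=LDC_ROOT):
    """Build the directory where a corpus would be, assuming root/year/ID."""
    match = re.fullmatch(r"LDC(\d{4}|\d{2})[A-Z]\d+", catalog_id)
    if match is None:
        raise ValueError(f"unrecognized catalog ID: {catalog_id}")
    digits = match.group(1)
    # Assumption: two-digit years (e.g. LDC96S54) are from the 1990s.
    year = int(digits) if len(digits) == 4 else 1900 + int(digits)
    return f"{root}/{year}/{catalog_id}"

def missing_corpora(catalog_ids, root=LDC_ROOT):
    """Return the IDs whose expected directory does not exist."""
    return [cid for cid in catalog_ids
            if not os.path.isdir(expected_path(cid, root))]

if __name__ == "__main__":
    korean = ["LDC2006S42", "LDC2006T14", "LDC2006S36", "LDC2004L01",
              "LDC2004T03", "LDC2003S07", "LDC2003L02", "LDC2003S03",
              "LDC2003T08", "LDC96S54"]
    for cid in missing_corpora(korean):
        print("not found:", expected_path(cid))
```

A directory that exists is not proof that the corpus is complete, so this only catches the case where nothing has been installed at the expected location.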

Spanish data

These are Spanish speech and associated data available from LDC. There appear to be duplications, so we should work back from the later publications.

  1. LDC2014T23 Fisher and CALLHOME Spanish–English Speech Translation (not on server)
  2. LDC2010T04 Fisher Spanish – Transcripts
    /projects/ldc/ldc-standard-license/2010/LDC2010T04
  3. LDC2010S01 Fisher Spanish Speech
    /projects/ldc/ldc-standard-license/2010/LDC2010S01
  4. LDC2006S37 West Point Heroico Spanish Speech (not on server)
  5. 1997 HUB5 Spanish Transcripts (not on server)
  6. LDC2002S25 1997 HUB5 Spanish Evaluation
  7. LDC2001T61 CALLHOME Spanish Dialogue Act Annotation
  8. LDC98S74 1997 Spanish Broadcast News Speech (HUB4-NE)
  9. 1997 Spanish Broadcast News Transcripts (HUB4-NE)
  10. LDC98T29 HUB5 Spanish Telephone Speech Corpus
  11. LDC98T27 HUB5 Spanish Transcripts
  12. LDC96S57 CALLFRIEND Spanish-Caribbean Dialect
  13. LDC96S58 CALLFRIEND Spanish-Non-Caribbean Dialect
  14. LDC96L16 CALLHOME Spanish Lexicon
    /projects/ldc/ldc-standard-license/1996/LDC96L16
  15. LDC96S35 CALLHOME Spanish Speech
    /projects/ldc/ldc-standard-license/1996/LDC96S35
  16. LDC96T17 CALLHOME Spanish Transcripts
    /projects/ldc/ldc-standard-license/1996/LDC96T17
  17. LDC96S57 CALLFRIEND Spanish-Caribbean Dialect
  18. LDC96S58 CALLFRIEND Spanish-Non-Caribbean Dialect
  19. LDC95S28 LATINO-40 Spanish Read News

Portuguese Corpora

We have four running-speech Portuguese corpora available:

  • The West Point Brazilian Portuguese Speech corpus LDC2008S04 (Morgan, Ackerlind & Packer, 2008);
  • The CSLU Spoltech Brazilian Portuguese 1.0 LDC2006S16 (Schramm et al., 2006);
  • C-ORAL Brasil;
  • CLUL Spoken Portuguese Corpus, Geographical and Social Varieties (Casteleiro et al., n.d.).

Specific data about each to follow shortly…