Author Archives: Mats Rooth

Korean data update

All of the LDC Korean data, except for the corpus from 1996, is now on the server with files extracted. Paths have been added to the earlier post, e.g. /projects/ldc/ldc-standard-license/2003/LDC2003S0. At the outset we are concentrating on the data from 2003.

Kaldi-alignments-matlab

https://github.com/MatsRooth/Kaldi-alignments-matlab is a Matlab program that displays Kaldi alignments like the ones we looked at in class yesterday. Install it from GitHub. You also need to install the voicebox toolbox and put it on the Matlab path. Open display_switch.m in the editor, run it from the GUI (with the default argument that is supplied) and accept the suggestion to switch the Matlab directory. Data from the Librispeech corpus are read from sub-directories that are included with the distribution.

The above worked for me on a Mac different from the one where I developed it. Try it; it’s a good way of checking qualitatively how well the alignment is working. We want to include code in the project directories on Kay that generates data files like the ones in the matlab-mat and matlab-dat sub-directories.

Language data roundup

We are almost set with initial language data for German, Korean, and Spanish.

  1. For German, it’s Verbmobil I. The data on Kay include audio, orthographic transcripts, and phonetic transcripts. It seems to be all we need. See /projects/speech/corpus/VM1.
  2. For Korean, Bruce has installed
    LDC2003S07 Korean Telephone Conversations Complete Set on Kay.
    It needs to be verified, but apparently it includes speech, orthographic transcript, and corresponding lexicon.
    Correction: the combined corpus is not there yet, but the individual parts are. See these locations on Kay (lexicon, speech, transcript).

    /projects/ldc/ldc-standard-license/2003/LDC2003L02
    /projects/ldc/ldc-standard-license/2003/LDC2003S03
    /projects/ldc/ldc-standard-license/2003/LDC2003T08

  3. For Spanish, we are buying the CALLHOME lexicon, speech, and transcript. Fisher Spanish is already installed on Kay. In the egs directory, there is already a script for Fisher Spanish+Callhome. Rather than pretending it’s not done already, let’s focus on building on it to add more data, improve the morphology, and so on.
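
The three Korean locations in item 2 share one pattern: a root directory, the year, and the LDC catalog ID. Here is a minimal Python sketch of that pattern, in case it is useful in scripts. The function name corpus_path and the reading of the type letters (L, S, T as lexicon, speech, transcript) are assumptions based only on the three paths listed above; corpora from other years may live under a different root.

```python
import re
from pathlib import PurePosixPath

# Root taken from the three Korean paths listed above; other license
# types or years may use a different directory (assumption).
LDC_ROOT = PurePosixPath("/projects/ldc/ldc-standard-license")

# Catalog IDs such as LDC2003S03: year, one type letter, two-digit serial.
CATALOG_ID = re.compile(r"^LDC(\d{4})([A-Z])(\d{2})$")

# Type letters as they appear in the list above.
KIND = {"L": "lexicon", "S": "speech", "T": "transcript"}


def corpus_path(catalog_id, root=LDC_ROOT):
    """Return (path, kind) for a catalog ID, following the pattern above."""
    match = CATALOG_ID.match(catalog_id)
    if match is None:
        raise ValueError(f"not a complete LDC catalog ID: {catalog_id!r}")
    year, letter, _serial = match.groups()
    return root / year / catalog_id, KIND.get(letter, "unknown")


if __name__ == "__main__":
    for cid in ("LDC2003L02", "LDC2003S03", "LDC2003T08"):
        path, kind = corpus_path(cid)
        print(f"{kind:10s} {path}")
```

This only builds the path string. It does not check that the directory exists on the server.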

Naomi Enzinna has joined the Spanish group. Prof. Jiwon Yun from Stony Brook will collaborate on Korean. She already has a good initial solution for grapheme to phoneme conversion, which could make it possible to include the Korean Broadcast News/Transcripts, where we need the phonetic lexicon.

LDC member years for Cornell

These are Cornell’s membership years in LDC. For subscription years, the corpus costs nothing additional, and is most likely already in the lab. For standard years, there may be a cost or not, depending on how many corpora have already been obtained. For other years, we get the data for half price, which tends to be a lot. So it is good to start with recent years.

Membership Years

    1995 (Not-for-Profit, Standard)
    1998 (Not-for-Profit, Standard)
    1999 (Not-for-Profit, Standard)
    2000 (Not-for-Profit, Standard)
    2002 (Not-for-Profit, Standard)
    2003 (Not-for-Profit, Standard)
    2004 (Not-for-Profit, Standard)
    2005 (Not-for-Profit, Subscription)
    2006 (Not-for-Profit, Standard)
    2007 (Not-for-Profit, Standard)
    2008 (Not-for-Profit, Subscription)
    2009 (Not-for-Profit, Subscription)
    2010 (Not-for-Profit, Subscription)
    2011 (Not-for-Profit, Subscription)
    2012 (Not-for-Profit, Subscription)
    2013 (Not-for-Profit, Subscription)
    2014 (Not-for-Profit, Subscription)
    2015 (Not-for-Profit, Subscription)
    2016 (Not-for-Profit, Subscription)
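
To check quickly which case a corpus falls under, the rule above can be written as a small lookup. This is only a sketch in Python: the year sets are copied from the list, the function name cost_note is made up for illustration, and the returned strings paraphrase the rule rather than give actual prices.

```python
# Membership years copied from the list above.
STANDARD_YEARS = {1995, 1998, 1999, 2000, 2002, 2003, 2004, 2006, 2007}
SUBSCRIPTION_YEARS = {2005} | set(range(2008, 2017))  # 2005 and 2008-2016


def cost_note(corpus_year):
    """Summarize what the membership rule says about a corpus release year."""
    if corpus_year in SUBSCRIPTION_YEARS:
        return "subscription year: no additional cost, probably already in the lab"
    if corpus_year in STANDARD_YEARS:
        return "standard year: may or may not cost extra"
    return "non-member year: half price"


if __name__ == "__main__":
    for year in (1996, 2003, 2005, 2016):
        print(year, "->", cost_note(year))
```

For example, the 2003 Korean corpora fall in a standard year, while a corpus from 1996 falls in a non-member year.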