Author Archives: Mats Rooth

Korean telephone conversation alignment

Korean telephone conversation alignment

This is one of the alignments that the Korean team reported on Wednesday Oct 26. Spot checks suggest the process worked. 대~한민국!

Verbmobil I alignment

VM1 phone alignment

This is one of the VM1 alignments that the German team reported on Friday, Oct. 21. Weiter so!

Spanish data update

Corpora for the Spanish Fisher-Callhome recipe are online, for instance /projects/ldc/ldc-standard-license/1996/LDC96T17/callhome_spanish_trans_970711. Paths have been added to the earlier post.

Korean data update

All of the LDC Korean data, except for the corpus from 1996, is on the server with files extracted. Paths have been added to the earlier post, e.g. /projects/ldc/ldc-standard-license/2003/LDC2003S0. At the outset we are concentrating on the data from 2003.

Kaldi-alignments-matlab

https://github.com/MatsRooth/Kaldi-alignments-matlab is a Matlab program that displays Kaldi alignments like the ones we looked at in class yesterday. Install it from GitHub. You also need to install the voicebox toolbox and put it on the Matlab path. Then open display_switch.m in the editor, run it from the GUI (with the default argument that is supplied), and accept the suggestion to switch the Matlab directory. Data from the LibriSpeech corpus are read from sub-directories that are included with the distribution.

The above worked for me on a different Mac from the one where I developed it. Try it; it’s a good way of checking qualitatively how well the alignment is working. We want to include code in the project directories on Kay that generates data files like the ones in the matlab-mat and matlab-dat sub-directories.
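For a quick check outside Matlab, here is a minimal Python sketch that reads alignments in CTM form (one row per segment: utterance, channel, start, duration, label), which is the layout Kaldi's ali-to-phones writes with --ctm-output. This is an assumption about how the alignments were exported; it is not the format of the matlab-mat or matlab-dat files, and the function names are made up for illustration. Unusually long segments are often a sign that an alignment has gone wrong.

```python
from collections import defaultdict


def read_ctm(lines):
    """Group CTM rows by utterance: utt -> sorted list of (start, end, label)."""
    segments = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed rows
        utt, _channel, start, dur, label = fields[:5]
        start = float(start)
        segments[utt].append((start, start + float(dur), label))
    for segs in segments.values():
        segs.sort()  # order segments by start time
    return dict(segments)


def longest_segments(segments, n=3):
    """Return the n longest segments as (duration, utterance, label) tuples."""
    rows = [
        (end - start, utt, label)
        for utt, segs in segments.items()
        for start, end, label in segs
    ]
    return sorted(rows, reverse=True)[:n]
```

To use it, pass an open file (or any iterable of lines) to read_ctm and look at what longest_segments returns. Labels are kept as strings, so integer phone ids and phone symbols both work.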

LDC licenses

Please print the LDC standard license from here.

https://confluence.cornell.edu/display/nlpldc/License+Agreement+Drop-Off+for+LDC+%28Linguistic+Data+Consortium%29+Corpora

At the bottom, write your name, netid, and signature. Then scan it and upload it back to Confluence. Finally, ask Stacy to add you to the en-cl-ldc group.

Language data roundup

We are almost set with initial language data for German, Korean, and Spanish.

For German, it’s Verbmobil I. The data on Kay include audio, orthographic transcripts, and phonetic transcripts. It seems to be all we need. See /projects/speech/corpus/VM1.
For Korean, Bruce has installed LDC2003S07 Korean Telephone Conversations Complete Set on Kay. It needs to be verified, but apparently it includes speech, orthographic transcript, and corresponding lexicon.
Correction: the combined corpus is not there yet, but the individual parts are. See these locations on Kay (lexicon, speech, transcript):
/projects/ldc/ldc-standard-license/2003/LDC2003L02
/projects/ldc/ldc-standard-license/2003/LDC2003S03
/projects/ldc/ldc-standard-license/2003/LDC2003T08
For Spanish, we are buying the CALLHOME lexicon, speech, and transcript. Fisher Spanish is already installed on Kay. In the Kaldi egs directory, there is already a script for Fisher Spanish+Callhome. Rather than pretending it’s not done already, let’s focus on building on it to add more data, improve the morphology, and so on.
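The locations above can be checked from a login on Kay with a short Python sketch like the following. The paths are copied from this post; the dictionary keys and the function name are only illustrative, and the script assumes it runs on a machine where /projects is mounted.

```python
from pathlib import Path

# Paths as listed in this post; the keys are informal labels.
CORPUS_DIRS = {
    "German VM1": "/projects/speech/corpus/VM1",
    "Korean lexicon": "/projects/ldc/ldc-standard-license/2003/LDC2003L02",
    "Korean speech": "/projects/ldc/ldc-standard-license/2003/LDC2003S03",
    "Korean transcript": "/projects/ldc/ldc-standard-license/2003/LDC2003T08",
}


def missing_dirs(dirs):
    """Return the labels whose path is not an existing directory."""
    return sorted(
        label for label, path in dirs.items() if not Path(path).is_dir()
    )
```

Calling missing_dirs(CORPUS_DIRS) returns an empty list when every directory is visible, and otherwise the labels that need attention.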

Naomi Enzinna has joined the Spanish group. Prof. Jiwon Yun from Stony Brook will collaborate on Korean. She already has a good initial solution for grapheme-to-phoneme conversion, which could make it possible to include the Korean Broadcast News/Transcripts, where we need the phonetic lexicon.

German — Verbmobil I

We’ve obtained German data from this 1990s speech translation project, courtesy of the BAS CLARIN repository. Orthographic and phonetic transcripts are included. See
/projects/speech/corpus/VM1.

LDC member years for Cornell

These are Cornell’s membership years in LDC. For subscription years, a corpus costs nothing additional, and is most likely already in the lab. For standard years, there may or may not be a cost, depending on how many corpora have already been obtained. For other years, we get the data for half price, which still tends to be a lot. So it is good to start with recent years.

Membership Years

    1995 (Not-for-Profit, Standard)
    1998 (Not-for-Profit, Standard)
    1999 (Not-for-Profit, Standard)
    2000 (Not-for-Profit, Standard)
    2002 (Not-for-Profit, Standard)
    2003 (Not-for-Profit, Standard)
    2004 (Not-for-Profit, Standard)
    2005 (Not-for-Profit, Subscription)
    2006 (Not-for-Profit, Standard)
    2007 (Not-for-Profit, Standard)
    2008 (Not-for-Profit, Subscription)
    2009 (Not-for-Profit, Subscription)
    2010 (Not-for-Profit, Subscription)
    2011 (Not-for-Profit, Subscription)
    2012 (Not-for-Profit, Subscription)
    2013 (Not-for-Profit, Subscription)
    2014 (Not-for-Profit, Subscription)
    2015 (Not-for-Profit, Subscription)
    2016 (Not-for-Profit, Subscription)
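As a rough aid, the rule above can be written as a lookup from an LDC catalog ID to a cost category. This is a sketch under two assumptions that are not stated in the post: that the year embedded in the catalog ID is the relevant membership year, and that two-digit years in older IDs (as in LDC96T17) mean 19xx. The function names are illustrative.

```python
import re

# Membership years from the list above.
STANDARD = {1995, 1998, 1999, 2000, 2002, 2003, 2004, 2006, 2007}
SUBSCRIPTION = {2005} | set(range(2008, 2017))


def catalog_year(catalog_id):
    """Extract the year from an ID such as LDC2003S07 or LDC96T17."""
    match = re.fullmatch(r"LDC(\d{4}|\d{2})[A-Z]\d+", catalog_id)
    if match is None:
        raise ValueError(f"unrecognized catalog ID: {catalog_id}")
    digits = match.group(1)
    # Assumption: two-digit years belong to the 1990s.
    return int(digits) if len(digits) == 4 else 1900 + int(digits)


def cost_category(catalog_id):
    """Classify a corpus by the membership type of its catalog year."""
    year = catalog_year(catalog_id)
    if year in SUBSCRIPTION:
        return "subscription year: no additional cost"
    if year in STANDARD:
        return "standard year: possible cost"
    return "other year: half price"
```

For example, cost_category("LDC2003S07") falls in the standard category, while cost_category("LDC96T17") falls in the half-price category because 1996 is not a membership year.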

German data

These are German speech and associated data available from LDC. The speech data for HUB5 German is CALLHOME German Speech.

Finite State Methods

LING 4485/6485 Fall 2016
