This is one of the alignments that the Korean team reported on Wednesday, Oct. 26. Spot checks suggest the process worked. Go Korea!
Author Archives: Mats Rooth
Verbmobil I alignment
This is one of the VM1 alignments that the German team reported on Friday, Oct. 21. Keep it up!
Spanish data update
Corpora for the Spanish Fisher-Callhome recipe are online, for instance /projects/ldc/ldc-standard-license/1996/LDC96T17/callhome_spanish_trans_970711. Paths have been added to the earlier post.
Korean data update
The LDC Korean data, except for the corpus from 1996, is all on the server with files extracted. Paths such as /projects/ldc/ldc-standard-license/2003/LDC2003S0 have been added to the earlier post. At the outset we are concentrating on the data from 2003.
Kaldi-alignments-matlab
https://github.com/MatsRooth/Kaldi-alignments-matlab is a Matlab program that displays Kaldi alignments like the ones we looked at in class yesterday. Install it from GitHub. You also need to install the voicebox toolbox and put it on the Matlab path. Open display_switch.m in the editor, run it from the GUI (with the default argument that is supplied) and accept the suggestion to switch the Matlab directory. Data from the Librispeech corpus are read from sub-directories that are included with the distribution.
The above worked for me on a Mac different from the one where I developed it. Try it; it is a good way of checking qualitatively how well the alignment is working. We want to include code in the project directories on Kay that generates data files like the ones in the matlab-mat and matlab-dat sub-directories.
LDC licenses
Please print the LDC standard license from here.
At the bottom, write your name, netid and signature. Then scan it and upload it to Confluence. Finally, ask Stacy to add you to the en-cl-ldc group.
Language data roundup
We are almost set with initial language data for German, Korean, and Spanish.
- For German, it’s Verbmobil I. The data on Kay include audio, orthographic transcripts, and phonetic transcripts. It seems to be all we need. See /projects/speech/corpus/VM1.
- For Korean, Bruce has installed LDC2003S07 Korean Telephone Conversations Complete Set on Kay. It needs to be verified, but apparently it includes speech, orthographic transcript, and corresponding lexicon.
Correction: the combined corpus is not there yet, but the individual parts are. See these locations on Kay (lexicon, speech, transcript).
/projects/ldc/ldc-standard-license/2003/LDC2003L02
/projects/ldc/ldc-standard-license/2003/LDC2003S03
/projects/ldc/ldc-standard-license/2003/LDC2003T08
- For Spanish, we are buying the CALLHOME lexicon, speech, and transcript. Fisher Spanish is already installed on Kay. In the egs directory, there is already a script for Fisher Spanish+Callhome. Rather than pretending it’s not done already, let’s focus on building on it: adding more data, improving the morphology, and so on.
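Since the Korean installation still needs to be verified, a quick check that the three locations listed above exist and contain files could look like the following. This is only a sketch: the pairing of each path with lexicon, speech, or transcript follows the order given in the post, and the function name `check_parts` is made up for illustration. It checks only that the directories are present and non-empty, not that their contents are complete.

```python
from pathlib import Path

# Locations copied from the post; the role labels follow the order
# "lexicon, speech, transcript" given there (an assumption).
KOREAN_PARTS = {
    "lexicon": "/projects/ldc/ldc-standard-license/2003/LDC2003L02",
    "speech": "/projects/ldc/ldc-standard-license/2003/LDC2003S03",
    "transcript": "/projects/ldc/ldc-standard-license/2003/LDC2003T08",
}


def check_parts(parts):
    """Map each role to True if its directory exists and is non-empty."""
    status = {}
    for role, location in parts.items():
        directory = Path(location)
        # any() stops at the first entry, so large corpora are not fully listed.
        status[role] = directory.is_dir() and any(directory.iterdir())
    return status


if __name__ == "__main__":
    for role, ok in check_parts(KOREAN_PARTS).items():
        print(f"{role:10s} {'found' if ok else 'MISSING'}")
```

Run it on Kay itself; on any other machine all three parts will be reported as missing.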
Naomi Enzinna has joined the Spanish group. Prof. Jiwon Yun from Stony Brook will collaborate on Korean. She already has a good initial solution for grapheme to phoneme conversion, which could make it possible to include the Korean Broadcast News/Transcripts, where we need the phonetic lexicon.
German — Verbmobil I
We’ve obtained German data from this 1990s speech translation project, courtesy of the BAS CLARIN repository. Orthographic and phonetic transcripts are included. See /projects/speech/corpus/VM1.
LDC member years for Cornell
These are Cornell’s membership years in LDC. For subscription years, a corpus costs nothing additional, and is most likely already in the lab. For standard years, there may or may not be a cost, depending on how many corpora have already been obtained. For other years, we get the data at half price, which still tends to be a lot. So it is good to start with recent years.
Membership Years
- 1995 (Not-for-Profit, Standard)
- 1998 (Not-for-Profit, Standard)
- 1999 (Not-for-Profit, Standard)
- 2000 (Not-for-Profit, Standard)
- 2002 (Not-for-Profit, Standard)
- 2003 (Not-for-Profit, Standard)
- 2004 (Not-for-Profit, Standard)
- 2005 (Not-for-Profit, Subscription)
- 2006 (Not-for-Profit, Standard)
- 2007 (Not-for-Profit, Standard)
- 2008 (Not-for-Profit, Subscription)
- 2009 (Not-for-Profit, Subscription)
- 2010 (Not-for-Profit, Subscription)
- 2011 (Not-for-Profit, Subscription)
- 2012 (Not-for-Profit, Subscription)
- 2013 (Not-for-Profit, Subscription)
- 2014 (Not-for-Profit, Subscription)
- 2015 (Not-for-Profit, Subscription)
- 2016 (Not-for-Profit, Subscription)
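The pricing rule and the membership list above can be sketched as a small lookup, which may help when deciding which corpus to request first. The function names are made up for illustration. The sketch assumes that the year embedded in an LDC catalog ID (for example LDC2003S07) is the membership year that applies to the corpus, and that two-digit years in older IDs are 19xx.

```python
import re

# Cornell's LDC membership types by year, copied from the list above.
STANDARD_YEARS = {1995, 1998, 1999, 2000, 2002, 2003, 2004, 2006, 2007}
SUBSCRIPTION_YEARS = {2005} | set(range(2008, 2017))


def catalog_year(ldc_id):
    """Extract the year from a catalog ID such as LDC2003S07 or LDC96T17."""
    match = re.fullmatch(r"LDC(\d{4}|\d{2})[A-Z]\d+", ldc_id)
    if match is None:
        raise ValueError(f"not a recognized LDC catalog ID: {ldc_id!r}")
    digits = match.group(1)
    # Assumption: two-digit years belong to the 1990s.
    return int(digits) if len(digits) == 4 else 1900 + int(digits)


def cost_category(ldc_id):
    """Classify a corpus by the membership type of its catalog year."""
    year = catalog_year(ldc_id)
    if year in SUBSCRIPTION_YEARS:
        return "subscription year: no additional cost"
    if year in STANDARD_YEARS:
        return "standard year: may or may not cost extra"
    return "non-member year: half price"


if __name__ == "__main__":
    print("LDC2003S07:", cost_category("LDC2003S07"))
```

For LDC2003S07 this reports a standard year, so whether it costs extra depends on how many corpora from 2003 have already been obtained.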
German data
These are German speech and associated data available from LDC. The speech data for HUB5 German is CALLHOME German Speech.