We are almost set with initial language data for German, Korean, and Spanish.
- For German, it’s Verbmobil I. The data on kay include audio, orthographic transcripts, and phonetic transcripts. It seems to be all we need. See
/projects/speech/corpus/VM1
- For Korean, Bruce has installed
LDC2003S07 Korean Telephone Conversations Complete Set on kay.
This still needs to be verified, but it apparently includes speech, orthographic transcripts, and the corresponding lexicon.
Correction: the combined corpus is not there yet, but the individual parts are. See these locations on kay (lexicon, speech, transcript).
/projects/ldc/ldc-standard-license/2003/LDC2003L02
/projects/ldc/ldc-standard-license/2003/LDC2003S03
/projects/ldc/ldc-standard-license/2003/LDC2003T08
- For Spanish, we are buying the CALLHOME lexicon, speech, and transcripts. Fisher Spanish is already installed on kay. The egs directory already has a script for Fisher Spanish + CALLHOME. Rather than pretending it’s not done already, let’s focus on building on it: adding more data, improving the morphology, and so on.
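Since some of the corpus locations listed above still need to be verified, a quick existence check may help. The following is only a minimal sketch: it assumes it is run on kay, uses just the paths given in this note, and the helper name `missing_parts` is made up for illustration. It checks that each directory is present, not that its contents are complete.

```python
from pathlib import Path

# Paths copied from this note; the labels are informal, not official names.
CORPUS_PARTS = {
    "German Verbmobil I": "/projects/speech/corpus/VM1",
    "Korean lexicon": "/projects/ldc/ldc-standard-license/2003/LDC2003L02",
    "Korean speech": "/projects/ldc/ldc-standard-license/2003/LDC2003S03",
    "Korean transcript": "/projects/ldc/ldc-standard-license/2003/LDC2003T08",
}


def missing_parts(parts):
    """Return the labels whose path is not an existing directory."""
    return [label for label, path in parts.items() if not Path(path).is_dir()]


if __name__ == "__main__":
    missing = missing_parts(CORPUS_PARTS)
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All listed corpus directories are present.")
```

An empty result only means the directories exist; whether each one holds the expected audio, transcripts, or lexicon still has to be checked by hand.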
Naomi Enzinna has joined the Spanish group. Prof. Jiwon Yun from Stony Brook will collaborate on Korean. She already has a good initial solution for grapheme-to-phoneme conversion, which could make it possible to include the Korean Broadcast News/Transcripts, for which we need a phonetic lexicon.
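For anyone unfamiliar with what grapheme-to-phoneme conversion for Korean starts from, here is a small illustration of one common first step: splitting precomposed Hangul syllables into their component jamo. This is not Prof. Yun's solution, which is not described in this note; the function name `to_jamo` is hypothetical, and a real converter would still need pronunciation rules applied on top of the jamo sequence.

```python
import unicodedata


def to_jamo(text):
    """Split text into characters after Unicode canonical decomposition (NFD).

    For precomposed Hangul syllables this yields the conjoining jamo
    (initial consonant, vowel, optional final consonant).
    Note: NFD also decomposes other composed characters, e.g. accented Latin.
    """
    return list(unicodedata.normalize("NFD", text))


if __name__ == "__main__":
    # Print the code points of the jamo for a sample word.
    for jamo in to_jamo("한국"):
        print(f"U+{ord(jamo):04X}")
```

The jamo sequence is only an orthographic decomposition; sound changes across syllable boundaries are what an actual grapheme-to-phoneme system has to model.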