


Hidden Markov Models and Text Translation

Over the last five years, there has been substantial improvement in Google's translation mechanism. Today, Google Translate supports over 64 languages, and the quality of its translations seems to improve every year. Google Translate can translate large bodies of text quite accurately and is very useful for people traveling around the world. Obviously, there isn't a person at Google who looks through all of the possible combinations of words across various languages by hand and determines which word translates to which. Instead, Google does what it does best: it uses thousands of machines to determine which translations are best suited from one language to another. How is Google able to automate the process and still provide the translations people are looking for?

As one would expect, the engineers at Google use many sophisticated machine learning algorithms and cutting-edge research to solve the text translation problem. However, at the heart of many of these algorithms are concepts from basic probability theory. A machine learning algorithm is, at its core, very simple. It is given a large set of training data that is correctly labeled, in this case, correctly translated from one language to another. Using this data and a set of features that describe it, when asked to classify a new example, the algorithm simply asks the following question: "What is the best classification for this new example, given what I know about the problem from the training examples?"

One way to do this is by using a Hidden Markov Model (HMM). Given some sentence $x=(x_1,x_2,\ldots,x_l)$ in the language you want to translate from, you want to predict the most likely sentence $y=(y_1,\ldots,y_l)$ in the language you want to translate to. A simple first-order HMM does this by using two types of conditional probabilities: $\Pr[x_i|y_i]$, known as "emission probabilities", and $\Pr[y_i|y_{i-1}]$, known as "transition probabilities". The objective of a first-order HMM is to find the translation $y$ of $x$ given by $\arg\max_{y \in Y} \Pr[y_1]\,\Pr[x_1|y_1]\prod_{i=2}^{l} \Pr[x_i|y_i]\,\Pr[y_i|y_{i-1}]$. In words: to translate a sequence of $l$ words from language 1 to language 2, the HMM considers every possible sequence of $l$ words in language 2. For each candidate sequence it computes the probability that the sequence is the correct translation, assuming the $i$th word of language 1 is emitted by the $i$th word of the candidate and that word $i-1$ came directly before it. It then outputs the sequence of words that maximizes this probability as the translation. Since we don't know these conditional probabilities, they are simply estimated by counting over the training data we are provided.
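To make the counting step concrete, here is a minimal Python sketch of how the starting, emission, and transition probabilities could be estimated. It assumes the training data arrives as word-aligned sentence pairs of equal length, which is a simplification; real systems also have to learn the word alignments themselves.

```python
from collections import defaultdict

def estimate_hmm_probabilities(aligned_pairs):
    """Estimate Pr[y_1], Pr[x_i | y_i], and Pr[y_i | y_{i-1}] by counting
    over word-aligned sentence pairs (x, y) of equal length."""
    start_counts = defaultdict(int)
    emission_counts = defaultdict(lambda: defaultdict(int))
    transition_counts = defaultdict(lambda: defaultdict(int))

    for x, y in aligned_pairs:                        # x, y: lists of words
        start_counts[y[0]] += 1
        for i, (xi, yi) in enumerate(zip(x, y)):
            emission_counts[yi][xi] += 1              # yi "emits" xi
            if i > 0:
                transition_counts[y[i - 1]][yi] += 1  # yi follows y[i-1]

    def normalize(table):
        return {key: {w: c / sum(row.values()) for w, c in row.items()}
                for key, row in table.items()}

    total = sum(start_counts.values())
    start_probs = {w: c / total for w, c in start_counts.items()}
    return start_probs, normalize(emission_counts), normalize(transition_counts)
```

Feeding it a corpus of, say, (Spanish sentence, English sentence) pairs would produce tables where the emission entry for ("dog", "perro") is the estimated probability that the English word "dog" corresponds to the Spanish word "perro".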

There are two very big problems with the way first-order Hidden Markov Models work. The first is that the number of possible sequences of length $l$ in a language is exponential in $l$. This poses a computational challenge, because a brute-force approach would make it nearly impossible to translate a sentence of length 50 or 100. However, there is a simple workaround using dynamic programming (the Viterbi algorithm, sketched below). The other problem is that a first-order HMM only looks at the immediately preceding word when making a translation decision. This means it doesn't take sentence structure into consideration, or the fact that grammatical rules differ widely from one language to another. Therefore, a sequence of words that makes sense in one language, if translated word for word into another, often would not make sense.
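For reference, here is a minimal sketch of that dynamic-programming workaround, the Viterbi algorithm. It assumes the probability tables come from a counting step like the one sketched above, and it treats any unseen word pair as having probability zero, which a real system would smooth.

```python
def viterbi(x, states, start_probs, emission_probs, transition_probs):
    """Most likely hidden sequence y for the observed sentence x, found in
    O(len(x) * len(states)**2) time instead of checking all
    len(states)**len(x) candidate sequences."""
    # best[i][s]: probability of the best sequence of length i+1 ending in state s
    best = [{s: start_probs.get(s, 0.0) * emission_probs.get(s, {}).get(x[0], 0.0)
             for s in states}]
    backpointers = [{}]

    for i in range(1, len(x)):
        best.append({})
        backpointers.append({})
        for s in states:
            prob, prev = max(
                (best[i - 1][p]
                 * transition_probs.get(p, {}).get(s, 0.0)
                 * emission_probs.get(s, {}).get(x[i], 0.0), p)
                for p in states)
            best[i][s] = prob
            backpointers[i][s] = prev

    # Pick the best final state, then walk the backpointers to recover the sequence.
    last = max(best[-1], key=best[-1].get)
    y = [last]
    for i in range(len(x) - 1, 0, -1):
        y.append(backpointers[i][y[-1]])
    return list(reversed(y))
```

For translation, the state set would be the target-language vocabulary, so the quadratic factor is still large, but the cost is polynomial rather than exponential in the sentence length.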

We can try to counteract the second problem by using higher-order Hidden Markov Models, which look at a longer sequence of past decisions when making a choice for the current word. But this makes estimating the conditional probabilities much harder, because there are far more of them to estimate and there might not be enough data available to estimate them accurately. Instead, what Google does is a bit more complex. They look at the underlying structure of the sentences and try to determine how "lossy" a translation is, that is, how much of the original meaning of the sentence is lost when translating it one way versus another. Then they apply a cost function that penalizes the algorithm for producing lossier translations. In this way, Google can produce a translation that preserves the meaning of the sentence even if each individual word isn't translated as accurately as possible.
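To see why the data requirements blow up, consider a rough illustration: with a target vocabulary of $V$ words, a first-order model needs on the order of $V^2$ transition probabilities $\Pr[y_i|y_{i-1}]$, while a second-order model needs on the order of $V^3$ probabilities of the form $\Pr[y_i|y_{i-1},y_{i-2}]$. Even for a modest vocabulary of $V=10{,}000$ words, that is a jump from roughly $10^8$ to $10^{12}$ parameters, and the vast majority of the corresponding word triples will never appear even once in a training corpus, so their probabilities cannot be estimated by counting.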

Furthermore, they experiment with various parameters to figure out which features of text are the most important, and use part-of-speech tagging to distinguish between nouns, verbs, and other grammatical constructs. Part-of-speech tagging can also be done using Hidden Markov Models. And since Google has an abundance of English data, their algorithms will often translate text to English first and then from English to the language of interest, because this yields a higher probability of a correct translation than a direct translation between the two languages.
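As a toy illustration of the part-of-speech use, the `viterbi` sketch from earlier in this post can be reused with tags as the hidden states and words as the observations; the tiny hand-picked probabilities below are made up purely for illustration.

```python
states = ["NOUN", "VERB"]
start = {"NOUN": 0.7, "VERB": 0.3}
emission = {"NOUN": {"dogs": 0.6, "bark": 0.1},
            "VERB": {"bark": 0.8}}
transition = {"NOUN": {"NOUN": 0.2, "VERB": 0.8},
              "VERB": {"NOUN": 0.9, "VERB": 0.1}}

print(viterbi(["dogs", "bark"], states, start, emission, transition))
# -> ['NOUN', 'VERB']
```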

Since Google uses statistical approaches in its language translation, unlike most competitors, which try to learn grammatical rules and use them to translate text, Google Translate can crunch through large sets of existing data and compute probabilities efficiently to quickly provide the most accurate translation; statistical approaches tend to be computationally easier for language translation than trying to parse and evaluate grammatical rules. Furthermore, since Google gets data from books, organizations like the UN, and websites all around the world, it has access to far more data than any competitor, which is what makes statistical learning techniques for text translation feasible. Google's reasoning for using statistical learning instead of rule-based learning is that almost all grammatical rules have long lists of exceptions, which become difficult to encode and evaluate efficiently.

Today, Google Translate also includes a "listen" feature in its smartphone apps. The "listen" feature allows users to speak in one language, and the app will translate the speech and actually speak the translation aloud in the language of your choice. Once again, this opens up a large area of research where Google spends a great deal of its resources: speech recognition. Here too, Google uses many machine learning and statistical techniques that allow its products to correctly interpret the spoken words and then translate them into the desired language. To help with collecting the initial training data, Google has hired many linguists who work on identifying correct translations and help improve the automated process.

http://www.mt-archive.info/MTS-2005-Och.pdf

http://www.forbes.com/forbes/2010/0809/technology-computer-learning-google-translate.html

http://www.geekosystem.com/how-does-google-translate-work/

Sorry for the awkward math notation, I wasn't sure how I could embed LaTeX into my blog post.
