
An Overview of Automated Machine Translation – Statistical Machine Translation

Captain Kirk from the original Star Trek holding a “universal translator” device. In this episode, it is explained that this device translates based on how often (the frequency) signals occur in the alien language.

This is the third and final post in the series on automated machine translation. The content is largely taken from class notes from a natural language processing course I took a few years ago. Although it will contain some mathematical equations, it is meant to be understood by a non-technical audience. If there are any errors, please correct me.

Statistical Machine Translation

Background 

During World War II, mathematicians used the frequencies of letters in the German language to crack the Enigma code. For instance, if the letter “z” appeared 1.13% of the time in a German text, then, assuming the encryption was a letter substitution, whatever letter it was encrypted to should also appear 1.13% of the time in the encrypted text. After the end of the War, these same mathematicians pondered whether they could use these same techniques to solve translation, as proposed in Weaver’s classic 1949 memorandum. These mathematicians viewed language as akin to encryption, changing a message from one understandable representation (e.g., English) into another, encoded representation (e.g., Russian). Noam Chomsky, a noted and influential linguist, however, dismissed these ideas, arguing that statistical approaches to linguistics were inappropriate. Although his argument seemingly exhibited a lack of understanding of probability, Chomsky’s writings had a large impact on the field, and researchers did not delve into statistical machine translation for roughly four decades. That was the case until around 1990, when IBM started looking at the problem using this approach and published the seminal paper, “The Mathematics of Statistical Machine Translation: Parameter Estimation,” in 1993.

What is Statistical Machine Translation?

What statistical machine translation algorithms ask is: given a text in one language, say French, what is the most probable text in another language, e.g., English? Notice what is not being asked. In contrast to the rule-based approach examined in the previous post, statistical machine translation is “ignorant” of the grammar rules of both the source and the target language. Rather, it looks to probability and answers the question with three components: (a) the translation model; (b) the language model; and (c) the decoder algorithm.
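This question can be captured in a single equation, the “noisy channel” formulation used in the IBM paper. The most probable target-language text ê for a given source text f is:

ê = argmax_e p(e|f) = argmax_e p(f|e) · p(e)

By Bayes’ rule, p(e|f) = p(f|e) · p(e) / p(f), and p(f) is fixed for a given input, so maximizing p(f|e) · p(e) is enough. The translation model supplies p(f|e), the language model supplies p(e), and the decoder algorithm carries out the search for the maximizing e.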

High-level overview of statistical machine translation

Translation Model

The translation model attempts to match the strings (i.e., words or phrases) of the source language to strings of the target language. Toward that end, the model looks at each pair of strings and assigns a probability value to the pair, p(f|e). This value is a conditional probability, and in this case is the probability of one string in the source language (labeled “f”) given the occurrence of another string in the target language (labeled “e”).* The IBM paper gave the variables the names “f” and “e” because the researchers were trying to develop a machine translation system that would translate French texts (“f” for French) into English (“e” for English).

The values of p(f|e) are determined from preexisting human translations of one language into the other, called a “parallel corpus.” The 1993 IBM paper used Canadian parliamentary proceedings, which are published in both French and English. The higher the value of p(f|e), the more likely that the machine translation will actually look like a translation; the lower the value of p(f|e), the more likely that it will instead look garbled to a human reader.

Note that one word in one language could correspond to multiple words in another, or to none at all. Further note that the word order in one language could differ from the word order in another. For instance, for the sentence “I rode a car,” the Korean translation is “나는 차를 탔다” (naneun chareul tatda). Note that “a” in the English sentence does not have any corresponding word in the Korean translation. Furthermore, “car” corresponds not only to “차” (cha) but also to the grammatical particle “를” (reul).
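As a concrete illustration, such a correspondence can be represented as a simple mapping from each Korean token to the English word positions it covers (a hypothetical representation with deliberately simplified tokenization):

```python
# English: "I rode a car"  ->  Korean: "나는 차를 탔다"
e = ["I", "rode", "a", "car"]   # positions 0..3
f = ["나는", "차를", "탔다"]     # positions 0..2

# Map each Korean token to the English position(s) it corresponds to.
# "a" (position 2) has no Korean counterpart, and "차를" covers "car"
# plus the object particle "를", which has no separate English word.
alignment = {
    0: [0],  # 나는 <- I
    1: [3],  # 차를 <- car
    2: [1],  # 탔다 <- rode
}
```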

Alignment formula: p(f|e) = Σ_a p(f, a|e), summing over all possible alignments a

Alignment

p(f|e) captures not only the fact that one word in the source language could correspond to multiple words in the other language, or to none, but also the difference in word order between the two languages. This is called “alignment,” labeled “a” in the equation above, and it keeps track of which string in the target language each string in the source language originated from. If there are M strings in the source sentence and L strings in the corresponding target sentence, then the number of possible alignments, labeled “A”, is (L+1)^M, since each source string can align to any one of the L target strings or to none. There are a variety of methods to calculate the alignment; a sketch of one of them follows the footnote below.

* This seems to be the complete opposite of the question asked earlier, i.e., what is the probability of text in the target language given the source. Note, however, that the translation model is just one part of the entire algorithm, and it is easier to calculate p(f|e) than it is to calculate p(e|f) directly.
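Here is a minimal sketch of IBM Model 1, the simplest of the alignment models in the 1993 IBM paper, which estimates word-translation probabilities t(f|e) from a parallel corpus using expectation-maximization. The toy corpus and names are purely illustrative:

```python
from collections import defaultdict

def train_ibm_model1(parallel_corpus, iterations=10):
    """Estimate word-translation probabilities t(f|e) with EM (IBM Model 1)."""
    # Initialize t(f|e) uniformly; "NULL" lets a source word align to nothing.
    f_vocab = {f for f_sent, _ in parallel_corpus for f in f_sent}
    t = defaultdict(lambda: 1.0 / len(f_vocab))

    for _ in range(iterations):
        count = defaultdict(float)  # expected counts c(f, e)
        total = defaultdict(float)  # expected counts c(e)
        for f_sent, e_sent in parallel_corpus:
            e_null = e_sent + ["NULL"]
            for f in f_sent:
                # E-step: spread one count for f across all target words,
                # in proportion to the current t(f|e).
                z = sum(t[(f, e)] for e in e_null)
                for e in e_null:
                    count[(f, e)] += t[(f, e)] / z
                    total[e] += t[(f, e)] / z
        # M-step: re-normalize the expected counts into probabilities.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

# Toy parallel corpus of (source sentence, target sentence) pairs.
corpus = [
    (["la", "maison"], ["the", "house"]),
    (["la", "fleur"], ["the", "flower"]),
]
t = train_ibm_model1(corpus)
print(t[("maison", "house")])  # rises toward 1.0 as EM iterates
```

With more iterations and more sentence pairs, the counts concentrate: “la” is explained by “the” (they co-occur everywhere), which frees “house” to explain “maison”.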

Language Model

The language model determines the probability of the string of the target language actually occurring in that language, p(e). Unlike the translation model, a parallel corpus is not needed; text in only one language is required. There are a number of ways of determining this value. One example is the trigram model, perhaps the simplest such model. In this model, the probability that a sentence of length n will occur in the language is the product of the probabilities of each k-th word given the occurrence of the prior two words, the (k−1)-th and the (k−2)-th, as written out below.
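Written out, with w_k denoting the k-th word of an n-word sentence e (and with the first two words conditioned on start-of-sentence padding), the trigram model is:

p(e) = ∏_{k=1..n} p(w_k | w_{k-2}, w_{k-1})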

For example, say there are two sentences in Korean, “차가 탔다 나를” (chaga tatda nareul, car-rode-me) and “나는 차를 탔다” (naneun chareul tatda, I-rode-car). A properly trained language model would assign probability values based on whether such a sentence would occur in Korean. In this case, the second sentence would be assigned a higher probability value, because it follows Korean grammar rules and so is more likely to appear in Korean texts. In contrast, the first sentence would be assigned a lower probability value, because a normal Korean author would not write such a sentence, making it unlikely to appear in Korean texts.
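A count-based sketch of such a trigram model follows; the toy corpus, smoothing constant, and vocabulary size are illustrative assumptions:

```python
from collections import defaultdict

def train_trigram(corpus):
    """Count trigrams in a tokenized corpus; "<s>" pads sentence starts."""
    tri, bi = defaultdict(int), defaultdict(int)
    for sent in corpus:
        words = ["<s>", "<s>"] + sent
        for k in range(2, len(words)):
            tri[(words[k-2], words[k-1], words[k])] += 1
            bi[(words[k-2], words[k-1])] += 1
    return tri, bi

def score(sent, tri, bi, alpha=0.001, vocab_size=10000):
    """p(e): product of smoothed trigram probabilities over the sentence."""
    words = ["<s>", "<s>"] + sent
    p = 1.0
    for k in range(2, len(words)):
        num = tri[(words[k-2], words[k-1], words[k])] + alpha
        den = bi[(words[k-2], words[k-1])] + alpha * vocab_size
        p *= num / den
    return p

corpus = [["나는", "차를", "탔다"], ["나는", "버스를", "탔다"]]
tri, bi = train_trigram(corpus)
print(score(["나는", "차를", "탔다"], tri, bi))  # relatively high
print(score(["차가", "탔다", "나를"], tri, bi))  # orders of magnitude lower
```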

Decoder Algorithm

The decoder algorithm returns to the fundamental question of statistical machine translation: given a sentence in one language, what is the most probable sentence in the other? Having calculated the product of the translation model and the language model, the decoder algorithm selects the string of the target language with the highest probability.
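In its most naive form, the decoder is just an argmax over candidate translations, as in the hypothetical sketch below; real decoders instead search an enormous space of partial translations with dynamic programming and aggressive pruning, since whole-sentence candidates cannot actually be enumerated:

```python
def decode(f, candidates, p_f_given_e, p_e):
    """Return the candidate e that maximizes p(f|e) * p(e).

    p_f_given_e and p_e stand in for trained translation and language
    models; enumerating whole-sentence candidates is a simplification.
    """
    return max(candidates, key=lambda e: p_f_given_e(f, e) * p_e(e))
```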

Limitations of Statistical Machine Translation

Statistical machine translation has met with a lot of success. Google Translate is a well-received machine translation system that uses statistical methods. Furthermore, Systran, although historically a rule-based machine translation system, has now incorporated statistical machine translation. There are, however, a number of limitations to statistical machine translation.

Firstly, statistical machine translation has difficulty properly translating long and complicated sentences. Note that in certain languages, such as Korean and German, it is not unheard of, and may even be common in certain areas of writing, to see sentences that run more than a page long. Alignment may perform poorly on such long and complex sentences.

Secondly, there is the matter of corpus content and size. Statistical machine translation often uses government texts (e.g., treaties, laws, annals of parliament) with parallel translations as training data, because those are the most readily available. It is thus no surprise that statistical machine translation often gives poor results on texts from other fields. Increasing the size of the corpus, however, may not solve this problem, as overfitting, also known as overtraining, then becomes an issue. Overfitting results in the algorithm being unable to generalize to inputs not presented during its training.

Thirdly, there is the sheer computational complexity. Some of the underlying mathematics of the translation and language models has been omitted from this blog post, but these models require a substantial amount of computation. Some of the best systems take over an hour to translate one sentence.



kuiwon.wordpress.com

 

Copyright Notice

 

