This is the first post in the series on automated machine translation. The content is largely taken from class notes from a natural language processing course I took a few years ago. Although it will contain some mathematical equations, it is meant to be understood by a non-technical audience. If there are any errors, please correct me.
Introduction
“Et dixit: Ecce, unus est populus, et unum labium omnibus: cœperuntque hoc facere, nec desistent a cogitationibus suis, donec eas opere compleant. Venite igitur, descendamus, et confundamus ibi linguam eorum, ut non audiat unusquisque vocem proximi sui … Et idcirco vocatum est nomen ejus Babel, quia ibi confusum est labium universæ terræ: et inde dispersit eos Dominus super faciem cunctarum regionum.” - Genesis xi:6-7, 9.
The idea of automated machine translation dates to the end of World War II. At that time, mathematicians who had cracked the German Enigma code began discussing whether their cryptographic techniques could be applied to translation. Warren Weaver, a famous mathematician and cryptographer, proposed the following in his seminal 1949 memorandum:
It is very tempting to say that a book written in Chinese is simply a book written in English which was coded into the “Chinese code.” If we have useful methods for solving almost any cryptographic problem, may it not be that with proper interpretation we already have useful methods for translation?
Although Weaver specifically mentions Chinese as a language to translate, the advent of the Cold War between the West and the Soviet Union led many to see a need for automated Russian-to-English translation. The fruits of these efforts were seen some five years after Weaver’s memorandum. On January 7, 1954, at its headquarters in New York, IBM held the first public demonstration of a computer, developed collaboratively by IBM and Georgetown, that successfully translated sixty Russian sentences on various topics into English. Although the computer was relatively simple, having been programmed with only six “grammar” rules and 250 words, most of them organic chemistry terms, the results captured the public’s attention and kindled other researchers’ interest in the field.
Prominent researchers in the field even announced that the problem would be solved within a decade. The dream of having a computer take the speech of a speaker in one language, translate it into another language, and then synthesize a voice speaking the translated text seemed within reach. Such optimism, however, quickly evaporated as reality set in. In 1966, the Automatic Language Processing Advisory Committee (ALPAC), a committee established by the US government to evaluate research in computational linguistics, published a report on machine translation. In its report, ALPAC expressed skepticism about the progress of machine translation. It found that machine translation was not cost-effective, as computers at the time were very large and expensive. ALPAC also questioned its practicality, specifically noting that an increasing number of American scientists and engineers, the most likely users of such a system, were already gaining fluency in Russian and were able to read Soviet scientific journals. The effect of the ALPAC report was to end substantial government funding for some twenty years.
Although the ALPAC report certainly curbed enthusiasm for machine translation, the field did not die out completely. With the increased availability of computers and growing consumer demand, interest in machine translation picked up again in the 1980s. Researchers were also encouraged by successes in the closely related field of speech processing. Today, machine translation is among the most frequently used services that some websites provide. Popular machine translation services include Systran (Babelfish) and Google Translate.
There are two primary approaches to machine translation: (1) rule-based and (2) statistical. These will be covered in the next posts.