Channel: Koreabridge MegaBlog Feed

An Overview of Automated Machine Translation – The Rule-Based Approach


Vauquois Triangle

This is the second post in the series on automated machine translation. The content is largely taken from class notes from a natural language processing course I took a few years ago. Although it will contain some mathematical equations, it is meant to be understood by a non-technical audience. If there are any errors, please correct me.

The Rule-Based Approach

Of the two approaches, the rule-based approach has the longer and more distinct history. The first demonstration of automated translation by computer, developed by IBM and Georgetown, was based on this approach. Its theoretical basis dates to the so-called “Age of Enlightenment,” when philosophers theorized that a “universal language” (an “interlingua”) underlay all the languages of the world. According to this theory, when a human being translates a text or speech from one language to another, he first analyzes the source and converts it in his mind into its syntactic (the order of words) and semantic (what the words represent) representations. The translator then uses this interlingua to transfer these representations into the syntactic and semantic representations of the target language. Finally, the translator transcribes these representations into the text or speech of the target language. In the Vauquois Triangle in the figure above, this translation approach runs along the left and right edges of the triangle.

Rule-based machine translation algorithms attempt to mimic this approach. Generally, these algorithms first analyze the words of the source text and break them down into their constituent parts of speech, e.g., nouns, noun phrases, verbs, adjectives, adverbs, prepositions, etc. They then parse these words according to the grammar of the source language into a “parse tree” (a topic that deserves its own post). An example of a parse tree is pictured below. Next, these algorithms map these words to words or phrases of the target language using a dictionary. Finally, the algorithm rearranges these words and phrases according to the grammar of the target language and produces the output translation.
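The first two steps can be sketched in a few lines of code. This is a minimal illustration, not a real tagger or parser: the word list and the assumption that the first word is the subject are simplifications made up for this example.

```python
# A toy sketch of the first two pipeline steps: part-of-speech
# tagging from a hand-built lexicon, then grouping into a crude
# parse tree. The lexicon and the "first word = subject" rule
# are illustrative assumptions, not any real system's behavior.

POS_LEXICON = {
    "I": "PRON", "rode": "VERB", "a": "ART", "car": "NOUN",
}

def tag(words):
    """Step 1: label each word with its part of speech."""
    return [(w, POS_LEXICON[w]) for w in words]

def parse(tagged):
    """Step 2: group tagged words into a crude tree of the shape
    (S, (NP, subject), (VP, verb and its objects))."""
    subject, rest = tagged[0], tagged[1:]
    return ("S", ("NP", subject), ("VP", *rest))

tree = parse(tag("I rode a car".split()))
print(tree)
# ('S', ('NP', ('I', 'PRON')), ('VP', ('rode', 'VERB'), ('a', 'ART'), ('car', 'NOUN')))
```

A real system would replace the lexicon with a morphological analyzer and the one-rule parser with a full grammar, but the shape of the output, a nested tree of phrases, is the same.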

“I rode down the street in a car” parsed into a parsing tree (Madsen 2009)

For instance, take the sentence “I rode a car,” shortened for the sake of simplicity. First, the algorithm would label “I” as a noun, “rode” as a verb, “a” as an article, and “car” as a noun. It would then parse these words, such that “I” is the subject, “rode” is the verb, “a” is an article modifying “car,” and “car” is the direct object; the last three words would come under the “verb phrase” umbrella, similar to the figure above. Next, if the target language were Korean, “I” would be mapped to “나” (na), “rode” would be mapped to “탔다” (tatda), “a” would be mapped to null because Korean has no articles, and “car” would be mapped to “차” (cha). The algorithm would finally rearrange the sentence into Korean word order and, in that process, add the necessary grammatical particles. Thus, after rearranging the sentence to “나 차 탔다” (na cha tatda), the algorithm would add “는” (neun) after “나” to mark it as the subject and “를” (reul) after “차” to mark it as the direct object. The final output would be “나는 차를 탔다” (naneun chareul tatda).
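The transfer and reordering steps of this walkthrough can be sketched end to end. The four-word dictionary and the hard-coded particle choices are assumptions made for this one sentence only; a real system would select particles by phonological rule and reorder via the parse tree, not by position.

```python
# A toy end-to-end sketch of the example above: dictionary transfer
# from English to Korean, reordering from SVO to SOV word order,
# and attachment of grammatical particles. The dictionary and the
# fixed particles are simplifications for this single sentence.

DICTIONARY = {"I": "나", "rode": "탔다", "a": None, "car": "차"}

def translate(words):
    # Step 3: map each source word to its target word
    # (None means the word has no Korean equivalent, e.g. articles).
    mapped = [DICTIONARY[w] for w in words if DICTIONARY[w] is not None]
    subject, verb, obj = mapped[0], mapped[1], mapped[2]
    # Step 4: reorder subject-verb-object into subject-object-verb,
    # attaching "는" to mark the subject and "를" the direct object.
    return f"{subject}는 {obj}를 {verb}"

print(translate("I rode a car".split()))  # -> 나는 차를 탔다
```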

Notice that what is missing in this algorithm is the “interlingua” step depicted in the Vauquois Triangle. This should not come as a surprise, as it is difficult to imagine what this interlingua actually is. Some have proposed that one actual human language (e.g., Chinese or French) could be used as the “interlingua” in an algorithm.

Limitations of the Rule-Based Approach

The Inherent Ambiguity of Natural Language

Although the rule-based approach sounds intuitive and supposedly mirrors how humans translate from one language to another, it is difficult for a computer. This is because natural language is inherently ambiguous. Humans, through our daily interactions and life-long experiences, are able to discern among these ambiguities and rule out ridiculous interpretations. Computers, on the other hand, have no such experience and cannot eliminate interpretations that are otherwise grammatically valid. For instance, the sentence “I rode down the street in a car” can be parsed such that the “car” is literally inside the “street,” similar to how a yolk is inside an egg. The parse tree above as well as the one below are both valid parses. The latter parse is absurdly implausible to a human; a computer, however, having no knowledge or experience of the real world, has difficulty discerning which one is correct.
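The ambiguity here is where the phrase “in a car” attaches. Writing the two readings out as data structures makes the point concrete; the nested-tuple notation is just an illustration, not the representation any particular parser uses.

```python
# The two grammatically valid parses of "I rode down the street in a car",
# written as nested tuples. Both satisfy the grammar; only real-world
# knowledge tells us the second reading is absurd.

# Reading 1: "in a car" modifies the verb "rode" -- the plausible one.
plausible = ("S", ("NP", "I"),
             ("VP", "rode",
              ("PP", "down", ("NP", "the street")),
              ("PP", "in", ("NP", "a car"))))

# Reading 2: "in a car" attaches inside "the street" -- i.e., the
# street contains the car, like a yolk inside an egg.
implausible = ("S", ("NP", "I"),
               ("VP", "rode",
                ("PP", "down",
                 ("NP", "the street",
                  ("PP", "in", ("NP", "a car"))))))

# A purely syntactic parser has no basis to prefer one over the other:
# the same words, the same grammar rules, two different trees.
print(plausible == implausible)  # -> False
```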

Alternative but valid parse tree (Madsen 2009).

One method to tackle this problem is to incorporate semantic knowledge, as opposed to only lexical (word-level) knowledge, into rule-based machine translation algorithms. This allows an algorithm to rule out alternative but otherwise grammatically valid translations that are implausible to a human. Such algorithms are known as knowledge-based machine translation, and they attempt to validate a translated sentence by applying predicate logic. For instance, such an algorithm may attempt to validate the act of giving something to someone by applying a function “G” (giving) with the parameters “O” (object) and “R” (recipient, or indirect object). Thus, G(O = “toy”, R = “child”) would be validated, because it is plausible to give a toy to a child; however, G(O = “child”, R = “toy”) would not be valid, because it is implausible to give a child to a toy.
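The G(O, R) check above can be sketched directly. The `SENTIENT` set is a stand-in for the system's world model, an assumption invented for this example; a real knowledge base would encode far richer selectional restrictions.

```python
# A sketch of the knowledge-based validation described above: the
# predicate G(O, R) for "giving" is checked against a tiny hand-built
# world model. The SENTIENT set is an illustrative assumption standing
# in for a full knowledge base.

SENTIENT = {"child", "person"}  # entities capable of receiving something

def G(O, R):
    """Validate the proposition 'O is given to R': the recipient R
    must be a sentient entity; the object O is unrestricted here."""
    return R in SENTIENT

print(G(O="toy", R="child"))   # -> True:  giving a toy to a child is plausible
print(G(O="child", R="toy"))   # -> False: a toy cannot receive anything
```

The catch, as the next section explains, is that this only works if the world model already contains every entity and restriction the system will ever encounter.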

The Qualification Problem

This solution, however, leads to another problem inherent not just in machine translation but throughout the entire field of artificial intelligence. As one might deduce, for this knowledge-based system to work, the algorithm needs to know everything in the world beforehand; the slightest difference between what it sees and what it knows trips it up. This issue is known as the qualification problem: it is difficult, if not impossible, to qualify everything in the world a priori. This is not to say that rule-based machine translation algorithms are entirely useless; rather, they can be useful, especially when the context is limited to a particular topic (e.g., ticket orders). Indeed, there are commercially available rule-based machine translation systems, such as Systran (Babelfish).



kuiwon.wordpress.com

 
