Open Translation Tools

Machine translation

The field of machine translation (MT) is many decades old, and in fact dates back to the early days of the computer industry. This chapter provides an overview of how machine translation systems work, how they are used, and what their limitations are.

The basic concept of machine translation is to automatically translate a source text in one language into another language. As there are hundreds of widely spoken languages, the number of potential language pairs is astronomical. For this reason, machine translation systems tend to focus on a relatively small number of "high traffic" language pairs where there is demand for translation, such as English to French, French to German, etc.

The demand for translation, not surprisingly, roughly mirrors major world trade routes. The first lesson is that machine translation is only an option for relatively few languages, mostly major European and Asian languages. Open source machine translation projects such as Moses and Apertium will change this, but building machine translation systems takes work, so this will not happen overnight. An important point to state up front is that machine translation is not a magic technology. It takes a great deal of effort, either in the form of writing linguistic data or collecting training data, to build these systems.

Fortunately, machine translation solves a big problem for governments, world trade and so on, and as such has received a considerable amount of funding over the past several decades. Systems currently in use today, such as Systran, trace their development back many decades. In the United States, the military and intelligence agencies fund machine translation research as a way of multiplying their ability to see and understand what is happening overseas. In Europe, the creation of the EU has pushed governments and private companies to make their laws and trading practices more homogeneous. In general, better communication means better trade, so in Europe translation efforts have been focused in this area. The majority of translation engines in existence today trace their origins back to one of these two sources.

Machine translation technology


In all but the most basic cases, translation is not a simple algebraic process where sentences can be translated according to predefined rules. It demands that the translator, human or machine, understand what is being said, and often what is left unsaid, in order to produce a high quality translation. A machine translation system, by contrast, uses either predefined rules or statistical patterns derived from training data to calculate how a source text may appear in another language. However, it does not understand what it is translating any more than a calculator understands the meaning of pi. Human language is a way of describing a situation or environment, one that includes the full range of human senses, emotion and so on. This explains the limitations of the technology, which will be covered in more detail later in this chapter.

Machine translation today falls into one of three broad categories: rule-based translation (sometimes referred to as transfer-based machine translation), example-based machine translation and statistical machine translation. Rule-based translation is based on the idea that languages have a fairly small set of basic rules, such as how to conjugate verbs, a larger set of exceptions to the basic rules, and a vocabulary or dictionary. Statistical machine translation is based on a different approach. It compares directly translated, or aligned, texts to detect statistical patterns. Phrases that usually translate a certain way show up in many different texts, so the system learns that "white house" usually translates to "casa blanca" in Spanish.

In academia, a large body of text is called a corpus (the plural is corpora). In machine translation, a database of common translations (e.g. "white house" to "casa blanca") is called a Translation Memory (TM).
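
As a rough illustration, a translation memory can be thought of as a lookup table keyed on source phrases. The Python sketch below is a minimal, hypothetical example; real translation memories also store metadata such as the translator, quality scores and surrounding context.

    # A toy translation memory: a mapping from source phrases to known
    # human translations. The entries are invented for illustration.
    translation_memory = {
        ("en", "es", "white house"): "casa blanca",
        ("en", "es", "my name is"): "mi nombre es",
    }

    def lookup(source_lang, target_lang, phrase):
        """Return a stored human translation, or None if the phrase is new."""
        return translation_memory.get((source_lang, target_lang, phrase.lower()))

    print(lookup("en", "es", "White House"))   # a stored translation is found
    print(lookup("en", "es", "town hall"))     # None: nothing stored for this phrase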

Rule-based translation

Rule-based machine translation is based on the idea that to make a translation it is necessary to have an intermediate representation that captures the meaning of the original sentence in order to generate the correct translation in the target language. In interlingua-based MT this intermediate representation must be independent of the languages in question, whereas in transfer-based MT, it has some dependence on the language pair involved.

The way in which transfer-based machine translation systems work varies substantially, but in general they follow the same pattern: they apply sets of linguistic rules which are defined as correspondences between the structure of the source language and that of the target language. The first stage involves analysing the input text for morphology and syntax (and sometimes semantics) to create an internal representation. The translation is generated from this representation using both bilingual dictionaries and grammatical rules.

It is possible with this translation strategy to obtain fairly high quality translations, with accuracy in the region of 90%, although this depends heavily on the language pair in question, and in particular on how closely related the two languages are.

In a rule-based machine translation system the original text is first analysed morphologically and syntactically in order to obtain a syntactic representation. This representation can then be refined to a more abstract level putting emphasis on the parts relevant for translation and ignoring other types of information. The transfer process then converts this final representation (still in the original language) to a representation of the same level of abstraction in the target language. These two representations are referred to as "intermediate" representations. From the target language representation, the stages are then applied in reverse. 
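
The following sketch shows the analysis, transfer and generation stages in miniature. It is not how a production system such as Apertium is implemented; the tiny dictionary, the part-of-speech tags and the single reordering rule are invented purely to show the shape of the pipeline.

    # Toy transfer pipeline: analyse -> transfer -> generate.
    bilingual_dict = {"the": "la", "white": "blanca", "house": "casa"}

    def analyse(sentence):
        # Morphological/syntactic analysis, reduced here to tokenisation
        # plus a crude part-of-speech guess.
        tags = {"the": "DET", "white": "ADJ", "house": "NOUN"}
        return [(word, tags.get(word, "UNK")) for word in sentence.lower().split()]

    def transfer(analysis):
        # Structural transfer: English ADJ+NOUN becomes NOUN+ADJ in Spanish.
        result = list(analysis)
        for i in range(len(result) - 1):
            if result[i][1] == "ADJ" and result[i + 1][1] == "NOUN":
                result[i], result[i + 1] = result[i + 1], result[i]
        return result

    def generate(transferred):
        # Lexical generation from the bilingual dictionary.
        return " ".join(bilingual_dict.get(word, word) for word, _tag in transferred)

    print(generate(transfer(analyse("the white house"))))  # prints: la casa blanca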

This all sounds straightforward, but it is a lot harder than it looks. This is because computers are very inflexible, so when they encounter something that is outside their programmed rules, they don't know what to do. Examples of the types of things that can confuse a computer are:

  • Words slightly out of order, e.g. "I think probably that this might be a good idea". A computer can easily lose track of which words belong together.
  • Words that have many meanings depending on context, e.g. "The boy was looking for a pen, he found the pen, it was in the pen." Humans can usually understand the context of a word or phrase easily; a computer, on the other hand, does not have the encyclopaedic knowledge available to humans.
  • Colloquial expressions, shorthand, and other common types of speech
  • Typographic errors.

So now, in addition to writing software to capture every rule in a language, you also have to write software to deal with the exceptions to those rules, which can include anything from conjugating irregular verbs to dealing with common typographic errors. This requires a lot of effort, so, not surprisingly, good rule-based translation systems exist for only a relatively small number of language pairs: the up-front cost in time and money of creating all of this information is high, and it can take up to six months to create even a basic rule-based MT system from scratch.
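
To give a flavour of what handling exceptions involves, here is a small sketch that forms the simple past tense of English verbs with one regular rule plus a hand-written exception table. The verbs listed are only a sample chosen for illustration; a real system needs entries for every irregular form, plus further rules for spelling changes, typos and so on.

    # Regular rule: add "-ed". Irregular verbs are handled by an
    # exception table that must be written (and maintained) by hand.
    IRREGULAR_PAST = {"go": "went", "be": "was", "eat": "ate", "write": "wrote"}

    def past_tense(verb):
        if verb in IRREGULAR_PAST:          # the exception wins over the rule
            return IRREGULAR_PAST[verb]
        if verb.endswith("e"):              # "translate" -> "translated"
            return verb + "d"
        return verb + "ed"                  # the default rule

    print(past_tense("translate"), past_tense("go"))  # prints: translated went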

Statistical machine translation

Statistical machine translation, developed more recently as computing power and storage became cheap, is based on the idea that if you feed a computer a very large number of directly translated texts (known as aligned texts), it can detect statistical patterns for how phrases are translated. The advantage of this approach is that the training process is automatic. If you have a large enough training corpus (typically at least 30 million words of aligned sentences), the system can train itself. The important difference with statistical translation is that it is a "blank slate": the computer has no pre-programmed rules or assumptions about how a particular language is constructed. This is both a strength and a weakness. It is a strength because languages are filled with exceptions and unusual phrases; once a statistical system learns how a phrase translates into another language, it does not need to know why. It is a weakness because the system has no internal knowledge of the rules of a language, rules which could guide it when it can only translate part of a text. The most popular form of statistical machine translation currently in use is phrase-based statistical machine translation, and that is what is described here.

The simplest way to explain how a statistical translation engine trains itself is as follows. Let's imagine that we have a corpus of a million sentences that have been translated from English to Spanish. The translation engine will crawl through the entire set of sentences, breaking each one down into smaller blocks of words called n-grams. The sentence "Hello my name is John" could be divided into four two-word n-grams (hello my, my name, name is, is John). This step, breaking texts down into smaller units, is exactly the kind of repetitive task that computers do very well. Typically these systems will break sentences down into blocks of several different word lengths.
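
A sketch of the n-gram step, assuming two-word n-grams (bigrams); real systems extract several n-gram lengths at once:

    def ngrams(sentence, n=2):
        """Break a sentence into overlapping n-word blocks."""
        words = sentence.lower().split()
        return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

    print(ngrams("Hello my name is John"))
    # prints: ['hello my', 'my name', 'name is', 'is john']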

This is where we get into the statistics of statistical machine translation. What the system wants to find are n-grams that are strongly associated with n-grams in another language. Let's say that of the one million training sentences, 1,000 of them contain some version of "Hello, my name is John". Given enough data, the system will learn that "my name is" is strongly correlated with "mi nombre es" in the Spanish translations. The system does not need to understand these words in any way; it is just counting the number of times each n-gram appears in aligned texts, and based on that calculating how strong the correlation is.
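
At its core, the training step is counting. The sketch below is a deliberately simplified, hypothetical version of that bookkeeping: it counts how often source and target n-grams co-occur in aligned sentence pairs and turns the counts into relative frequencies. A real trainer such as Moses also estimates word alignments, reordering models and other statistics.

    from collections import Counter
    from itertools import product

    def ngrams(sentence, n):
        words = sentence.lower().split()
        return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

    pair_counts = Counter()     # how often each (source n-gram, target n-gram) co-occurs
    source_counts = Counter()   # how often each source n-gram is seen at all

    def train(aligned_pairs, n=3):
        for english, spanish in aligned_pairs:
            for en_gram, es_gram in product(ngrams(english, n), ngrams(spanish, n)):
                pair_counts[(en_gram, es_gram)] += 1
                source_counts[en_gram] += 1

    def association(en_gram, es_gram):
        # Relative frequency: of all co-occurrences involving en_gram,
        # what fraction were with es_gram?
        if source_counts[en_gram] == 0:
            return 0.0
        return pair_counts[(en_gram, es_gram)] / source_counts[en_gram]

    train([("Hello my name is John", "Hola mi nombre es John")])
    print(association("my name is", "mi nombre es"))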

So let's imagine that we've trained the system with 1,000 variations of "Hello my name is ____". Then we give it the expression "Hello, my name is Dolores". It has not been trained to translate that sentence yet, but it has seen many other examples. It will break this sentence down into n-grams, and for each one it will find the best matching translation. It will be highly confident that hello = hola and my name is = mi nombre es, so it will probably translate this as "Hola, mi nombre es Dolores". You can see the logic of this approach. It works well when you have a large enough training corpus, but it has some important limits. Statistical translation, like rule-based translation, has its weak points, which include:

  • Requires a large set of aligned texts, usually millions of sentences, to produce adequate results; these training corpora require a lot of work to prepare and are not available for the overwhelming majority of the world's languages
  • Very sensitive to text mis-alignment, where source texts are slightly out of sync with translations (a single error in a line break, for example, can ruin the alignment for every sentence pair that follows)
  • Potential for information loss when translating between languages with different rules for things such as formality or cases.

The basic concept is that, given a large enough training corpus, you can stitch together composite translations by treating sentences as collections of smaller units. This is a simplified explanation, but it should get the idea across.
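
As a final sketch, here is a deliberately naive "decoder" that stitches a translation together by walking through the source sentence and greedily picking the longest phrase it already knows. The phrase table entries are invented for illustration. Real phrase-based decoders, such as the one in Moses, search over many segmentations and reorderings and score them with a language model, but the stitching idea is the same.

    # phrase_table maps source phrases to (translation, score) pairs.
    phrase_table = {
        "hello": ("hola", 0.9),
        "my name is": ("mi nombre es", 0.8),
        "dolores": ("dolores", 0.5),
    }

    def greedy_translate(sentence, max_len=3):
        words = sentence.lower().replace(",", "").split()
        output, i = [], 0
        while i < len(words):
            # Try the longest known phrase starting at position i.
            for length in range(min(max_len, len(words) - i), 0, -1):
                phrase = " ".join(words[i:i + length])
                if phrase in phrase_table:
                    output.append(phrase_table[phrase][0])
                    i += length
                    break
            else:
                output.append(words[i])   # unknown word: pass it through unchanged
                i += 1
        return " ".join(output)

    print(greedy_translate("Hello, my name is Dolores"))
    # prints: hola mi nombre es dolores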

Statistical versus rule-based translation

Each system has distinct advantages and disadvantages depending on the languages used, the application and so on. The short answer, in comparing the two, is that statistical translation can produce very good results when it is trained with a large enough training corpus. For some language pairs, English to Spanish for example, there is a large amount of training data to use. The problem is that for most other language pairs it is very difficult to find enough data, at least in machine readable form, with which to train the translation engine.

Rule-based translation systems require a lot of curating to program the engine with a language's core rules, most important verbs and vocabulary, but they can also be developed procedurally. Rule-based engines are also a good choice for closely related languages, where the basic grammar and vocabulary are not radically different. Apertium, an open source rule-based engine, has done a lot of work building translation engines for closely related languages such as Spanish, Catalan and Galician, as well as for more distant pairs such as Basque and Spanish.

If you have a sufficiently large translation corpus, statistical translation is compelling, but for most language pairs, the data simply does not exist. In that case, the only option is to build a rule-based system.

Limits of machine translation

The limit of machine translation is simply that computers are calculators. They do not understand language any more than a calculator understands the meaning of pi. Human language is a form of shorthand for our environment and our physical and emotional states. With a single sentence, you can paint a picture of what is happening and how someone responds to it. As you comprehend the sentence, you picture those events in your mind. When a computer translates a sentence, it is simply processing strings of numbers.

The strength computers have is that they have an essentially infinite memory with perfect recall. This enables them to do brute force calculations on millions of sentences, to query a database with billions of records in a fraction of a second, all things that no human can do. Intelligence, however, is much more than memory.

Machine translation will continue to improve, especially for languages that are heavily translated by people, because those human translations can be continually fed back into translation engines, increasing the likelihood that when an engine encounters an unusual text, someone somewhere has already translated something like it. That dependence points to the fundamental limit for a computer: a person can build a visual picture of what is described in a scene, and if that description is incomplete, he or she can make a decent guess about the missing part. A computer doesn't have that capability.

With this in mind, machine translation is a great tool for reading texts that would otherwise not be accessible, but it is not a replacement for human translators, just as you would never expect a computer to write a novel for you.

Machine-assisted translation systems

Machine assisted translation systems solve this problem by using machine translation where it is strongest: for quickly obtaining rough draft translations that, in turn, can be edited or replaced by people. The workflow in this type of system is usually something like the following (a minimal sketch of the workflow appears after the list):

  • The system obtains and displays a machine translation when no human translation is available
  • Users can edit or replace the machine translations, either via an open wiki-like system or via a closed process (trusted translators only)
  • The system displays human translations in place of machine translations
  • Users may be able to score, block or highlight translations based on their quality, which enables the system to make additional decisions about which translations to display
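
A minimal sketch of the core of this workflow, assuming a hypothetical human_translations store and a placeholder machine_translate() back-end; a real system would also track per-translation scores, users and moderation state.

    # Edits made by people, keyed by language pair and source text.
    human_translations = {}   # (source_lang, target_lang, text) -> human version

    def machine_translate(text, source_lang, target_lang):
        # Placeholder for a call to an MT engine such as Apertium or Moses.
        return "[machine translation of: %s]" % text

    def get_translation(text, source_lang, target_lang):
        key = (source_lang, target_lang, text)
        if key in human_translations:             # prefer the human version
            return human_translations[key], "human"
        return machine_translate(text, source_lang, target_lang), "machine"

    def submit_edit(text, source_lang, target_lang, improved):
        # A trusted user (or the crowd) replaces the machine draft.
        human_translations[(source_lang, target_lang, text)] = improved

    print(get_translation("Hello world", "en", "es"))    # machine draft at first
    submit_edit("Hello world", "en", "es", "Hola mundo")
    print(get_translation("Hello world", "en", "es"))    # now the human version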

There are several ways of implementing systems like this, depending on the type of application, languages needed, budget, etc. Among the options are:

  • Crowd-sourced or wiki translation systems, where anyone can edit and score translations
  • Managed translation systems, where only approved translators and editors can edit translations
  • Mixed systems where submissions may be open to the public, but are screened by other users or editors prior to publication

In many cases, such as with Google Translator Toolkit, the translations that people give back to the system are stored in a Translation Memory (TM), so that if the same phrase is translated again, the machine translation algorithm already has a record of how a human translated it. Unfortunately, Google does not share this aggregate data back with the public, which is a real loss for the wider translation community.

Free and open-source translation engines

Until recently, all of the machine translation systems in use were closed, proprietary systems (e.g. Systran, Language Weaver). Users were dependent on vendors to support and maintain their platforms, and could not make their own modifications. Fortunately, the open source model is spreading to machine translation. These systems can be modified as needed to accommodate new languages, and will enable small, independent teams to create a wide variety of translation engines for different languages and purposes.

Apertium (http://www.apertium.org/) is an open source, rule-based translation engine. It is being used to build translators for closely related languages (e.g. Spanish and Catalan), and also to build translators for less common language pairs, where the data required to do statistical translation simply does not exist (for example Breton to French, or Basque to Spanish). The software opens the door for groups of amateur and professional linguists and programmers to build translation engines for almost any language. This will make rule-based translation available both for smaller commercial opportunities (less widely spoken languages) and for scholars and hobbyists (e.g. translators for Latin).

Moses (http://www.statmt.org/moses) is an open source statistical translation engine that, like Apertium, enables people to build their own translation platforms. The system is aimed mostly at research into statistical machine translation. The engine can be used to build a machine translation system for any pair of languages for which a large enough parallel corpus exists.

The Worldwide Lexicon Project (http://www.worldwidelexicon.org), an open source translation memory, has also developed a multi-vendor translation proxy server that enables users to query many different machine translation engines (both open source and proprietary systems) via a simple web API. Most of the Worldwide Lexicon code is released under a BSD-style (permissive) license. For more information on licensing, see the section under Intellectual Property.
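
To give an idea of what "a simple web API" means in practice, the sketch below issues an HTTP GET request to a translation proxy and reads back a JSON response. The endpoint URL and parameter names are invented placeholders, not the actual Worldwide Lexicon API; consult the project's documentation for the real interface.

    import json
    import urllib.parse
    import urllib.request

    def request_translation(text, source_lang, target_lang,
                            endpoint="http://example.org/translate"):
        # Hypothetical endpoint and parameter names, for illustration only.
        params = urllib.parse.urlencode({
            "sl": source_lang,   # source language code
            "tl": target_lang,   # target language code
            "q": text,           # text to translate
        })
        with urllib.request.urlopen(endpoint + "?" + params) as response:
            return json.loads(response.read().decode("utf-8"))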

Further reading