Open Translation Tools

Translation Memory

A translation memory, or TM, is a database that stores source segments paired with their human-translated target segments. These segments can be blocks, paragraphs, sentences, or phrases. Translation memory does not apply to word-level pairs, which are covered by translation glossaries. A translation memory system stores the phrases, sentences, and paragraphs that have already been translated and offers them to human translators for reuse. Each source text is stored with its corresponding translation, for a given language pair, in a record called a “translation unit”.
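As a minimal sketch of the idea, a translation unit can be modelled as a small record holding a language pair and the two segments. The class name `TranslationUnit`, its field names, the helper `units_for_pair`, and the sample sentences are all assumptions made for illustration; real TM tools use their own storage formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TranslationUnit:
    """One source segment and its human translation (hypothetical layout)."""
    source_lang: str
    target_lang: str
    source: str
    target: str


# A tiny in-memory translation memory: just a list of units.
memory = [
    TranslationUnit("en", "fr", "Save the file.", "Enregistrez le fichier."),
    TranslationUnit("en", "fr", "Close the window.", "Fermez la fenêtre."),
    TranslationUnit("en", "es", "Save the file.", "Guarde el archivo."),
]


def units_for_pair(units, source_lang, target_lang):
    """Return only the units that belong to one language pair."""
    return [
        u for u in units
        if u.source_lang == source_lang and u.target_lang == target_lang
    ]


for unit in units_for_pair(memory, "en", "fr"):
    print(unit.source, "->", unit.target)
```

Keeping the language pair on every unit lets one memory hold several target languages for the same source segment.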

Translation memories are typically used in conjunction with a dedicated computer-assisted translation (CAT) tool, a word processing program, a terminology management system, a multilingual dictionary, or even raw machine translation output.

Research indicates that many companies producing multilingual documentation use translation memory systems. In a 2006 survey of language professionals, 82.5% of 874 respondents confirmed that they use a TM. TM usage correlated with text types characterised by technical terms and simple sentence structure (technical texts and, to a lesser degree, marketing and financial texts), with computing skills, and with the repetitiveness of content [1] [http://en.wikipedia.org/wiki/Translation_memory].

Translation memory management involves searching the TM files for translations whose source text is identical or similar to the text that we wish to translate. A source segment found in the TM database can be identical to the sentence that we want to translate, in which case we have a 100% match for our string, and the stored translation is probably good enough to reuse.
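A 100% match lookup can be sketched as a dictionary keyed by the source segment. The class `ExactMatchMemory` and its methods are hypothetical names for this example. Collapsing runs of whitespace before comparing is also an assumption: tools differ on what they still count as "identical".

```python
def normalize(segment: str) -> str:
    """Collapse runs of whitespace so layout differences do not block a match.

    This normalization is an assumption of the sketch, not a standard rule.
    """
    return " ".join(segment.split())


class ExactMatchMemory:
    """Hypothetical TM that only answers 100% matches."""

    def __init__(self):
        self._units = {}  # normalized source segment -> stored translation

    def add(self, source: str, target: str) -> None:
        self._units[normalize(source)] = target

    def lookup(self, source: str):
        """Return the stored translation, or None if there is no 100% match."""
        return self._units.get(normalize(source))


tm = ExactMatchMemory()
tm.add("Save the file.", "Enregistrez le fichier.")

print(tm.lookup("Save the file."))   # identical source: translation is reused
print(tm.lookup("Save the files."))  # not identical: no exact match
```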

Exact Matching Versus Fuzzy Matching

Our TM may also contain translations whose source string is similar, but not identical, to the string that we wish to translate. In this case we express how close the TM string is as a percentage. We might find translations that are 95% or 70% similar to the string that we want to translate, in which case we say that we have a 95% or a 70% match. If we use such strings, they must always be marked as fuzzy and reviewed by the translators.
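The sketch below shows one way such a percentage could be computed, using the character-level similarity ratio of `difflib.SequenceMatcher` from Python's standard library. This measure is an assumption chosen for illustration, since each TM tool defines its own similarity score. The function names, the 70% default threshold, and the sample sentences are likewise hypothetical.

```python
from difflib import SequenceMatcher


def match_percent(a: str, b: str) -> int:
    """Similarity of two segments as a rounded percentage (0 to 100)."""
    return round(SequenceMatcher(None, a, b).ratio() * 100)


def fuzzy_lookup(memory: dict, segment: str, threshold: int = 70):
    """Return (score, source, target, needs_review) tuples, best match first.

    Anything below 100% is flagged as fuzzy so that a translator reviews it.
    """
    results = []
    for source, target in memory.items():
        score = match_percent(segment, source)
        if score >= threshold:
            results.append((score, source, target, score < 100))
    results.sort(key=lambda r: r[0], reverse=True)
    return results


memory = {
    "Save the file.": "Enregistrez le fichier.",
    "Close the window.": "Fermez la fenêtre.",
}

for score, source, target, needs_review in fuzzy_lookup(memory, "Save the files."):
    label = "fuzzy, needs review" if needs_review else "exact"
    print(f"{score}% match ({label}): {source} -> {target}")
```

Raising the threshold returns fewer but safer suggestions. Lowering it gives translators more candidates to adapt, at the cost of more review work.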

Local Versus Network Translation Memories

Translation memories started out as locally hosted and managed programs (they have been in use since the early days of the personal computer era, before the notion of Internet services existed). As the Internet has developed and broadband connectivity has become ubiquitous, these services are migrating to network based SaaS (software as a service) offerings. Each approach has its own advantages.

Locally hosted translation memories do not depend on an external network server and can scan the contents of their corpora, held on a local disk drive or in system memory, very quickly. The burst capacity of a typical desktop computer is quite substantial, so these programs can implement highly compute intensive fuzzy matching algorithms that are difficult to implement on network based systems, and that do not scale as easily in an offsite environment (because of latency from the network connection, disk or data store access, and other performance issues).

Network based translation memories have a different advantage: they can be accessed via a web browser or a simple client application (e.g. a Java app), and they make it easy for many people to work from a single shared translation memory. This approach makes more sense when there is a larger, distributed translation team, such as in a translation service bureau. The main weakness of these applications is that it can be difficult to configure the servers so that they are accessible from outside an organization's firewall. More and more vendors are offering pure web based translation memories, where all of the data is stored and indexed on a publicly accessible server or server farm.

Corpora

In linguistics, the term corpus (plural corpora) is used to refer to a large and structured set of texts, generally kept in a database for research purposes. A corpus is a set of linguistic data. It is important to see that TMs can make useful linguistic corpora, and that massive corpora in the form of TMs are most useful for statistical machine translation (SMT). In both academia and the commercial translation industry, there are many TMs and corpora that are kept private. These are closed corpora. Closed corpora are counter-productive to the goals and ideals of Open Translation.

Vendors and Tools

Babylon ($)

Babylon is a commercial software suite that includes a combination of dictionaries, translation memories and machine translation services. It is not free or open source, but it is inexpensive, well designed and provides very broad support for multilingual dictionaries, domain specific glossaries, and other tools that professional translators need.

Google Translator Toolkit

Part of Google Translate, this is a web hosted translation editor with an integrated translation memory. It can be used as a stand-alone translation editor, and it also enables teams of translators to work collectively on a document and to share their translation memory locally or with the web at large. The tool was launched in June 2009 to good reviews. The consensus view among translation experts is that its primary purpose is to feed professional quality translations back to Google to train its statistical machine translation engine. The Toolkit is not an open source tool per se, as it is a closed system run by Google, but it is free and easily accessible via the web. Google does reserve the right to charge for the service at a future date.

SDL / Trados ($)

Trados is a commercial product and the most widely known and used suite of TM tools in the professional translation world. It is used by translation agencies and translators alike. It is a desktop based tool but can be adapted to work with large content management systems (CMS) as well. SDL is one of the largest translation/localisation companies in the world.

Omega-T

Omega-T is a free, Java based translation memory that runs as a local, desktop application. It is similar to translation memory tools like Trados.

Worldwide Lexicon

The Worldwide Lexicon is a free, open source suite of translation memory and translation management tools that can be used to create a wide array of multilingual web applications and publications. One of its core features is an open translation memory that combines human and machine translations, and is accessible via a simple web services API. WWL is open source (BSD style license), written primarily in Python and runs on Google's App Engine grid computing platform.