Open Translation Tools

Wikipedia

Most readers will have encountered Wikipedia articles in the course of searching for information on the Internet. Many of you will know that there are versions of Wikipedia in multiple languages. What is less well known, except to the people who are active contributors to these projects, is that there are over 250 language versions and that each language version is an independently run project with its own policies, administrators, and content.

While they all have the same aim, some are better at providing information than others. The English language Wikipedia, with around 3 million articles, gets half the traffic. The Buginese language Wikipedia, though, has only 83 articles and may require you to install a font just to see them. 

Each project is addressed to the speakers and readers of its language. This means that the content is written in that language, but beyond that, the articles are written from the cultural perspective(s) of the peoples that speak that language. For example, the English language Wikipedia will cover topics from the various perspectives of English-speaking populations in the United Kingdom, Australia, the United States and so on, while the Greek language Wikipedia will cover topics from the cultural outlook of the inhabitants of Greece and Cyprus.

This does not mean that every Wikipedia community must write all of its articles from scratch. In fact there is a vast amount of content sharing between the projects, which often involves the translation of articles in whole or in part. In particular, historical events or living persons from a particular region are more likely to have an article chock-full of good information in the Wikipedia of the language of their region. In some cases an editor may choose to translate the entire article as is, leaving it for others afterwards to adapt it or add information that makes it relevant to local readers; in other cases the editor may translate only parts of it or may write a summary or adaptation. Once the article has been created on the local project, it has a life of its own and is treated like any other article as far as editing, expansion or even deletion is concerned.

Translation Process

While all the Wikipedias are autonomous and define their own policies, the reality is that the English Wikipedia is the model that most projects emulate, and the content of the English Wikipedia inspires the creation of many articles. There is a continuum here, from people literally translating an article to people reading the article and either writing a summary or rewriting it in their own words.

There are no formal mechanisms for splitting up a text and parcelling it out to multiple translators. Since the translator community consists of all users, registered or not, who choose to translate part of an article, there is no mechanism for communication between all members of the translator community concerning a particular text except by adding content to the text itself. People who translate on a regular basis tend to set up informal communication mechanisms among themselves, usually by email, IRC or Skype.

The choice of what material is translated, rewritten, or written is left completely to the people who volunteer to create these articles. As a result there may be a bias towards certain topics. Sadly, there are no tools that measure which missing articles have been requested most often.

There is also no formal relation between articles on the same subject in the different Wikipedias. The articles are essentially different. For the readers of the article there are "interwiki links" that connect these articles. In this way a reader can learn about a subject from several cultural and linguistic points of view.

Additionally, there are technical challenges when an article is translated in its entirety. Wikipedia makes use of what is called "Wiki syntax", which allows for the creation of "templates" and "info boxes". These constructs require localisation, and this requires the kind of expertise that is associated more with programmers than with translators.
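To give a feel for what localising a template involves, here is a minimal sketch in Python. It is an illustration only, not a tool that any Wikipedia uses: the target names in the two mapping tables (LocalInfobox, local_name, local_birth_date) are invented placeholders, and the function handles only a single, flat template call whose values contain no nested templates or piped links.

```python
# Hypothetical lookup tables. A real project would maintain its own names;
# every target name below is an invented placeholder.
TEMPLATE_NAMES = {"Infobox person": "LocalInfobox"}
PARAM_NAMES = {"name": "local_name", "birth_date": "local_birth_date"}


def localise_template(call: str) -> str:
    """Rename one flat template call such as {{Name|key=value}}.

    Assumes no nested templates and no '|' inside parameter values.
    Unknown template or parameter names are passed through unchanged.
    """
    if not (call.startswith("{{") and call.endswith("}}")):
        raise ValueError("not a template call: " + call)

    name, *params = call[2:-2].split("|")
    parts = [TEMPLATE_NAMES.get(name.strip(), name.strip())]

    for param in params:
        key, sep, value = param.partition("=")
        if sep:
            # Named parameter: translate the key, keep the value as is.
            key = PARAM_NAMES.get(key.strip(), key.strip())
            parts.append(key + "=" + value.strip())
        else:
            # Positional parameter: nothing to rename.
            parts.append(param.strip())

    return "{{" + "|".join(parts) + "}}"


if __name__ == "__main__":
    print(localise_template("{{Infobox person|name=Ada Lovelace|birth_date=1815}}"))
```

Even this toy version shows why the task leans towards programming: the translator has to know which names exist on the target project, and real templates nest inside one another in ways a simple split on "|" cannot handle.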

The Technical Process

So, just how does this process work? What is the Wikimedia workflow? How do we coordinate translation among multiple editors? How do we deal with attribution (giving credit to the authors of the original article)?

The process will differ from one language community's Wikipedia to another; recall that each project is autonomous. But typically the procedure works like this:

  • The editor decides that he or she wants to translate some or all of an article from another Wikipedia version.
  • The editor makes a request for the importation of the article from the other Wikipedia. The import mechanism brings over the full text of the current version of the article and all previous versions, along with the date, time and author of each version. In addition the discussion page of the article may be brought in; this can contain discussions about the title, removal or alteration of controversial content, pointers to external resources, and so on. It may also contain attribution information in special cases, as we will see below. The request for importation may be made formally by leaving a message on a page for requests to administrators, since importation is a privileged operation not available to all users, or it may be made by leaving a note on the talk page of an administrator that the editor knows well or believes is active at that moment.
  • The article is then imported by an administrator who specifies the language, the project name and the page name for import, clicks a button, and waits. In some cases, when the article for import has been edited many times, the import may fail. If repeated attempts fail, the article must be copied in by hand, and the version attribution information must be copied in by hand as well. Typically that information will be placed on the newly created local article's discussion page for reference.
  • Now the work of translation can begin. There is no formal workflow management mechanism; projects have developed their own informal methods for doing this. One typical approach is that the editor can tag the newly created article with a template at the top of the page which says for example "This article is being edited for an extended period of time. Please do not edit while this notice is visible." The reader might wonder how, among the many articles on a project, someone would happen to choose to edit the very one that has just been imported, but that's part of the way that Wikipedia and the other Wikimedia projects work. A new editor may often have no idea where to jump in, given the huge pool of information available for editing or cleanup, so they may often check the page listing "Recent Changes", which shows up to 500 of the most recent edits, including article creations.
  • The editor doing the translation will now proceed to edit the article online using the edit form presented to them in the web browser. (Smart editors doing long texts will edit off-line, saving their work frequently, and cutting and pasting the results into the form when they are ready to upload.) Depending on the length of the text and the editing style of the contributor, they may save after translating one or two paragraphs, after every sentence -- though this is frowned upon because it fills the recent changes list with many small edits -- or after translation of the full article.
  • Notice what has not been mentioned here. There is no translation memory, there is no central glossary, there is no mechanism for handing the translation off to a proofreader, there are no quick links of terms to multilingual dictionaries, nothing. Our translators are by and large volunteers who do not have access to professional tools, so they are basically on their own. A given language project may have a glossary of terms with suggestions for translations, but there is no requirement that the glossary be followed, and indeed there is no way that such a mechanism could be enforced, unless an editor checked each translation against such a glossary manually.
  • When the editor saves the article, they comment out any untranslated text remaining, using HTML-style comments (<!-- and -->).
  • Once the editor has saved the part of the translation they intend to complete that day, they remove any notices or templates from the page, opening it up to others for editing. Additionally they will usually add a template at the bottom of the page or on the discussion page which provides a link to the original article as the basis for the current page. Editors go through the import process in addition to providing a link back to the original article because there is no guarantee that the original article or the version of the article that was translated will not have been moved or deleted when a reader views the translation in the future.
  • Another editor may notice the article in the recent changes list and may decide to work on another piece of the translation, if there are other parts to be done. There is no requirement that they do paragraphs in order or indeed that any more of the text ever gets translated. Two years might go by without further changes to the article, or editors might add content based on other sources without ever referring again to the source article.
  • When multiple editors have translated parts of the same article, differences in style or the way specialized terms have been rendered in translation may make the new article a bit hard on the eyes. As with readability issues unrelated to translation, any editor is free to clean up such inconsistencies as they see fit.
  • Machine translations of articles are routinely deleted from some Wikipedia versions. While the contributor who adds a machine translation probably does not speak the target language and hence has no other way to add content in that language, most communities view the addition of machine-translated content as doing more harm than good. Since all editors work on a volunteer basis, they generally prefer to work on articles of interest to them, rather than cleaning up machine-translated articles in an area that they may not know or care about. For some editors, another disincentive is that cleaning up machine-translated content so that it can be retained may be more time-consuming than translating straight from the original text.
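The step in which untranslated text is commented out before saving can be sketched in a few lines of Python. This is a hypothetical helper, not part of MediaWiki or of any project's tooling; it assumes the article is held as a list of paragraphs and that the editor has translated the first few of them in order.

```python
def comment_out_untranslated(paragraphs, translated_count):
    """Return page text with the untranslated tail hidden in an HTML comment.

    paragraphs:       list of paragraph strings, in article order
    translated_count: how many leading paragraphs are already translated
    """
    done = paragraphs[:translated_count]
    pending = paragraphs[translated_count:]

    # A "-->" inside the hidden text would close the comment early
    # and expose the rest of the untranslated text to readers.
    if any("-->" in p for p in pending):
        raise ValueError("untranslated text already contains '-->'")

    parts = list(done)
    if pending:
        parts.append("<!--\n" + "\n\n".join(pending) + "\n-->")
    return "\n\n".join(parts)


if __name__ == "__main__":
    page = comment_out_untranslated(
        ["Translated intro.", "Untranslated history.", "Untranslated notes."],
        translated_count=1,
    )
    print(page)
```

The hidden paragraphs stay in the page source, so the next editor who opens the edit form can pick up where the first one stopped.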

Translating requested content

Some materials such as Wikimedia Foundation policies, surveys, or informational notices are disseminated to all of the projects, and these require translation across all projects within a relatively short period of time. These materials are typically the subject of a request for translation, handled on the project coordination site, meta.wikimedia.org. Because there is a time constraint, translators check these pages more frequently and may respond to requests fairly rapidly if the size of the translator community in a given language is large enough to permit it. No importation of text is necessary; one creates an article of the same page name with the language code appended (typically by clicking on a link to the missing page on a list of missing translations), copies in the original text by hand if desired, and then proceeds to edit. Translating off-line in these cases is less desirable, if only because another translator may be watching the same language version of the article and decide that the first editor is not actively working on the article if no changes are recorded in the recent changes list after a half hour or so. If the second editor then proceeds to translate the very part of the text the first editor was working on off-line, someone's effort will have gone to waste.
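The naming convention described above, the same page name with the language code appended, can be illustrated with a short Python sketch. The "/" separator, the page name "Fundraising notice" and both function names are assumptions made for the example; the text does not specify how the code is attached to the name.

```python
def translation_title(page_name, lang_code, sep="/"):
    """Build the title of a translation page: page name plus language code.

    The separator is an assumption for this sketch.
    """
    name = page_name.strip()
    code = lang_code.strip().lower()
    if not name or not code:
        raise ValueError("page name and language code are both required")
    return name + sep + code


def missing_translations(page_name, wanted_langs, existing_titles):
    """List the translation titles that do not exist yet, in the order requested."""
    return [
        title
        for title in (translation_title(page_name, lang) for lang in wanted_langs)
        if title not in existing_titles
    ]


if __name__ == "__main__":
    existing = {"Fundraising notice/el"}
    print(missing_translations("Fundraising notice", ["el", "fr", "bug"], existing))
```

A list produced this way corresponds to the "list of missing translations" a translator would click through to start a new language version.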

Quality and Quantity

Many people involved in the Wikipedia community care about the quality and quantity of the articles. This resulted, among other things, in a "must have list of articles". This list, while well intentioned, is very much culturally biased: do people who speak Buginese care about American football, and would an article in Buginese have the same relevance?