User Generated Scores

The simplest system is a user-mediated system where users vote on documents, translations and other users. The idea is to give users a means of measuring interest or quality. There are several ways to prompt users and count votes, among them:

Simple binary voting systems (up/down, plus/minus, good/bad)
Subjective quality scale (1 to 5 stars, drop down menus)
Implicit voting systems (number of times viewed, average time spent viewing a page)
Surveys

Binary Voting Systems

The simplest form of a user-based voting system is the binary system, where users are asked to answer a simple question with two options, such as "Is this a good translation? [yes/no]". This can be presented to the user as an icon-based interface (up arrow, down arrow), or as a question. The choice of interface will depend on the visual design of your system. While an icon-based interface may seem like a better choice, it can be ambiguous about what a yes vote means, whereas a well-phrased question will make it clear that you are asking the user specifically about the quality of a translation.

The data logged in this type of voting system will consist of the following fields, all of which are easy to store in a typical SQL or key/value data store:

date/time stamp
unique ID, hash key or serial ID of the translation edit being scored (this ID should be unique for every edit or instance of a translation for a document, so scores are linked to a specific edit, and can be tracked back to the translator)
voter IP address
voter location (city, country, lat/long coordinates), derived from IP address via a geolocation service such as MaxMind
voter username (if logged into a registration system)
vote (+1 or -1)

From this raw data, you can easily compute summary scores (e.g. an average of all scores, standard deviation to measure variability, etc). The summary statistics can be generated in a batch process based on a schedule, on demand, etc.
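
As a rough illustration, the log record and summary statistics described above might be sketched in Python as follows. The class name VoteRecord, the function summarize and the sample data are hypothetical, and the population standard deviation is just one reasonable choice of variability measure.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean, pstdev
from typing import Optional


@dataclass
class VoteRecord:
    """One logged vote (field names are assumptions, not a fixed schema)."""
    timestamp: datetime
    edit_id: str                    # unique ID of the translation edit being scored
    voter_ip: str
    voter_location: Optional[str]   # e.g. "Paris, FR", from a geolocation lookup
    voter_username: Optional[str]   # None for anonymous voters
    vote: int                       # +1 or -1


def summarize(votes, edit_id):
    """Return vote count, mean and population standard deviation for one edit."""
    values = [v.vote for v in votes if v.edit_id == edit_id]
    if not values:
        return {"count": 0, "mean": None, "stdev": None}
    return {"count": len(values), "mean": mean(values), "stdev": pstdev(values)}


# Made-up sample data for illustration only.
votes = [
    VoteRecord(datetime(2024, 1, 1, 12, 0), "edit-1", "192.0.2.1", None, "alice", 1),
    VoteRecord(datetime(2024, 1, 1, 12, 5), "edit-1", "192.0.2.2", None, None, 1),
    VoteRecord(datetime(2024, 1, 1, 12, 9), "edit-1", "192.0.2.3", None, "bob", -1),
    VoteRecord(datetime(2024, 1, 1, 12, 9), "edit-2", "192.0.2.3", None, "bob", -1),
]
summary = summarize(votes, "edit-1")
print(summary)
```

The same function could be called from a scheduled batch job or on demand; only the source of the vote list would change.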

Subjective Quality Scale

A subjective voting system asks users to score texts on a continuum, ranging from poor (0 points) to excellent (5 points). If users are properly trained, this allows you to collect an extra dimension of data, specifically about the skill level of each user. The binary system is designed primarily to identify very good, very bad or malicious contributors, and does not distinguish between a native speaker and a translator who is proficient in a language but obviously not a native speaker. With a scale-based scoring system, you can associate meaning with a score, such as:

0-1 points : very poor quality, malicious edits, spam, should be discarded
1-2 points : poor quality, comparable to bad machine translation, should be discarded
2-3 points : fair quality, comparable to a decent machine translation, keep it if there is no better alternative
3-4 points : good quality, better than machine translation, proficient human translator but not native speaker
4-5 points : excellent quality, college level writing, native speaker skill level
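
The score bands above share their boundary values (for example, 2 points appears in two bands), so an implementation has to pick a convention. The sketch below assumes that a boundary score belongs to the higher band; the labels and function names are hypothetical.

```python
# Lower bound of each band, highest first. Assumption: a score that sits
# exactly on a boundary (1, 2, 3 or 4) is assigned to the higher band.
BANDS = [
    (4.0, "excellent"),
    (3.0, "good"),
    (2.0, "fair"),
    (1.0, "poor"),
    (0.0, "very poor"),
]


def quality_band(score):
    """Map a 0 to 5 score to the label of the band it falls in."""
    if not 0.0 <= score <= 5.0:
        raise ValueError("score must be between 0 and 5")
    for lower_bound, label in BANDS:
        if score >= lower_bound:
            return label


def should_discard(score):
    """Per the table above, the two lowest bands are discarded."""
    return quality_band(score) in ("very poor", "poor")
```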

The user interface for a system like this is straightforward, as you can prompt users to submit scores via 0 to 5 star buttons or a similar method. Netflix is a good example of this type of voting system. The main issue to be aware of with this system is that users can easily become confused and think that you are asking them to vote on how interesting the source material is, or on other things that are not directly related to translation quality. Therefore, it is important to embed prompts in the interface so that it is very clear to users that this tool is used to assess translation quality rather than how interesting the source document or overall topic is. The data logged by this system is essentially the same as for the binary system, except that the vote field is a numeric (integer or float) value instead of a binary yes/no vote.

Implicit Voting and Scoring

There are also a number of ways you can assess quality and level of interest in translations via implicit web analytics methods. Among the techniques that can be used are:

Tracking page views for translations across different versions (good translations are more likely to be viewed and shared than others)
Tracking the time spent viewing a page by periodically reloading a 1-pixel image within the page

This tracking technique will not give you an absolute or direct way of measuring quality, but you can usually derive from user behavior how good or interesting a text is compared to other pages or URLs in your system. This is related to the technique of using the Internet and web analytics to track links into your system, page view statistics, etc.

The number of page views is not an especially useful way of assessing translation quality because page traffic can be driven by many factors not related to translation quality or even the quality of the source material. Time spent viewing a page, on the other hand, is a good signal of document quality, although you can't really tell whether users are staying on a page because the underlying subject is interesting or because the translation is good. It is probably a combination of both. However, if most users abandon the page immediately, this is a strong signal of a quality problem that may be more strongly associated with translation quality.

You should not use this data to directly set or modify scores for translators, but you can use this data to identify problem areas of your site. If you see a topic or cluster of pages that have very low time on site statistics, this is a sign of one of the following conditions that editors should examine more closely:

The page is malformed or otherwise unusable or unreadable
The underlying source document is not interesting
The translation is very poor, a malicious edit or spam

Think of this as a general purpose quality assurance tool for your website, translation portal, etc rather than as a direct measure of translation quality.
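
A minimal sketch of this kind of check, assuming the 1-pixel image is reloaded every 10 seconds and each reload is logged with a page URL and a session ID. The interval, the 20-second threshold and all names here are illustrative assumptions.

```python
from collections import defaultdict
from statistics import median

PING_INTERVAL_SECONDS = 10  # assumed interval between image reloads


def dwell_times(pings):
    """pings: iterable of (page_url, session_id), one tuple per image reload.

    Returns {page_url: [estimated seconds on page, one entry per session]}.
    """
    counts = defaultdict(int)
    for page, session in pings:
        counts[(page, session)] += 1
    per_page = defaultdict(list)
    for (page, _session), n in counts.items():
        per_page[page].append(n * PING_INTERVAL_SECONDS)
    return per_page


def pages_to_review(pings, min_median_seconds=20):
    """Pages whose median estimated viewing time is below the threshold."""
    return sorted(
        page
        for page, times in dwell_times(pings).items()
        if median(times) < min_median_seconds
    )
```

The result is a list of pages for editors to look at, not a score for any translator, in keeping with the advice above.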

Surveys

Online surveys are a useful tool for asking your visitors specific questions, though about your website as a whole rather than about an individual person or translation. Examples of questions you might ask people include:

On average, how would you rate the quality of volunteer translations on this site (1-5 stars)?
On average, how would you rate the quality of the source pages or articles on this site?
How interesting are they?
What would you like to see in a more detailed way? (free form question, category multiple choice, etc).
What type of material are you most interested in translating? (free form question, category multiple choice, etc).

Again, this is not something you can use to directly evaluate individual translations or rate translators, but it is a good way to read the mood of your audience and get more detailed information about the type of translations or content they are interested in reading or contributing.

Framing Questions and User Interface Design Issues

It is not always clear to users whether they are voting on how interesting the original article is, or on the translation itself. No matter how clear the symbol is, you need specific explanations that frame the question in each language you plan to quiz users in. Tooltips are a good way to provide additional prompts on top of a visual interface (e.g. up/down arrow icons). The tooltip should clearly explain what you are asking the user to vote on, for example:

"Please rate this translation, NOT the article or content. Rate the translation from 0 (very poor / spam) to 5 (excellent / native speaker)."

The important thing is to clearly state what the user is being asked to vote on, and if you are using a subjective quality scale, be sure to give an unambiguous definition of what different scores mean (e.g. 5 means excellent / native speaker, while 4 means very good / proficient / not quite native quality). As various users will perceive each threshold differently, they should be properly briefed to allow you to collect useful statistics.

For a binary voting system, the question should be framed in a different way, and the votes should be linked to a particular system action. For example, an UP vote might also add that translator to your list of favorite translators, so you will see more of this specific translator's work in the future. On the other hand, one or several DOWN votes may add a translator to your blocked list, so that translations of this translator will be blocked from your view. This is especially useful as a way of detecting and filtering spam submissions.

Voting Privileges and Weighting of Votes

You may decide to combine a user generated scoring system with an expert driven scoring system. In this type of hybrid system, you collect the scores of two different user populations: ordinary or anonymous users, and known trusted users or editors. You can then evaluate the scores submitted by each type of user, using a formula such as:

composite score = (editor score x 0.70) + (user score x 0.30)

This type of weighted score thus treats expert scores as more significant than user scores. You can adjust the weighting ratio according to your own preferences.
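
The formula above translates directly into a small function. This is only a sketch; the function name and the default 70/30 split are taken from the example and can be changed freely.

```python
def composite_score(editor_score, user_score, editor_weight=0.70):
    """Weighted blend of an editor score and a user score.

    editor_weight is the share given to editors; users get the remainder,
    so the two weights always add up to 1.0.
    """
    if not 0.0 <= editor_weight <= 1.0:
        raise ValueError("editor_weight must be between 0 and 1")
    user_weight = 1.0 - editor_weight
    return editor_score * editor_weight + user_score * user_weight


# Editors rate a translation 4.0 and users rate it 3.0: roughly 3.7 overall.
print(round(composite_score(4.0, 3.0), 2))
```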

Assigning Credit (and Penalty) In Collaborative Translations

As a general rule, you should collect votes on the same unit of text that translators work on. If your system is paragraph based, scores should be collected for paragraphs, not the document as a whole, because many translators may have contributed to it. It is also important to design a good revision tracking system so that the system keeps a record of every edit and action within a translation, and not just of the most recent changes.

It is very difficult to evaluate individual units of work accurately, because a translator who changes only a few words may have contributed significantly to the quality of the whole work, while another translator may rewrite a large block of text without significantly changing the end result. The simplest approach is to assign each score to whoever touched the item being scored most recently. If you collect enough votes, the differences between major and minor edits will tend to average out.
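
One way to apply the "most recent editor" rule is to look up, for each vote, the last revision of that paragraph made at or before the time of the vote. The sketch below assumes revisions and votes are available as simple tuples with comparable timestamps; the function name and data layout are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def credit_votes(revisions, votes):
    """Attribute each vote to the translator who last edited the paragraph.

    revisions: list of (timestamp, paragraph_id, translator)
    votes:     list of (timestamp, paragraph_id, score)
    Returns {translator: [scores credited to them]}.
    """
    history = defaultdict(list)
    for ts, paragraph, translator in sorted(revisions):
        history[paragraph].append((ts, translator))

    credited = defaultdict(list)
    for ts, paragraph, score in votes:
        edits = history.get(paragraph, [])
        times = [edit_ts for edit_ts, _ in edits]
        idx = bisect_right(times, ts)  # number of edits made at or before the vote
        if idx == 0:
            continue  # vote predates any recorded edit, so nobody is credited
        credited[edits[idx - 1][1]].append(score)
    return dict(credited)
```

This only works if the revision log keeps every edit, which is why the full revision history mentioned above matters.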

Analyzing Statistics

Calculating raw scores, averages and standard statistical measures (e.g. standard deviation) is easy to do on demand from your raw score logs. If you have a large and dynamic user community, you will probably want to perform additional statistical analysis, both to drive automated rules engines (e.g. which submissions to allow or block), and to learn more about your user community. Examples of the types of statistics you may want to generate are:

Regression plot of translator scores against time or account age: this will tell you whether a translator or a translator population tends towards better or worse quality over a longer period (e.g. translators are learning and improving with practice)
Geographical distribution using geolocation data associated with user IP address
Temporal distribution: when and how quickly do translations stream in after an article is published, what is the typical time frame distribution, and is there a relationship between source article age and translation volume?
Detecting unusual geographical (location), network (IP address / block) or temporal voting patterns, which are a signal for fraudulent or robotic voting activity.

The outputs of these processes can be human readable reports and graphs for editors and managers, as well as machine readable reports that can be fed back into rules and workflow management systems (for example, to automatically compute the optimum cutoff threshold of quality scores for allowing or rejecting translations in a peer reviewed system).
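
As an example of the regression item in the list above, the trend of a translator's scores against account age can be summarized by the slope of an ordinary least-squares line. This is a minimal sketch using only arithmetic; the function name and input layout are assumptions.

```python
def score_trend(points):
    """Least-squares slope of score against account age.

    points: list of (account_age_days, score).
    A positive slope suggests improving quality and a negative slope
    suggests declining quality. Returns None when a slope cannot be
    computed (fewer than two points, or all points at the same age).
    """
    n = len(points)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        return None
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx
```

The slope could feed a human readable report or a rule, for instance flagging translators whose scores are trending downward.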

Filtering Submissions Based on Quality Scoring

Once you have obtained raw and derived statistics from sufficient votes, you may want to feed the data back into your translation management system, either by using summary information to design system rules, or by automatically generating rules based on the statistics you generate. Examples of the types of rules you can generate in this way include:

If a translator receives > x DOWN / NO votes from independent users or IP addresses, BLOCK this translator from display to ALL visitors
If a translator receives > x UP / YES votes from independent users or IP addresses, highlight them as a featured translator and display their submissions first
If a translator has received fewer than x YES or NO votes, or has a standard deviation > y, treat them as unknown and require peer review for their submissions first
If the system monitors a burst of votes from a certain location, IP address range or within an unusually short time period, IGNORE all votes submitted during this time period or from that source (assume these votes are suspect and possibly robotic).
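
The rules above could be sketched roughly as follows. All thresholds (the x and y in the rules), the status labels, the order in which the rules are checked, and the choice to treat a source as suspect after more than 20 votes in 60 seconds are illustrative assumptions. Only one vote per voter ID is counted, as a simple stand-in for "independent users or IP addresses".

```python
from collections import defaultdict
from statistics import pstdev


def classify_translator(votes, block_after=5, feature_after=10,
                        min_votes=5, max_stdev=0.9):
    """votes: list of (voter_id, vote), where vote is +1 or -1."""
    latest = {}
    for voter, vote in votes:
        latest[voter] = vote  # keep one vote per voter (the most recent)
    values = list(latest.values())
    if values.count(-1) > block_after:
        return "blocked"
    if values.count(1) > feature_after:
        return "featured"
    if len(values) < max(min_votes, 1) or pstdev(values) > max_stdev:
        return "needs_review"  # too few votes, or too much disagreement
    return "normal"


def suspect_sources(votes, window_seconds=60, max_per_window=20):
    """votes: list of (timestamp_seconds, source, vote).

    Returns the set of sources (e.g. IP addresses or address ranges) that
    cast more than max_per_window votes within any window of window_seconds.
    """
    by_source = defaultdict(list)
    for ts, source, _vote in votes:
        by_source[source].append(ts)
    suspects = set()
    for source, stamps in by_source.items():
        stamps.sort()
        start = 0
        for end, ts in enumerate(stamps):
            while ts - stamps[start] > window_seconds:
                start += 1  # slide the window forward
            if end - start + 1 > max_per_window:
                suspects.add(source)
                break
    return suspects
```

Votes from any source returned by suspect_sources would be dropped before classify_translator is applied.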

There is no limit to the variety of rules you can generate from the statistics you collect. In general, for smaller translation communities you won't have a large database of raw votes, so you'll probably want to use simpler scoring criteria and rules, while for a large translation community you will be able to collect a lot of data and learn a great deal about your user community.

Incentives, Penalties and Community Dynamics

Add some text on how incentives promote good work, while negative social pressure can deter incompetent or malicious individuals from returning. Also discuss balancing positive peer pressure with negative pressure, depending on the mood you want to set for translators.

Summary

User mediated scoring systems, when properly designed, are simple to implement and can collect a large amount of raw data that can be further analyzed to develop a picture of who your users are, what they are interested in, and who is doing good or bad work. This also increases participation in the system because individual users can see how their voting activity influences what they see, while translators have a reward incentive to do good work, as this will increase their score and reputation within the system. In all volunteer systems, reputation itself is a form of currency because translators will often use their profiles and by-lines to promote themselves, the work they do, and their companies.

Expert Driven Voting Systems

In an expert driven voting system, editors or other privileged users are able to vote on translator submissions via a separate administrative interface. This can be done in place of or in addition to user related scoring systems. As your editors and trusted users have a wider knowledge of your system, standards and practices, you can ask them to use a different set of scoring criteria. Examples of the types of tasks experts can perform might be:

Assign quality scores for several dimensions of translation quality (e.g. one score for grammar, one for syntax, another score for general style and a different one for quality of writing).
Allow editors to accept or reject a large number of submissions in bulk via an administrator's web interface for bulk actions.
Experts can be trained to use consistent criteria for assigning different quality scores. (Users can do this too, but may lack the training and experience to do so consistently.)

The data you collect from experts can be used independently or as a composite with user related scores to decide whose translations to accept, who the best translators are, and so forth. You can also generally assume that these users can be trusted to make decisions without extensive cross-checking (such as a decision to ban a user for submitting spam or inappropriate material).

Framing and Standardizing Questions For Editors

As with ordinary users, it is important to frame questions correctly for your editors. This is more a matter of training and documentation than of web interface design. You will typically provide a simple set of voting tools that are part of an administrator's interface. This interface will be more utilitarian, so these users can work in bulk and process large volumes of submissions.

The important thing is to document what the process is for scoring users, what the criteria and thresholds are, and how to perform common administrative tasks. This document should cover topics such as:

On a 5 point quality scale, which skill levels or quality are required for a given score level (e.g. 4.5 to 5 points means excellent / native speaker, while 4 to 4.5 might mean excellent but not quite native level).
How do you ban or block a user, IP address or a whole address range?
How do you perform bulk actions, such as accepting 50 translated texts in a single batch for publication?
How do you perform common administrative tasks?
What is the editorial basis for rejecting translations or banning a user?

This is primarily a matter of documentation, as well as providing an online forum where editors and supervisors can meet to discuss items, pending texts or have general discussions related to the system, translations, etc.

Hybrid User / Expert Systems

A hybrid system combines statistics from ordinary users and editors. There is a variety of ways to do this, including:

Generating composite, weighted scores that represent both user votes and votes from editors or supervisors.
Implementing rules where editor decisions may override system defaults (e.g. an editor manually bans or blocks a user or IP address block from the system)

Composite Scores

If both ordinary users and editors are being asked to submit similar quality scores (e.g. on a five point scale), these scores can be combined to generate a weighted score using the formula below:

 composite_score = (average_editor_score x editor_weight)
                    + (average_user_score x user_weight)

where

editor_weight is a weighting factor ranging from 0.00 (0%) to 1.00 (100%)

user_weight is a weighting factor ranging from 0.00 (0%) to 1.00 (100%)

and editor_weight + user_weight = 1.00

This is a simple weighting formula. However, you may want to use different weights depending on how many votes users have submitted, the logic being that a large set of user votes is statistically more reliable and not easily skewed. In this case, you may use a weighting formula that adjusts the weights: when there are a large number of user votes, the user generated score has a weighting factor closer to 1.00, and when there is only a small number of user votes, the editor's score has a higher weighting factor. There are countless variations on this theme.
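
One possible variation, shown here only as a sketch, lets user_weight grow smoothly with the number of user votes. The specific curve (votes divided by votes plus a constant) and the parameter half_weight_votes are assumptions chosen for simplicity, not a recommended standard.

```python
def adaptive_composite(average_editor_score, average_user_score,
                       user_vote_count, half_weight_votes=50):
    """Composite score whose user weight depends on how many users voted.

    user_weight is 0.0 with no user votes, 0.5 when user_vote_count equals
    half_weight_votes, and approaches 1.0 as the vote count grows.
    editor_weight is always 1.0 - user_weight.
    """
    if user_vote_count < 0 or half_weight_votes <= 0:
        raise ValueError("vote counts must be non-negative and the constant positive")
    user_weight = user_vote_count / (user_vote_count + half_weight_votes)
    editor_weight = 1.0 - user_weight
    return (average_editor_score * editor_weight
            + average_user_score * user_weight)
```

With the default constant, an editor average of 4.0 and a user average of 2.0 give 4.0 when no users have voted and 3.0 once 50 users have voted.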

Deciding Which Rules Take Precedence

Another issue to consider in hybrid scoring systems is which generated rules take precedence. For example, when should an editor's decision to ban a user take precedence over users' decisions to UP vote or allow the same user? You will probably make the following assumptions in defining these policies:

Editors are more trustworthy, so an editor's decision to allow or ban/block a user overrides the others
Large numbers of user votes, if they are not artificially generated, may be more valuable than subjective opinions of individual editors, especially if the goal is to assess quality from the reader's perspective (the editors may be language geeks who downvote a user for minor errors that readers ignore)
Small numbers of user votes are less statistically significant, and easily skewed, so in situations where an editor score diverges from a user score that was calculated from just a few votes, you may want to assign more weight to the editor's score.
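
A precedence policy built on these assumptions might look like the sketch below. It implements one particular choice: an explicit editor decision always wins, and user votes decide only when no editor has acted and enough votes exist. The function name, status labels and the minimum of 20 votes are hypothetical, and a system that gives more weight to large user vote counts would order these checks differently.

```python
def resolve_translator_status(editor_decision, up_votes, down_votes,
                              min_user_votes=20):
    """Decide a translator's status from editor and user input.

    editor_decision: "ban", "allow", or None if no editor has acted.
    """
    # Assumption 1: an explicit editor decision overrides user votes.
    if editor_decision == "ban":
        return "blocked"
    if editor_decision == "allow":
        return "allowed"
    # Assumption 3: a handful of user votes is easily skewed, so defer to editors.
    if up_votes + down_votes < min_user_votes:
        return "needs_editor_review"
    # Otherwise the user majority decides.
    return "allowed" if up_votes >= down_votes else "blocked"
```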

Internet Based Scoring

The Internet itself is a valuable tool for measuring how interested users are in your content and translations, and for detecting major quality or system problems via standard web analytics. This technique is not generally useful for assessing the individual quality of translations, but will tell you things such as:

Which source articles or translations are most popular in general or within a particular topic, domain, or language
The "useful life" of a source article and its translations (how quickly does the content age)
Average time spent looking at the site and at individual pages
Browser capabilities, language preferences and other visitor information
How linked is an article, both in terms of inbound links to the source article, as well as third party links to the translated versions

This data, in turn, will tell you what people are most interested in (individual pages, categories, etc), which languages they are reading, translations they prefer or are searching for, where you have major quality problems (very low time spent viewing a page or section), where and how people are visiting your site, and so forth. This will enable you to make editorial decisions about where to direct translators to spend their time, where you focus on marketing your site or service, and so on.

One way to use this technique to detect bad translations arises when you are routinely translating source texts into multiple languages. In this case, you may see a significant difference in time spent viewing the various pages, e.g. Japanese users abandon a page within 20 seconds, while French users spend several minutes viewing the French version of the text. This is an indication of a significant difference between the two translations, which may be due to quality problems, the presence of spam, faulty presentation, etc.
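
The comparison across language versions can be sketched as follows, assuming viewing times per visit have already been collected for each language version of the same article. The function name and the 0.25 ratio are illustrative assumptions; a flagged language is only a prompt for an editor to look more closely.

```python
from statistics import median


def lagging_languages(times_by_language, ratio=0.25):
    """Flag language versions that readers abandon much faster than the rest.

    times_by_language: {language_code: [seconds spent per visit]}.
    A language is flagged when its median viewing time is below `ratio`
    times the median of the other languages' medians.
    """
    medians = {
        lang: median(times)
        for lang, times in times_by_language.items()
        if times
    }
    flagged = []
    for lang, lang_median in medians.items():
        others = [m for other, m in medians.items() if other != lang]
        if others and lang_median < ratio * median(others):
            flagged.append(lang)
    return sorted(flagged)
```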

Summary

The mere existence of a voting system is a huge incentive by itself. It enables users and editors to cross check each other, and creates the basis for a rewards based system (or penalty system) that drives the community towards increased participation and higher quality work. As you can see, there are several techniques that can be used in parallel to assess source text and translation quality and, derived from that, the quality of each user's contributions to the system. In considering these different methods, we'll highlight each method and its relative strengths and weaknesses.

User Mediated Scoring Systems

User mediated systems work best when you have a large user community looking at a steady stream of source texts and the respective translations. It is then relatively easy to prompt readers to score source texts (do they merit translation?) and translations, providing enough votes to develop an accurate picture of what people want, and where the good and weak actors are. The system's main strength is its simplicity, both in terms of what it asks of users, and how the data is stored and analyzed. Its main weakness is that it requires a large number of votes per user system-wide to generate useful reports.

Expert Driven Scoring Systems

Expert driven scoring systems are useful both as a fail-safe (to override bad or damaging contributions or decisions from users), and as a way of compensating for a low volume of user submitted scores (which may often be the case for smaller or less engaged user communities).

Internet / Web Analytics

Internet and web analytics are generally useful for learning who your users are, what they are reading or translating, which languages they read, and where they are spending time. This will help you to understand which kind of content they are most interested in (and where to direct your translators to spend their time), and where you may have problems within your site (e.g. your Japanese readers are abandoning pages quickly while French users linger far longer). They generally will not evaluate translation quality, at least not directly, but play a major role in understanding your user community, how people are finding you, and so on.