Open Translation Tools

Fonts and Encodings

Every writing system uses a collection of letters, numerals, punctuation, and other symbols together with formatting, layout, and shaping rules. When we talk about writing with computers, each letter or symbol is called a 'character', and an ordered collection of characters for a particular purpose is known as a 'character set'. There are well over 100 official character set standards in the world for 30 major modern writing systems, and many unofficial character sets. Unicode covers all of these and dozens of other writing systems.

Proper typography in any language requires several hundred characters, and in a few languages, several thousand.

  • The alphabet (or syllabary, such as Japanese kana; Chinese instead uses logographic characters), of course: Aa Bb Cc Dd Аа Бб Вв Гг Αα Ββ Γγ Δδ ջ վ գ ե かきくけこ
  • Numerals 1 2 3 一二三
  • Punctuation . , ; : ! ? “ ” « » – — ‡ • … ⁋
  • Grouping characters [ ] ( ) { } < >
  • Dingbats ✂ ✆ ✍ ✓ ❏ ❡ ➙
  • Math symbols + - × ÷ ± √ ∞ ∂ ∫ ↔ ↺ ∡
  • Other symbols ¥ ₤ ℃ ℞ ⌘ ╓ ◔ ☠ ☢ ♔

For most of the history of computers, and before that of Teletypes, typewriters, and even Morse code, nearly all of these refinements have been ignored in order to shoehorn a minimal set of characters into the available code space or mechanical keyboard and lever system. Most such spaces in recent times have been defined by the requirements of binary logic, so that the number of code points is a power of two: 32, 64, 128, 256, or 65,536 for 5, 6, 7, 8, and 16 bits respectively. Unicode dispenses with these restrictions, and has space for over a million characters. The expectation is that no more than a quarter of that will be needed for human use.

ASCII

Ignoring semaphore, Morse code, and other such manually operated systems, digital character sets begin with the 5-bit code invented by Émile Baudot in 1870. Five bits is only enough for 32 characters, not enough for the Latin alphabet plus numerals, so Baudot included a shift mechanism to access a second set of 32 characters. Manual Baudot code became the basis for the first electromechanical teletypes, which evolved through many modifications to use various 7-bit character sets, the ancestors of US-ASCII, which was standardized in 1963. In the meantime, IBM invented and extended its EBCDIC character set, derived from punched card codes, and other manufacturers went in other directions.

The use of Teletypes as computer terminals on Unix systems in the early days of the Internet made ASCII the de facto standard for many computer activities, including programming languages (except APL), e-mail, Usenet, and so on. It also determined the shape of computer keyboards, which had to provide 47 keys for the printable ASCII characters, unlike 44-key typewriters. Most computers offered an "extended ASCII" 8-bit character set, with various methods for entering the extra characters from the keyboard, but there was, and is, no agreed standard for such extensions.

Using such a restricted character set meant that many symbols had to be repurposed, or "overloaded" in computer jargon. For example, the '@' sign was included in Teletype codes to indicate prices in commercial messages; its use in e-mail addresses came much later. This overloading has caused innumerable problems for computers and people alike when attempting to determine which usage is intended. Is "/$%//;#~" line noise, euphemistic cursing, or program source code? Should quotation marks be left as is in translation, because the text is part of a program, or converted to the usage of the target language community, such as «» in French?

The Tower of Babel

The rest of the world insisted on character sets more appropriate to their needs, with appropriate punctuation, currency symbols, accented letters, or even a completely different alphabet or other writing system. Thus, the computer industry entered a Tower of Babel period in which user communities and companies developed their own character sets in profusion. See http://en.wikipedia.org/wiki/Character_set#Popular_character_encodings for a partial list.

European languages required additional characters beyond the basic ASCII character set, to display "francais" in its proper form, "français", for example. Asian languages have much larger character sets, representing thousands of different logographic characters, and Asian countries created a similar profusion of double-byte character sets, such as Big-5 (Taiwan), GB2312 (PRC), HKSCS (Hong Kong), Shift-JIS (Japan), and EUC-KR (South Korea).

The variety and disorganization of character set definitions have created numerous problems for software developers and users. The problem began with the common assumption that a program would only ever be used with one character set. That might work for an accounting program that only has to comply with the laws of one country, but it has become increasingly untenable in word processing, databases, and so on.

The first reaction was to localize, that is, to create different versions of software using different character sets, with different languages in the User Interface. This is too much work both for developers and for those managing the variety of products that results. It also fails to satisfy the requirements of multinational organizations where individuals may need to work in two, three, or four languages, and the organization as a whole in dozens.

Relying on many separate character sets also complicated life for software developers, because users typically needed to install special fonts to display the extra characters. It was also very difficult to display symbols from different character sets together, which made it difficult or impossible to build bilingual and multilingual interfaces that show many source and translated texts in the same user interface or window.

Encodings and Transformation Formats

A Character Encoding is a mapping between numeric code points and the letters and symbols of a given Character Set. Before Unicode, an encoding was also a mapping to a binary representation of the numeric code point, but in Unicode a code point is a number independent of any representation, and the various possible representations are called Transformation Formats.

We can understand this more easily in human terms. The number ten can be written as the English word "ten", or in various other forms, such as the familiar '10'. It is also '1010' in binary, '012' in octal, '0A' in hexadecimal, "十" in Chinese, "dix" in French, and so on. But it is always the same number. If we number the Latin alphabet A=1, B=2, and so on, are we assigning numbers to the letters, or are we assigning a written form to them? When we start to do ciphers using arithmetic on the letter values, it is clear that we mean numbers. Adding 1 to a letter value, with Z wrapping around to A, is a familiar example. It turns 'HAL' into 'IBM'.
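
The same point can be made executable. Here is a minimal Python sketch (the helper name shift1 is just an illustrative choice, not anything from a standard library): it prints the number ten written in four notations, then applies the add-one-and-wrap rule to turn 'HAL' into 'IBM'.

    # Four written forms of the same number, ten
    print(10, 0b1010, 0o12, 0xA)    # prints: 10 10 10 10

    # Add 1 to each letter value, with Z wrapping around to A
    def shift1(text):
        return "".join(chr((ord(c) - ord("A") + 1) % 26 + ord("A")) for c in text)

    print(shift1("HAL"))    # prints: IBM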

It is important to understand that in Unicode, a code point is a number, not the digital representation of a number. A set of rules for representing code point numbers in binary defines a Transformation Format. For most character sets there is only one Transformation Format, so it is easy to confuse the two ideas. In Unicode there is one encoding, but many Transformation Formats. The Chinese GB18030 standard, for example, is a different byte encoding of the same Unicode character repertoire.

Non-Unicode encodings are classified, misleadingly, as Single Byte, Double Byte, or Multibyte. All 7-bit and 8-bit character sets have Single Byte encodings. Character sets that switch between 16-bit representations of Chinese characters and 8-bit representations of alphabets are said to have Double Byte encodings. Character sets that define a fixed-length representation for characters of two bytes or more are said to use Multibyte Encoding. The variable length UTF-8 form for Unicode fits into none of these three categories, and is described in Unicode terminology as a Unicode Transformation Format, not an encoding at all.
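
The distinction is easy to see from the byte counts produced by the codecs in Python's standard library. This is only a sketch using encodings mentioned above, nothing specific to any application:

    text_fr = "français"
    text_ja = "日本語"

    print(len(text_fr.encode("latin-1")))    # 8  - one byte per character (single byte)
    print(len(text_fr.encode("utf-8")))      # 9  - the 'ç' takes two bytes in UTF-8
    print(len(text_ja.encode("shift_jis")))  # 6  - two bytes per character (double byte)
    print(len(text_ja.encode("utf-8")))      # 9  - three bytes per character in UTF-8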

Unicode

Unicode contains a superset of existing character set repertoires, with each character at its own code point. It was developed to create one giant character set that could represent every alphabet, syllabary, logographic system, and collection of special characters in use.

A single byte character set can represent, at most, 256 unique symbols. Two bytes can represent up to 65536 characters. Three bytes or more can represent many millions of characters. The largest estimates of characters that might get into Unicode are less than 250,000.

The architecture of Unicode was originally set to provide a 32-bit code space, enough for several billion characters. It has since been redefined to consist of 17 planes of 65,536 code points each, for a total of 1,114,112 code points, at least four times larger than necessary.

 Bytes / Word    Address Space             Example Systems
 1               256 symbols               ASCII, ISO Latin codes
 2               65,536 symbols            Shift-JIS
 3               16,777,216 symbols
 4               4,294,967,296 symbols     Unicode UTF-32
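
The address-space figures above, and the size of the Unicode code space, are plain powers-of-two arithmetic, as this small Python check shows:

    for n in (1, 2, 3, 4):
        print(n, "bytes:", 256 ** n, "values")    # 256, 65536, 16777216, 4294967296

    print(17 * 65536)    # 1114112 code points: 17 planes of 65,536 each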

Unicode code points are simply numbers. We can represent numbers in a computer in many different ways for different purposes. It turns out that there are several formats that various organizations prefer for one reason or another.  The following table illustrates the major formats, using the character U+1D11E MUSICAL SYMBOL G CLEF as an example.


 Name            Size, bytes                       Endianness    Example (LE, BE)
 UCS-4/UTF-32    4                                 LE, BE        0x1E D1 01 00, 0x00 01 D1 1E
 UTF-16          2 or 4 (with surrogate pairs)     LE, BE        0x34 D8 1E DD, 0xD8 34 DD 1E
 UTF-8           variable                          none          0xF0 9D 84 9E
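
The byte sequences in the table can be reproduced with the codecs built into Python; the -le and -be codec names select little- and big-endian forms explicitly. A minimal sketch (the hex() separator argument requires Python 3.8 or later):

    clef = "\U0001D11E"    # U+1D11E MUSICAL SYMBOL G CLEF

    for codec in ("utf-32-le", "utf-32-be", "utf-16-le", "utf-16-be", "utf-8"):
        print(codec, clef.encode(codec).hex(" "))
    # prints, one per line:
    #   utf-32-le 1e d1 01 00
    #   utf-32-be 00 01 d1 1e
    #   utf-16-le 34 d8 1e dd
    #   utf-16-be d8 34 dd 1e
    #   utf-8 f0 9d 84 9e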

Character Encoding (Why UTF-8 Is Your New Best Friend)

Unless you have a very good reason to use another encoding, your website, application, etc. should use UTF-8 as its default encoding. UTF-8 is a variable-length encoding system that works in concert with Unicode, representing each character as a sequence of 1 to 4 bytes, with the high bits of each byte marking whether it begins a sequence or continues one. This method lets an application encode plain ASCII text in a single byte per character, stepping up to 2-, 3- and 4-byte sequences only when accented letters, other alphabets, pictographic symbols, etc. are required. It also makes it possible to combine text from many different alphabets and symbol sets in a single string, something that was difficult or impossible with older character sets.
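
A short Python sketch makes the mechanism visible: the sequence grows from 1 to 4 bytes as the code point gets larger, and the high bits of each byte show whether it starts a sequence or continues one.

    for ch in ("A", "é", "中", "\U0001D11E"):
        data = ch.encode("utf-8")
        print(len(data), " ".join(f"{b:08b}" for b in data))
    # 1 01000001                             (ASCII byte, high bit 0)
    # 2 11000011 10101001                    (lead byte 110..., continuation byte 10...)
    # 3 11100100 10111000 10101101
    # 4 11110000 10011101 10000100 10011110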

Backward Compatible

UTF-8 is backwards compatible with ASCII and the basic Latin character set: the first 128 Unicode code points map directly to the legacy ASCII character set, so a pure-ASCII file is already valid UTF-8. Unicode was also designed so that its code points map to existing ISO and CJK (Chinese/Japanese/Korean) character sets, which makes porting applications to Unicode/UTF-8 fairly easy (and trivial for ASCII/Latin applications).
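
Concretely, a pure-ASCII string produces exactly the same bytes whether it is encoded as ASCII or as UTF-8, so legacy ASCII data can be read as UTF-8 unchanged (a minimal sketch):

    text = "Plain ASCII text"
    assert text.encode("ascii") == text.encode("utf-8")   # identical byte sequences

    legacy = b"Plain ASCII text"          # bytes produced by an old ASCII-only program
    print(legacy.decode("utf-8"))         # decodes cleanly as UTF-8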

Future Proof

UTF-8, in combination with Unicode, works with a much larger symbol space. A sequence can be up to 4 bytes long, carrying 21 bits of code point, enough for more than two million distinct values and comfortably more than the 1,114,112 code points Unicode defines, which in turn is far more than are used in every alphabet and written language system in the world. In fact, the system can represent not only conventional writing, but also mathematical and musical symbol systems, and can be extended to represent new symbol sets as the need arises.
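
The arithmetic behind that headroom, sketched in Python: a four-byte UTF-8 sequence has the form 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, so it carries 3 + 6 + 6 + 6 = 21 payload bits.

    payload_bits = 3 + 6 + 6 + 6     # bits carried by a 4-byte UTF-8 sequence
    print(2 ** payload_bits)         # 2097152 representable values
    print(0x10FFFF + 1)              # 1114112 code points defined by Unicode (U+0000..U+10FFFF)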

Issues For Software Developers

Beware: most programming languages have a significant ASCII "legacy" and will fall back to ASCII encoding, sometimes for no apparent reason. Python 2, the primary language used in Google's grid computing system, is a good example. Python is an excellent language: easy to read, yet a powerful object-oriented language. Its one major weakness is an annoying habit of defaulting back to ASCII, and doing so inconsistently. When this happens, for example when you concatenate a byte string with a Unicode string, it will sometimes raise an error. The documentation on how to deal with this is rather poor, so you can spend a lot of time troubleshooting something that should be handled for you automatically.
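
The practical discipline, shown here as a minimal Python 3 sketch (Python 3 removed the implicit ASCII coercion and fails loudly instead), is to keep bytes and text separate and convert explicitly at the boundaries:

    raw = "français / 日本語".encode("utf-8")   # bytes, e.g. as read from a file or socket

    # raw + "more text" would raise TypeError in Python 3: bytes and str do not mix
    text = raw.decode("utf-8")                  # decode explicitly at the boundary
    combined = text + ", and more text"
    print(combined.encode("utf-8"))             # encode explicitly on the way back out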

This issue exists in some form in many programming languages with Western origins, so you should spend some time understanding how your preferred language deals with character encoding, and be meticulous about using UTF-8 as the default encoding for string operations, file I/O, and so on. One notable exception is Ruby, which, having been conceived in Japan, had to deal with multilingual text from early on; most other major programming languages bolted Unicode support on later in their development.

Unicode Compatible Fonts

Many commonly used fonts are NOT Unicode compliant. When designing software and web services, you should take care to use Unicode compatible fonts wherever possible. The standard fonts (Arial, Courier, Times Roman, etc) are usually available in a Unicode-ready form. Fancier fonts, on the other hand, may be limited to Western scripts and should not be used for multilingual applications. While users can sometimes override the font selection, and sometimes their system will do this for them, the result can be a very confusing situation in which the application works normally but displays garbage characters or empty boxes in place of the correct symbols. As of this writing, the following fonts are known to work correctly with Unicode (although they may not support all alphabets and symbol sets, especially for less common languages such as Khmer).

Windows

  • Arial (included with Windows)
  • Arial Unicode MS (included with MS Office)
  • Lucida Sans Unicode
  • Microsoft Sans Serif
  • Tahoma
  • Times New Roman

Mac OS X

  • Lucida Grande

NOTE: Mac OS X in general has been extensively localized and translated to dozens of languages and geographical regions worldwide. Unicode compatibility on Mac applications and web browsers has not been a widely reported problem, although certain alphabets and symbol sets are often omitted from some fonts.

Other

You can find a more detailed list of free and open source fonts at en.wikipedia.org/wiki/Unicode_typefaces

What Do I Do If The Standard Fonts Don't Work For My Language?

You can try the list of Unicode typefaces (via Wikipedia, below) as a starting point. Another good strategy is to search on the following term:

"unicode font [name of your language or symbol set] [your operating system]"

for example:

"unicode font kanji android"
"unicode typeface kanji android"

Even if you speak an obscure language, there is a good chance that a graphic designer has created a font for it. Some of these fonts are commercially licensed, but many are available as freeware or under open source licenses.

More Information

ASCII (http://en.wikipedia.org/wiki/ASCII) : American Standard Code for Information Interchange

CJK (http://en.wikipedia.org/wiki/CJK) : Chinese/Japanese/Korean Character Sets

ISO-8859-1 (http://en.wikipedia.org/wiki/ISO_8859-1) : Extended Latin Character Set

Unicode (http://en.wikipedia.org/wiki/Unicode) : Wikipedia

Unicode Compatible Typefaces (http://en.wikipedia.org/wiki/Unicode_typefaces) : Wikipedia

UTF-8 Tutorial (http://en.wikipedia.org/wiki/UTF-8) : Wikipedia