ASCII, UTF-8, ISO-8859… You may have seen these strange monikers floating around, but what do they actually mean? Read on as we explain what character encoding is and how these acronyms relate to the plain text we see on screen.
Fundamental Building Blocks
When we talk about written language, we talk about letters being the building blocks of words, which then build sentences, paragraphs, and so on. Letters are symbols which represent sounds. When we talk about language, we're talking about groups of sounds that come together to form some sort of meaning. Each language system has a complex set of rules and definitions that govern those meanings. If you have a word, it's useless unless you know what language it's from and you use it with others who speak that language.
(Comparison of Grantha, Tulu, and Malayalam scripts, Image from Wikipedia)
In the world of computers, we use the term “character.” A character is sort of an abstract concept, defined by specific parameters, but it is the fundamental unit of meaning. The Latin ‘A’ is not the same as a Greek ‘alpha’ or an Arabic ‘alif’ because they have different contexts – they’re from different languages and have slightly different pronunciations – so we can say that they are different characters. The visual representation of a character is called a “glyph” and different sets of glyphs are called fonts. Groups of characters belong to a “set” or a “repertoire.”
When you type up a paragraph and you change the font, you're not changing the phonetic values of the letters, you're changing how they look. It's just cosmetic (but not unimportant!). Some languages, like ancient Egyptian and Chinese, have ideograms; these represent whole ideas instead of sounds, and their pronunciations can vary over time and distance. If you substitute one ideogram for another, you're substituting a whole idea, not just a letter.
(Image from Wikipedia)
When you type something on the keyboard, or load a file, how does the computer know what to display? That's what character encoding is for. Text on your computer isn't actually letters, it's a series of numeric values. The character encoding acts as a key for which values correspond to which characters, much like how orthography dictates which sounds correspond to which letters. Morse code is a sort of character encoding: it defines how groups of long and short signals, such as beeps, represent characters. In Morse code, the characters are just English letters, numbers, and a little punctuation. There are many computer character encodings which translate into letters, numbers, accent marks, punctuation marks, international symbols, and so on.
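To make that concrete, here's a quick Python sketch of the idea. The string and values are just illustrative; the point is that text is stored as numbers, and the encoding is the key that turns them back into characters:

```python
# Text in memory or on disk is just numbers; the encoding is the key
# that maps those numbers back to characters.
text = "Hi!"
data = text.encode("ascii")    # characters -> numeric byte values
print(list(data))              # [72, 105, 33]
print(data.decode("ascii"))    # back to "Hi!"
```

Decoding the same bytes with a different key would produce different characters, which is exactly why two parties need to agree on an encoding.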
Often on this topic, the term "code pages" is also used. Code pages are essentially character encodings as used by specific companies, often with slight modifications. For example, the Windows-1252 code page (commonly, if inaccurately, called "ANSI") is a modified form of ISO-8859-1. They're mostly used as internal identifiers for the standard and modified character encodings specific to those systems. Early on, character encoding wasn't so important because computers didn't communicate with each other. With the internet rising to prominence and networking being a common occurrence, it has become an increasingly important part of our day-to-day lives without us even realizing it.
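You can see one of those "slight modifications" directly in Python, which ships codecs for both encodings. Byte value 0x80, chosen here just as an example, is where the two keys disagree:

```python
# The same byte decodes differently depending on which key you use.
b = bytes([0x80])
print(b.decode("cp1252"))          # '€' — Windows-1252 puts the Euro sign here
print(repr(b.decode("latin-1")))   # '\x80' — ISO-8859-1 keeps an unprintable control character
```

Mix-ups like this are the source of many of the stray symbols you see when a document is decoded with the wrong code page.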
Many Different Types
(Image from sarah sosiak)
There are plenty of different character encodings out there, and there are plenty of reasons for that. Which character encoding you choose to use depends on what your needs are. If you communicate in Russian, it makes sense to use a character encoding that supports Cyrillic well. If you communicate in Korean, then you’ll want something that represents Hangul and Hanja well. If you’re a mathematician, then you want something that has all of the scientific and mathematical symbols represented well, as well as the Greek and Latin glyphs. If you’re a prankster, maybe you’d benefit from upside-down text. And, if you want all of those types of documents to be viewed by any given person, you want an encoding that’s pretty common and easily accessible.
Let’s take a look at some of the more common ones.
(Excerpt of ASCII table, Image from asciitable.com)
- ASCII – The American Standard Code for Information Interchange is one of the older character encodings. It was originally devised based on telegraphic codes and evolved over time to include more symbols and some now-obsolete, non-printing control characters. It's probably as basic as you can get in terms of modern systems, as it's limited to the unaccented Latin alphabet. Its 7-bit encoding allows for only 128 characters, which is why there are several unofficial variants in use around the world.
- ISO-8859 – The International Organization for Standardization's most widely used group of character encodings is number 8859. Each specific encoding is designated by a number, often accompanied by a descriptive moniker, e.g. ISO-8859-3 (Latin-3), ISO-8859-6 (Latin/Arabic). Each is a superset of ASCII, meaning that the first 128 values in the encoding are the same as ASCII. It's 8-bit, however, and allows for 256 characters, so it builds from there and includes a much wider array of characters, with each specific encoding focusing on a different set of criteria. Latin-1 covers a range of accented letters and symbols for Western European languages; a later revision, Latin-9, swaps out a few rarely used glyphs for updated ones like the Euro symbol.
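The relationships described in these two entries can be checked in a few lines of Python (the specific characters below are just examples):

```python
# The first 128 values decode identically in ASCII and ISO-8859-1...
for i in range(128):
    assert bytes([i]).decode("ascii") == bytes([i]).decode("iso-8859-1")

# ...but ASCII's 7 bits stop there, while ISO-8859-1 fills in values 128-255.
print("é".encode("iso-8859-1"))    # b'\xe9' — one byte, value 233
try:
    "é".encode("ascii")
except UnicodeEncodeError:
    print("'é' has no ASCII value")

# Latin-9 (ISO-8859-15) swaps in the Euro sign, which Latin-1 lacks.
print("€".encode("iso-8859-15"))   # b'\xa4'
```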
(Excerpt of Tibetan script, Unicode v4, from unicode.org)
- Unicode – This encoding standard aims at universality. It currently includes 93 scripts organized in several blocks, with many more in the works. Unicode works differently than other character sets: instead of directly coding for a glyph, each character is assigned a "code point," an abstract number conventionally written in hexadecimal, e.g. U+0040 (which translates to '@'). The code point identifies the character, while the glyph itself is provided separately by the program displaying it, such as your web browser. Specific encodings under the Unicode standard are UTF-8 and UTF-16. UTF-8 allows for maximum compatibility with ASCII: it uses 8-bit units, stores each ASCII character in a single byte, and represents every other character as a sequence of two to four bytes. UTF-16 ditches byte-level ASCII compatibility and instead uses 16-bit units, representing most characters in one unit and rarer ones as pairs.
- ISO-10646 – This isn't an actual encoding, but the ISO-standardized character set that is kept in sync with Unicode's repertoire. It's mostly important because it's the character repertoire referenced by HTML. Some of the more advanced functions provided by Unicode, such as collation and the mixing of right-to-left with left-to-right scripts, are missing. Still, it works very well for use on the internet, as it allows for a wide variety of scripts and lets the browser interpret the glyphs. This makes localization somewhat easier.
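Code points and the variable-width nature of the Unicode encodings are easy to poke at in Python; the sample characters below are arbitrary:

```python
# Every character has an abstract code point, written U+XXXX.
print(f"U+{ord('@'):04X}")            # U+0040

# UTF-8 spends one byte on ASCII and two to four on everything else.
for ch in ("A", "é", "€"):
    print(ch, len(ch.encode("utf-8")), "byte(s)")

# UTF-16 uses 16-bit units, so even plain ASCII costs two bytes per character.
print(len("A".encode("utf-16-le")))   # 2
```

That one-byte ASCII path is what the text above means by UTF-8's "maximum compatibility": a pure-ASCII file is already valid UTF-8, byte for byte.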
What Encoding Should I Use?
Well, ASCII works for most English speakers, but not for much else. More often you'll be seeing ISO-8859-1, which works for most Western European languages. The other versions of ISO-8859 work for Cyrillic, Arabic, Greek, or other specific scripts. However, if you want to display multiple scripts in the same document or on the same web page, UTF-8 allows for much better compatibility. It also works really well for people who use proper typographic punctuation, math symbols, or miscellaneous characters, such as squares and checkboxes.
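The multi-script point is the decisive one in practice. As a sketch (the sample string is arbitrary):

```python
# One string mixing Latin, Greek, Cyrillic, and Hangul.
mixed = "Latin, Ελληνικά, русский, 한국어"

# UTF-8 round-trips the whole mix without loss...
assert mixed.encode("utf-8").decode("utf-8") == mixed

# ...while a single-script encoding like ISO-8859-1 can't hold them all.
try:
    mixed.encode("iso-8859-1")
except UnicodeEncodeError:
    print("ISO-8859-1 can't represent all of these scripts at once")
```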
(Multiple languages in one document, Screenshot of gujaratsamachar.com)
There are drawbacks to each set, however. ASCII is limited in its punctuation marks, so it doesn't work incredibly well for typographically correct editing. Ever copy and paste from Word only to get some weird combination of glyphs? That's the drawback of ISO-8859, or more correctly, its supposed interoperability with OS-specific code pages (we're looking at YOU, Microsoft!). UTF-8's major drawback is a lack of proper support in some editing and publishing applications. Another problem is that browsers often don't interpret the byte order mark at the start of a UTF-8 encoded file and just display it, which results in unwanted glyphs appearing in the text. And of course, declaring one encoding and using characters from another without declaring/referencing them properly on a web page makes it difficult for browsers to render them correctly and for search engines to index them appropriately.
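The byte order mark problem is easy to reproduce. A minimal Python sketch, using Python's `"utf-8-sig"` codec as one way to handle it:

```python
# A UTF-8 byte order mark is the three bytes EF BB BF at the start of a file.
data = "\ufeffHello".encode("utf-8")
print(data[:3])                     # b'\xef\xbb\xbf'

# A decoder that doesn't special-case it leaves a stray U+FEFF in the text...
print(repr(data.decode("utf-8")))   # '\ufeffHello'

# ...which is why Python offers "utf-8-sig" to strip it on the way in.
print(data.decode("utf-8-sig"))     # Hello
```

That stray U+FEFF is exactly what shows up as junk glyphs at the top of a page when a browser or editor renders the mark instead of swallowing it.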
For your own documents, manuscripts, and so forth, you can use whatever you need to get the job done. As far as the web goes, though, it seems that most people agree on using a UTF-8 version that does not use a byte order mark, but that’s not entirely unanimous. As you can see, each character encoding has its own use, context, and strengths and weaknesses. As an end-user, you probably won’t have to deal with this, but now you can take the extra step forward if you so choose.
Yatri Trivedi is a monk-like geek. When he's not overdosing on meditation and geek news of all kinds, he's hacking and tweaking something, often while mumbling in 4 or 5 other languages.
- Published 03/11/11