Tim on Leadership

Musings on Management and Leadership from Tim Parker

Doing the Unicode Twist

A while ago I had to explain to a few confused executives why Japanese characters were not showing up in some emails generated by our software.  I started in on the old "ASCII is 8 bit" and "Unicode is 16 bit and can handle funny characters" but then stopped short as I realized what I was saying was not really true.  I needed to brush up on my knowledge because of the localization work we were doing, and that led me to thinking that most developers, myself included, really have this whole subject kinda messed up.

To put this in perspective, let's go back, waaaay back, to 1969 and the birth of UNIX.  UNIX spoke ASCII from the start, while IBM's mainframes were still speaking EBCDIC, an older encoding that ASCII was slowly displacing.  ASCII used the codes from 32 to 127 to represent all the English characters in upper and lower case, the numbers, punctuation like periods, colons, semicolons, and parentheses, and the space, in a very simple and predictable format.  All of it fit comfortably in 7 bits.  If you wanted to get fancy and use the eighth bit (a full byte), you had another 128 characters, codes 128 to 255, which included common accented letters.  A few word processors, such as WordStar, used that eighth bit to indicate a special action instead, and all codes below 32 were "unprintable" control characters meaning things like tab, backspace, and page breaks.  It was easy, as long as you used English or one of the few "high bit" accented characters.
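
Just to make that concrete, here is a quick sketch in Python (my choice of language purely for illustration, nothing to do with our product) showing the split between the printable 7-bit ASCII range and the control codes:

    # Plain ASCII: the printable characters live in codes 32 to 127.
    for ch in "Xx6;(":
        print(ch, ord(ch))                  # e.g. 'X' -> 88, all well under 128

    # Codes below 32 are the unprintable control characters.
    print(ord("\t"), ord("\b"), ord("\f"))  # tab=9, backspace=8, form feed=12

    # Everything in that range fits in 7 bits, so one byte per character.
    print("Hello".encode("ascii"))          # b'Hello' -- five bytes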

To support Asian languages (as well as other language character sets not represented by ASCII), a method called DBCS (double byte character set) was used, wherein some characters were stored and managed as two bytes, which together meant something but separately were garbage.  Some characters needed only one byte, some needed two, and coding with DBCS was a pain in the butt.  And each implementation of DBCS was different, just to add to the fun.
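
To see what that mixed one-byte/two-byte world feels like, here is a small Python sketch using Shift-JIS, one of the classic Japanese double byte character sets (the example strings are just ones I made up):

    # Shift-JIS is a classic DBCS: ASCII letters take one byte,
    # Japanese characters take two.
    print("A".encode("shift_jis"))     # b'A'           -- one byte
    print("日".encode("shift_jis"))    # b'\x93\xfa'    -- two bytes
    print("A日B".encode("shift_jis"))  # b'A\x93\xfaB'  -- four bytes for three
                                       # characters, which is exactly why DBCS
                                       # string handling was such a pain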

Unicode was proposed in 1987 and used two bytes (16 bits) for every character, including all the ASCII characters.  This gave coders 65,536 different character possibilities, more than enough, it seemed, for every language on Earth, with room left over for invented ones such as Klingon!

This sounds sensible until you actually look at Unicode and its standard (published jointly with ISO as ISO/IEC 10646).  In fact, a Unicode character does not represent a "character" such as "X" or "6".  Instead, it represents something called a "code point," which is a theoretical concept, not a "real" character.  The "X" in Unicode is the abstract idea of the letter "X", distinct from "x" and from "Y", and independent of any font, size, or byte layout.  In other words, it is not a physical manifestation of "X" but the concept of "X", which software can then render or store in many different ways.  I started to get a headache around this point reading the standard, but I do get what they were trying to say.  I also realized that the notations I was used to seeing were doing two different jobs: U+0058 is just the standard name for the code point of "X", while an encoding such as UTF-8 is a rule for turning that code point into actual bytes.
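
Here is the same idea in a few lines of Python, just as a sketch: the code point is nothing but a number attached to the abstract character, and actual bytes only appear once you pick an encoding (I am using "X" and the Japanese character 日 purely as examples):

    # A code point is just a number assigned to an abstract character.
    print(hex(ord("X")))                   # 0x58   -> written U+0058
    print(hex(ord("日")))                  # 0x65e5 -> written U+65E5

    # Bytes only exist once you choose an encoding for that code point.
    print("日".encode("utf-8").hex())      # 'e697a5' -- three bytes in UTF-8
    print("日".encode("utf-16-be").hex())  # '65e5'   -- two bytes in UTF-16
    print("X".encode("utf-8"))             # b'X'     -- plain ASCII comes through unchanged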

In summary, my explanation to the execs that Unicode was a way of handling foreign character sets was wrong.  Unicode is a conceptual way of dealing with characters, and it only becomes real bytes through an encoding method such as UTF-8.  Or at least I think that's what it means.  Maybe I need to work on the manifestations of my mental concepts a bit more.