
Unicode troubles

In the development of the Unicode standard we see megalomaniac traits appearing more and more often. Recently a fierce discussion raged about how Unicode should actually be viewed and whether there are not a great many misconceptions about it. A recent, extensive article about Unicode in a prominent English newspaper turned out - as became clear in the discussion on the Unicode mailing list on the internet - to be based entirely on information from 1991...

Somewhat more up-to-date information shows that the number of characters to be given a number of their own has already risen far above 65,536. For those who remember: with 7 bits 128 characters could be encoded [the notorious American ASCII standard], with 8 bits 256 [extended ASCII, with umpteen local variants], and with 16 bits 65,536. It is the 16-bit variant that has slowly found its way to today's PC user. M$ Word stores text by default with not 8 but 16 bits per character and could therefore record all 65,536 different 'code points'.

The resistance to this 'doubling' in text storage and data traffic had already led to the invention of clever techniques by which at least the Western computer user would not be bothered by the 'doubling', thanks to the so-called Unicode Transformation Formats or UTFs - notably UTF-7 [yields 7 bits for the Americans, by now really no longer acceptable] and UTF-8 [yields 8 bits for the Americans]. And with such 'compressions' the limit of 65,536 is no longer relevant. We can calmly carry on, and every 'character' ever recorded - somewhere, at some time, by something or someone - should be able to get its own place in the long row of 'code points'.
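What such a transformation format means in practice is easily shown with a small, purely illustrative Python sketch (not part of the original discussion): the same text takes up a different number of bytes in UTF-8 than in UTF-16, and plain ASCII text pays no penalty at all in UTF-8.

    # Compare the storage cost of a few strings in UTF-8 and UTF-16.
    # Pure ASCII stays at one byte per character in UTF-8, while UTF-16
    # doubles it; a character beyond the BMP needs four bytes in both.
    samples = ["ASCII only", "héél gewoon", "\U0001D11E"]  # last one: MUSICAL SYMBOL G CLEF

    for text in samples:
        utf8 = text.encode("utf-8")
        utf16 = text.encode("utf-16-be")   # without a byte order mark
        print(f"{text!r}: {len(text)} code points, "
              f"{len(utf8)} bytes in UTF-8, {len(utf16)} bytes in UTF-16")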

Ken Whistler tried to give an estimate of the number of 'code points' not yet assigned and the number already assigned...

BTW, if anyone was wondering where I came up with the figure 880,325 reserved unassigned code points for Unicode 3.1, here are the complete statistics for Unicode 3.0 and Unicode 3.1:

Unicode:                        U 3.0      U 3.1

BMP* Alphas/Symbols             10236      10238
Suppl Alphas/Symbols                        1691
Han (URO)                       20902      20902
Han (Ext A)                      6582       6582
Han (Ext B)                                42711
Han Compat                        302        302
Suppl Han Compat                             542
Hangul Syllables                11172      11172

Subtotal                        49194      94140

BMP Private Use                  6400       6400
Suppl Private Use              131068     131068
Surrogate Code Points            2048       2048
Controls                           65         65
BMP Noncharacters                   2         34
Suppl Noncharacters                32         32
BMP Reserved                     7827       7793
Suppl Reserved                 917476     872532

The total number of code points accounted for here is 1,114,112 (= 17 x 64K), i.e. U+0000..U+10FFFF.

* BMP = Basic Multilingual Plane; each plane comprises 64K = 65,536 code points.


The 'code points' used to be written in hexadecimal as U+0000 through U+FFFF - the old situation, with 16 to the 4th power = 65,536. In the new situation [Unicode version 3.1] the upper limit has been moved to U+10FFFF, which is no longer a power of 16 and therefore no longer corresponds to a whole number of bits. More than 16 bits, then, but still nowhere near 32, not even 24...
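For those who want to check the arithmetic, a small, purely illustrative Python sketch:

    import math

    # The code space U+0000..U+10FFFF spans 17 planes of 64K code points each.
    total = 17 * 65536
    print(total)                   # 1114112
    print(math.log2(total))       # about 20.09: more than 20 bits, fewer than 21
    print(0x10FFFF + 1 == total)  # True: U+10FFFF is indeed the last code point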

The discussion flared up when someone described Unicode 3.1 as a '16-bit character encoding standard' and Marco Cimarosti jumped on it:

Doug Ewell wrote:

> "A 16-bit character encoding standard [...]
> By contrast, 8-bit ASCII [...]

These two statements are regularly found together, but it is the second one that makes me despair.

If nearly half a century was not enough time for people to learn that ASCII is a 7-bit encoding, how long will it take to fix the Unicode misconception? Moreover, the analogy between the two statements above is illusory, the Unicode misconception being much bigger than the ASCII one. In fact, it *does* make sense to say that "ASCII is an n-bit encoding". The only problem is that the correct value for n is 7, not 8. But in the case of Unicode it is not possible to change "16" with the correct number, because there is no correct number!

When I tried fighting the 16-bit misconception, I found myself involved in a long explanation (versions, surrogates, UTF's, how many Chinese ideographs...), at the end of which my interlocutors normally ask: "So, how many bits does it have?"

How about considering UTF-32 as the default Unicode form, in order to be able to provide a short answer of this kind:

"Unicode is now a 32-bit character encoding standard, although only about one million of codes actually exist, and there are ways of representing Unicode characters as sequences of 8-bit bytes or 16-bit words."

Marco

On the term 'ASCII', Edwin Hart chips in as well:

I am unsure if "8-bit ASCII" is a well-defined term. "ASCII" implies ANSI X3.4-1986 and the 7-bit ASCII code. It was my intention for ISO/IEC 8859-1 to be the 8-bit ASCII standard. When the US adopted ISO 8859-1 as a US standard (ANSI/ISO 8859-1), as editor I asked ANSI to add "(8-bit ASCII)" to the end of the title. I never purchased a copy to see if ANSI did this.

Edwin F. Hart

Others, too, reach far back into their memories:

Doug Ewell:

I wrote:

>> Even 8-bit ASCII is a correct term meaning ISO-8859-1.
>
> I would question that. Understandable, yes, but not really correct.

jcowan@reutershealth.com wrote:

> No, it *is* correct. ANSI X.3 (which has a new name these days) in fact
> did define an 8-bit American Standard Code for Information Interchange,
> being exactly the same as ISO 8859-1.
>
> Of course, that does not affect the definition of the 7-bit American
> Standard Code.

Meanwhile, roozbeh@sharif.edu wrote:

> In the computer culture I grew up, 8-bit ASCII meant CP437. Every author called the CP437 table that was available at the end of computer books the ASCII table.

And perhaps the Mac people think of MacRoman as "8-bit ASCII." The 8-bit extensions to ASCII are just that, extensions - they are not ASCII. Even ISO 8859-1 cannot be called "the" 8-bit ASCII - if it were, there would be no need for ISO 8859-2, -3, -4, -5, etc.

Of course, it could be worse. Ten years ago, one of the WordPerfect experts at my work had a name for all those strange Greek, Cyrillic, box-drawing and happy-face characters that were listed in Appendix Z of the manual and had to be entered with a special {x, y} key sequence. She called them "the ASCII characters."

-Doug Ewell
Fullerton, California


What Doug is referring to are the more than 1,700 different characters that WordPerfect 5.x organised into 'character sets': 0 for 'ASCII', 1 for 'Multinational', 11 for 'Japanese Kana' and 12 for 'user definable'. An unprecedented luxury for its time, and revolutionary!! Unicode did not exist yet - it only arrived in 1991...

From Poland comes the following question:

> A little out of date, but describing correctly the state of art in 1991
> before the merger. Even 8-bit ASCII is a correct term meaning ISO-8859-1.

What were/are the reasons to refer to ISO 8859-1 as 8-bit ASCII?

Best regards

Janusz S. Bień

Prof. Janusz S. Bień, Warsaw University, http://www.orient.uw.edu.pl/ jsbien/

To which Ed Hart replies:

Two reasons:

  1. By specifying an 8-bit ASCII standard, the US then invalidated any vendor/proprietary code being called "8-bit ASCII".
  2. It also distinguished between the 7-bit X3.4-1986 ASCII and an 8-bit extension.

Ed Hart

And from Japan comes Joel Rees, who, clearly from a programmer's point of view, objects to the ever-expanding Unicode universe and to having to keep supporting it all in software:

What exactly _would_ be wrong with calling UNICODE a thirty-two bit encoding with a couple of ways to represent most of the characters in a smaller number of bits? From a practical perspective, that would seem the most correct and least misleading to me. (For example, no one writing a character handling library is going to try to declare a 24 bit type for characters, and no one writing a real character handling library is going to try to build flat monolithic classification tables for the whole 90,000+ in the current set anyway.)

I do realize that some managers at (particularly) Sun and Microsoft are probably still feeling a little like they've got pie in the face because their wonderful 16 bit character types turned out not to be as simple a solution as they claimed they would.

Btw, saying approximately 20.087 bits (Am I calculating that right - log2[ 17*65536 ]?) causes many people to think they are just being teased.

Now I happen to be of the opinion that the attempt to proclaim the set closed at 17 planes is a little premature. It's the newby in me, I'm sure, but I still remember that disconcerted feeling I got when my freshman Algebra for CS teacher pointed out that real character sets are by principle not subject to closure - something like the churning in the stomach I got when thinking of writing a program that would fill more than 64K of memory. :)

Joel Rees, Media Fusion KK
Amagasaki, Japan


And Thomas Lord, as far as I am concerned, deserves the last word in this discussion:

Here is a chapter of a reference manual I've been working on. The original manual can be found at http://www.regexps.com, along with some useful Unicode software (a fast regular expression matcher, a database for C, and some handy data structures).

The manual as a whole is covered by the GNU Free Documentation License, but the plain-text version in this message may be reproduced unconditionally.

Thomas Lord
regexps.com

Absurdly Brief Introduction to Unicode

copyright 2001, Thomas Lord, regexps.com, Pittsburgh PA
Permission is granted to reproduce this text verbatim, without
further restrictions, except that this copyright notice and
permission statement must be included. Permission is
granted to reproduce this text with modifications, provided
that this copyright notice and permission statement are
included, and the copy is clearly marked as "modified from
the original".

This chapter is a very succinct introduction to the Unicode character set. It may be useful when trying to read this manual, but it is not intended to be a thorough introduction. One place to learn more about Unicode is the web site of the Unicode Consortium. The current definition of Unicode is published as The Unicode Standard Version 3.0 by the Unicode Consortium.


Characters

Unicode defines a set of _abstract_characters_. Roughly speaking, abstract characters represent indivisible marks that people use in writing systems to convey information. In western alphabets, for example, latin small letter A is the name of an abstract character. That name doesn't refer to a in a particular font, but rather to the idea of small A in general.

Unicode includes a number of abstract characters which are formatting marks: they give an indication of how adjacent characters should be rendered but do not themselves correspond to what one might ordinarily think of as a "written character".

Unicode includes a number of abstract characters which are control characters: they have traditional (and sometimes standard) meaning in computing, but do not correspond to any feature of human writing.

Unicode includes a number of abstract characters which are usually combined with other characters (such as diacritical marks and vowel marks).

The goal of Unicode is to encode the complete set of abstract characters used in human writing, sufficient to describe all written text.

The situation is complicated by three factors: the necessarily large size of a global character set; the occasionally arbitrary decisions that must be made about what counts as an abstract character and what does not; and the generally acknowledged desirability of supporting bijective mappings between a variety of older character sets and subsets of Unicode.

Code Points

A _code_point_ is an integer value which is assigned to an abstract character. Each character receives a unique code point.

By convention, code points are always written in hexadecimal notation, prefixed by the string U+. Usually, no less than four hexadecimal digits are written.
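[Editorial aside, not part of Lord's manual: the convention is easy to mimic, for instance in Python:]

    def u_notation(code_point: int) -> str:
        # Hexadecimal, upper case, padded to at least four digits.
        return f"U+{code_point:04X}"

    print(u_notation(0x41))       # U+0041  (LATIN CAPITAL LETTER A)
    print(u_notation(0x1D11E))    # U+1D11E (beyond the BMP: five digits)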

Unicode code points are in the closed range U+0000..U+10FFFF. Thus, it requires at least 21 bits to hold a Unicode code point. Sometimes people say that "Unicode is a 16-bit character set"; that is an error.

There are (now and for the foreseeable future) many more code points than abstract characters. Revisions to Unicode add new characters and, sometimes, recommend against using some old characters, but once a code point has been "assigned", that assignment never changes.


Some Special Code Points

Unicode code points U+0000..U+007F are essentially the same as ASCII code points.

Unicode code points U+0000..U+00FF are essentially the same as ISO 8859-1 code points ("Latin 1").

Two code points represent non-characters. These are U+FFFE and U+FFFF. Programs are free to give these values special meaning internally.

The code point U+FEFF is assigned to the formatting character "zero-width no-break space". This character has a special significance when it occurs in certain serialized representations of Unicode text. This is described in the next section.

Code points in the range U+D800..U+DFFF are called _surrogates_. They are not assigned to abstract characters. Instead, they are used in pairs as one way to represent a code point in the range U+10000..U+10FFFF. This is also described in the next section.

Encoding Forms

If Unicode code points occupy 21-bits of storage, how is a string of Unicode text represented? There are two recommended alternatives called UTF-8 and UTF-16. Collectively, systems of representing strings are known as _encoding_forms_.

The definition of an encoding form consists of a _code_unit_ (an unsigned integer type with a fixed number of bits, usually fewer than 21) and a rule describing a bijective mapping between code points and sequences of code units. UTF-8 uses 8-bit code units. UTF-16 uses 16-bit code units.

In UTF-8, code points in the range U+0000..U+007F are stored in a single code unit (one byte). Other code points are represented by a sequence of two or more code units, each byte in the range 80..FF. The details of these multi-byte sequences are available in countless Unicode reference materials.
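[Editorial aside, not part of Lord's manual: a minimal, purely illustrative Python sketch of these multi-byte sequences; it skips validity checks such as rejecting surrogate code points:]

    def utf8_encode(cp: int) -> bytes:
        # Encode one code point (U+0000..U+10FFFF) as a UTF-8 byte sequence.
        if cp < 0x80:                      # one code unit: 0xxxxxxx
            return bytes([cp])
        if cp < 0x800:                     # two code units: 110xxxxx 10xxxxxx
            return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
        if cp < 0x10000:                   # three code units: 1110xxxx 10xxxxxx 10xxxxxx
            return bytes([0xE0 | cp >> 12, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
        # four code units: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | cp >> 18, 0x80 | cp >> 12 & 0x3F,
                      0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])

    assert utf8_encode(0xE9) == "é".encode("utf-8")              # U+00E9 -> C3 A9
    assert utf8_encode(0x10FFFF) == "\U0010FFFF".encode("utf-8")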

In UTF-16, code points in the range U+0000..U+FFFF are stored in a single 16-bit code unit. Other code points are represented by a pair of surrogates, each stored in one code unit. Again, the details of multi-code-unit sequences are readily available elsewhere.
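[Editorial aside: the surrogate-pair arithmetic in a few lines of Python, purely illustrative:]

    def utf16_units(cp: int) -> list:
        # Represent one code point as a sequence of 16-bit code units.
        if cp < 0x10000:                   # BMP code point: a single code unit
            return [cp]
        cp -= 0x10000                      # 20 bits remain: split them into two halves
        return [0xD800 | cp >> 10,         # high (leading) surrogate
                0xDC00 | cp & 0x3FF]       # low (trailing) surrogate

    assert utf16_units(0x0041) == [0x0041]
    assert utf16_units(0x1D11E) == [0xD834, 0xDD1E]   # MUSICAL SYMBOL G CLEF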

Not every sequence of 8-bit values is a valid UTF-8 string. Not every sequence of 16-bit values is a valid UTF-16 string. Strings that are not valid are called "ill-formed".

When stored in the memory of a running program, UTF-16 code units are almost certainly stored in the native byte order of the machine. In files and when transmitted, two byte orders are possible. When byte order distinctions are important, the names UTF-16be (big-endian) and UTF-16le (little-endian) are used.

When a stream of text has a UTF-16 encoding form, and when its byte order is not known in advance, it is marked with a byte order mark. A byte order mark is the formatting character "zero-width no-break space" (U+FEFF) occurring as the first character in the stream. By examining the first two bytes of such a stream, and assuming that those bytes are a byte order mark, programs can determine the byte-order of code units within the stream. When a byte order mark is present, it is not considered to be part of the text which it marks.
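[Editorial aside: a minimal sketch of such byte order detection in Python; real decoders naturally do considerably more:]

    def detect_utf16_byte_order(data: bytes) -> str:
        # U+FEFF serialized big-endian is FE FF; little-endian is FF FE.
        if data[:2] == b"\xFE\xFF":
            return "UTF-16be (BOM present)"
        if data[:2] == b"\xFF\xFE":
            return "UTF-16le (BOM present)"
        return "no BOM: the byte order must be known from elsewhere"

    print(detect_utf16_byte_order("tekst".encode("utf-16")))  # Python's utf-16 codec prepends a BOM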

Another encoding form has been standardized that may become popular in the future: UTF-32. In UTF-32, code units are 32 bits and each code point is stored in a single code unit.


Character Properties

In addition to naming a set of abstract characters, and assigning those characters to code points, the definition of Unicode assigns each character a collection of _character_properties_.

The possible properties a character may have and their meanings are too numerous to list here. Three examples are:

general category - such as "lowercase letter", "uppercase letter", "decimal digit", etc.

decimal digit value - if the character is used as a decimal digit, this property is its numeric value.

case mappings - the default lowercase character corresponding to an uppercase character, and so forth.

The Unicode consortium publishes definitions of various character properties and distributes text files listing those properties for each code point. For more information, visit http://www.unicode.org.
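[Editorial aside: Python's standard unicodedata module is derived from these property files; a tiny illustration:]

    import unicodedata

    for ch in ["a", "Å", "٣"]:                # the last is ARABIC-INDIC DIGIT THREE
        print(unicodedata.name(ch),
              unicodedata.category(ch),       # general category, e.g. Ll, Lu, Nd
              unicodedata.decimal(ch, None),  # decimal digit value, if any
              ch.upper())                     # a simple case mapping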


Copyright © Rein Bakhuizen van den Brink
Last updated on 21 November 2001
