Unicode troubles [to be continued]

In the previous installment I showed how the Unicode community said farewell to the 16-bit code. Lately, new scripts are frequently being proposed for encoding, and one sometimes wonders whether anyone really stops to consider how useful that is.

"Klingon" does not seem to be making it for the time being, but "Tengwar" and "Cirth", and more recently "Shavian" and "Nushu", stand a good chance:

Tengwar and Cirth have something to do with J.R.R. Tolkien, Shavian with George Bernard Shaw (and, by way of the kindred Deseret alphabet, with the Mormons of Brigham Young), and Nushu [new shoes] with a little Chinese women's language, to put it irreverently.

[Ken Whistler]

Some, like Tengwar, have taken a somewhat different path. Tolkien constructed it for aesthetic and literary purposes, and certainly never had the intent of someone like Shaw, to use it for the reform or replacement of an existing orthography. However, unlike Shavian, Tengwar has had a kind of organic success of a sort, spreading in its aesthetic and literary realm, and gaining a group of adherents. The fact that Tengwar is used to express a language that itself was also consciously constructed does not, as I see it, render it any less suitable for the purposes it is intended and used. After all, the Latin script is also used to express constructed languages such as Esperanto. I see no *moral* distinction here, even if Tengwar is more often put to the purpose of writing romantic nature poetry, whereas Esperanto tends to discussions of world government. ;-)

And as I have said before, I see nothing inherently less worthwhile in a well-constructed Elvish poem expressed in Tengwar than in a warehouse record from Uruk expressed in Sumero-Akkadian cuneiform.

[Doug Ewell]

I do feel that there is a difference between:

(a) scripts like Shavian and Deseret, which were invented in a completely serious vein, in an attempt to provide an alternative and presumably better means of writing a real language, but didn't quite catch on; and

(b) truly "fictional" scripts like Klingon, Tengwar, Cirth, and such that appear in novels or TV or movies and were never intended to be used seriously.

Both G.B. Shaw and the Mormons had genuine, if not universally shared, reasons for wanting to abandon the Latin script for writing English in favor of something "better." Shaw thought English literacy could be improved with a more regular writing system to take the place of the convoluted Latin-based orthography. (There are also rumors of darker motives, but the intent was still for serious use.) Brigham Young wanted to isolate the Mormons from the rest of the "corrupt" world of written English.

[Tom Emerson, senior Sinostringologist]

William Chiang(*) provides a table of some 1535 nushu glyphs, with possible hanzi originals and hanzi glosses for each nushu glyph.

His table is built from 141 documents covering 13 registers and containing some 107K glyphs. The oldest texts analyzed date from the beginning of the 20th century: older texts are much more difficult to come by because a woman's writings were often burned after her death. The texts range from a number of different authors.

I don't have a break down of the number of glyphs and each variant, though I've been meaning to do this for a while.

There is still active research being done on cataloging and categorizing the nushu writings, though this information is available only in Chinese, is published only in China, and has not been computerized.

I've been meaning to develop a proposal for nushu for a while, but have been stalled creating the fonts.

(*) Chiang, William W. "We Two Know The Script; We Have Become Good Friends: Linguistic and Social Aspects of The Women's Script Literacy in Southern Hunan, China". University Press of America, 1995.

Another heated point of discussion on the Unicode mailing list was that Oracle had adopted a new Unicode Transformation Format [UTF-8s] for its database product. It was compatible with earlier UTFs but took no account of the so-called 'surrogates', that is, 'code points' that do not stand for a character of their own but only serve to address the 'higher regions':

Code points in the range U+D800..U+DFFF are called _surrogates_. They are not assigned to abstract characters. Instead, they are used in pairs as one way to represent a code point in the range U+10000..U+10FFFF.
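The arithmetic behind that definition can be sketched in a few lines of Python. This is only an illustration; the function names `to_surrogate_pair` and `from_surrogate_pair` are made up for the occasion and do not come from the standard or from any library.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point in U+10000..U+10FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF are written as a pair")
    offset = code_point - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits -> U+D800..U+DBFF
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> U+DC00..U+DFFF
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into the code point it represents."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    high, low = to_surrogate_pair(0x10400)
    print(f"U+10400 -> {high:04X} {low:04X}")   # prints "U+10400 -> D801 DC00"
```

The 20 bits that remain after subtracting 0x10000 are exactly why the pair mechanism stops at U+10FFFF.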

The proposal for a UTF-8S caused quite a stir:

Juliusz Chroboczek:

Dear all,

In the discussion about UTF-8S, there is one point that has not been mentioned (or else I missed it).

Most people seem to be arguing from the point of view of users and developers on platforms on which Unicode is well-established as the default encoding. On Unix-like systems, however, ISO 2022-based encodings are still alive and kicking. Hard.

One of the main arguments in favour of using Unicode on such platforms is that it leads to a world in which there is only one encoding, both for the user and the developer. The multiplication of UTFs, however, not only breaks this model, but also leads to much confusion. (Heck, many users still think that UTF-8 and Unicode are two completely unrelated encodings! Try explaining to them that UTF-16 is Unicode too!)

I have tried to point this out when IANA were introducing UTF-16-BE and other monstrosities, only to be treated in a rather patronising manner by some of the respectable members of this list ("Juliusz's confusion can be explained by..."). Folks, from a user's perspective, UTF-8 and UTF-16 are two different encodings. Please don't make the situation worse than it already is. Don't create any more UTFs.

Whatever happens, we will continue to promote signature-less UTF-8 as the only user-visible encoding, and signature-less UTF-8 (mb) and BOM-less UCS-4 (wc) as the only programmer-visible ones. The more UTFs the Unicode consortium legitimises, the more explaining we'll have to do that "this is just a version of Unicode used on some other platforms, please convert it to UTF-8 before use."

Oracle:

As matter of fact, Oracle supported UTF-8 far earlier than surrogate or 4-byte encoding was introduced. As database vendor, Oracle took fully advantages of Unicode and also a victim of Unicode in sense of compatibility. As no burden of fonts and IME issue for a database to store Unicode at its server. Oracle supported very early version of Unicode in its Oracle 7 release as database character set AL24UTFFSS which means 3-byte encoding for UTF-FSS. When Unicode came to version 2.1, we found our AL24UTFFSS had trouble for 2.1 as Hangul's reallocation, and we could not simply update AL24UTFFSS to 2.1 definition as it would mess existing users' data in their database. So we came up with a new character set as UTF8 which is still 3-byte encoding to support Unicode 2.1. The choice of 3-byte encoding is also bound to AL24UTFFSS implementation as it would not break when users migrate AL24UTFFSS into UTF8.

In 9i release, we cannot make an easy expansion for UTF8 up to 4-byte for the backward compatibility. Although we specifically document that UTF8 does not support supplementary character in 8i, but users can still input surrogate through UCS-2 into UTF8 database as a pair of 3-byte ( this is true to other database vendors ), which will make hard for us to simply change UTF8 definition up to 4-byte. If we did this simple update, a pair of surrogates from 8i UTF8 database would be stored into 9i UTF8 without character set conversion, resulting in irregular forms in AL32UTF8, which could make migration even harder as there would be two different versions of UTF8 in a distributed system. So what we did in Oracle 9i is to introduced a new character set as AL32UTF8 for the standard UTF-8 up to 4-byte encoding, and user can easily migrate UTF8 to AL32UTF8 either in a database version migration or in a distributed environment.

People may argue that as there is no supplementary character defined before Unicode 3.1, it should be ok to simply update UTF8 to support 4-byte encoding without compatibility issue, but the case is not because we cannot force every Oracle customers to migrate their database into 9i, which means there is still a certain time period that Oracle 8i and 9i would be co-exist. You have to consider their compatibility and that's the price we have to pay to support Unicode.

Regards, Jianping.
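What "a pair of 3-byte" amounts to can be made visible with a small Python sketch. It rests on an assumption drawn only from the description in the mail above: that each 16-bit UTF-16 unit, surrogates included, is pushed through the ordinary 3-byte UTF-8 pattern on its own. The helper name `encode_per_utf16_unit` is hypothetical and is not Oracle's code.

```python
def encode_per_utf16_unit(text: str) -> bytes:
    """Encode every UTF-16 code unit separately with the UTF-8 bit pattern.

    Assumed behaviour of the 3-byte-only scheme: a supplementary character
    turns into two surrogates of 3 bytes each, 6 bytes in total.
    """
    utf16 = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # "surrogatepass" lets Python emit the bytes for a lone surrogate.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


if __name__ == "__main__":
    char = "\U00010400"  # a Deseret letter, outside the BMP
    print(char.encode("utf-8").hex(" "))          # standard UTF-8, 4 bytes
    print(encode_per_utf16_unit(char).hex(" "))   # pair of 3-byte sequences
```

For characters inside the BMP the two functions give identical bytes, which is why the difference only surfaces once supplementary characters are stored.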

Over the last few days, this email thread has generated many interesting discussions on the proposal of UTF-8s. At the same time some speculations have been generated on why Oracle is asking for this encoding form. I hope to clarify some of these misinformation in this email.

In Oracle9i our next Database Release shipping this summer, we have introduced support for two new Unicode character sets. One is 'AL16UTF16' which supports the UTF-16 encoding and the other is 'AL32UTF8' which is the UTF-8 fully compliant character set. Both of these conform to the Unicode standard, and surrogate characters are stored strictly in 4 bytes. For more information on Unicode support in Oracle9i , please check out the whitepaper "The power of Globalization Technology" on http://otn.oracle.com/products/oracle9i/content.html

The requests for UTF-8s came from many of our Packaged Applications customers (such as Peoplesoft , SAP etc.), the ordering of the binary sort is an important requirement for these Oracle customers. We are supporting them and we hope to turn this into a TR such that UTF-8s can be referenced by other vendors when they need to have compatible binary order for UTF-16 and UTF-8 across different platforms.

The speculation that we are pushing for UTF-8s because we are trying to minimize our code change for supporting surrogates, or because of our unique database design are totally false. Oracle has a fully internationalized extensible architecture and have introduced surrogate support in Oracle9i. In fact we are probably the first database vendor to support both the UTF-16 and UTF-8 encoding forms, we will continue to support them and conform to future enhancements to the Unicode Standard.

Regards Simon Law
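The binary sort order argument can be illustrated with two characters, one high in the BMP and one outside it. The following Python sketch compares raw byte strings only. The key function `utf8s_key` is a hypothetical stand-in for the proposed UTF-8s, assuming it encodes each UTF-16 unit separately.

```python
def utf16_key(text: str) -> bytes:
    """Big-endian UTF-16 bytes; byte order equals code unit order."""
    return text.encode("utf-16-be")


def utf8_key(text: str) -> bytes:
    """Standard UTF-8 bytes; byte order equals code point order."""
    return text.encode("utf-8")


def utf8s_key(text: str) -> bytes:
    """Assumed UTF-8s style: each UTF-16 unit encoded on its own."""
    utf16 = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


if __name__ == "__main__":
    bmp = "\uFF5E"            # FULLWIDTH TILDE, inside the BMP
    supp = "\U00010400"       # a Deseret letter, outside the BMP
    for name, key in [("UTF-16", utf16_key), ("UTF-8", utf8_key), ("UTF-8s", utf8s_key)]:
        order = [f"U+{ord(c):04X}" for c in sorted([bmp, supp], key=key)]
        print(name, order)
```

In UTF-16 the surrogate D801 sorts below FF5E, so the supplementary character comes first. In standard UTF-8 the lead byte F0 sorts above EF, so it comes last. Only characters from U+E000 to U+FFFF are affected relative to the supplementary range.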

Oracle's argument that a newer version of Unicode had landed them in trouble did not hold up:

The Hangul mess took place with Unicode 2.0, not 2.1. And this is a red herring anyway when we are talking about UTF-8. As stated before, UTF-8 has never changed even though the Unicode beneath it has changed:

* by moving the Hangul block in version 2.0
* by creating the UTF-16 mechanism to support surrogates in 1993 (not 2001)

The mechanism in UTF-8 to encode characters from U+10000 to U+10FFFF (actually U+1FFFFF) in 4 bytes was part of the original FSS-UTF specified in 1992. Check the records. It was never "added on" at some later date, causing existing conformant UTF-8 to break. If Oracle or any other vendor or developer originally interpreted UTF-8 to use a maximum of 3 bytes to encode a character, that is either their own misreading of the specification or a deliberate subsetting of the problem, but in any case that company cannot claim to be a "victim of Unicode" when they have implemented a clearly specified Unicode standard incorrectly.

-Doug Ewell Fullerton, California
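The 4-byte form referred to here follows the familiar bit layout 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, which holds 21 bits and therefore reaches U+1FFFFF. A hand-written Python sketch of just that one form is given below. The function name `encode_four_bytes` is invented for the illustration, and the sketch says nothing about what any vendor actually shipped.

```python
def encode_four_bytes(code_point: int) -> bytes:
    """Pack a value in 0x10000..0x1FFFFF into the 4-byte UTF-8 bit layout."""
    if not 0x10000 <= code_point <= 0x1FFFFF:
        raise ValueError("the 4-byte form covers 0x10000..0x1FFFFF only")
    return bytes([
        0xF0 | (code_point >> 18),           # 11110xxx: top 3 bits
        0x80 | ((code_point >> 12) & 0x3F),  # 10xxxxxx: next 6 bits
        0x80 | ((code_point >> 6) & 0x3F),   # 10xxxxxx: next 6 bits
        0x80 | (code_point & 0x3F),          # 10xxxxxx: last 6 bits
    ])


if __name__ == "__main__":
    # Within the Unicode range the result should agree with Python's own codec.
    for cp in (0x10000, 0x10400, 0x10FFFF):
        assert encode_four_bytes(cp) == chr(cp).encode("utf-8")
    print(encode_four_bytes(0x10400).hex(" "))   # prints "f0 90 90 80"
```

Values above 0x10FFFF fit the bit pattern but lie outside what Unicode assigns, which is the distinction made in the parenthesis above.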

And Oracle itself backed down as well:

Markus Scherer wrote:

> > This means that Oracle mis-implemented the UTF-8 standard as it was specified at that time, starting at least with Unicode 2.0. UTF-8 as a part of the ISO/Unicode standards encoded UCS-4 units=Unicode scalar values in up to 6 bytes. Unicode scalar values reached up to 0x10ffff, which required 4-byte UTF-8. This was with Unicode 2.0 in 1996.
> According to the above, Oracle implemented its new "UTF8" with the intention of implementing Unicode 2.1, and did not, in fact, follow the then-specifications.

No, Oracle does not mis-implement the UTF-8 standard but only limit its support to BMP only. Except the backward compatibility reason, Oracle also needed to be compatible with other database vendors such as IBM and Sybase's UTF-8 support up to BMP only and Microsoft SQL Server Unicode support to UCS2 then.

Regards, Jianping <Jianping.Yang@oracle.com>

Carl Brown:

I agree with you, the problem is that the D800 to DFFF codes were never defined as valid Unicode characters. Encoding these into ED xx xx codes has never produced valid Unicode code points in UTF-8. Therefore any of these codes in the database were never valid Unicode characters at any point in the Unicode standard. As a consequence there is no backwards compatibility issue.
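How a strict decoder treats such ED xx xx sequences today can be checked with a few lines of Python. The helper `is_strict_utf8` is a made-up name, and the sketch only shows the behaviour of Python's current built-in codec. It is not evidence about what older versions of the standard permitted.

```python
def is_strict_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    four_byte_form = b"\xf0\x90\x90\x80"           # U+10400 as 4 bytes
    surrogate_pair = b"\xed\xa0\x81\xed\xb0\x80"   # D801 DC00 as 3 + 3 bytes
    print(is_strict_utf8(four_byte_form))    # prints "True"
    print(is_strict_utf8(surrogate_pair))    # prints "False"
```

Data stored as surrogate pairs of 3 bytes each would therefore need a conversion step before a strict UTF-8 consumer could read it.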

In short: here too we find victims of megalomania, which would not have been necessary had people thought a little more modestly.