Multilingual Fonts and Typography - A Fountain of Fonts / Meertalige fonts en Typografie - Fontavond/turen: Papyrus 7 and its UTF

Papyrus 7 and its UTF: the Papyrus internal Unicode encoding

Following a discussion in a Polish Atari newsgroup about the new Papyrus, I took a closer look at how the Polish characters are stored.

character	Papyrus bytes (hex)	Unicode

Aogonek	82 04	U+0104
aogonek	82 05	U+0105
Cacute	82 06	U+0106
cacute	82 07	U+0107
Eogonek	82 18	U+0118
eogonek	82 19	U+0119
Lslash	82 41	U+0141
lslash	82 42	U+0142
Nacute	82 43	U+0143
nacute	82 44	U+0144
Oacute	81 53	U+00D3
oacute	81 73	U+00F3
Sacute	82 5A	U+015A
sacute	82 5B	U+015B
Zacute	82 79	U+0179
zacute	82 7A	U+017A
Zdot	82 7B	U+017B
zdot	82 7C	U+017C
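
The pattern in this table can be checked with a few lines of Python. This is only a sketch of what the bytes suggest, not the documented Papyrus format: it assumes the first byte is 0x80 plus the code point shifted right by seven bits and the second byte is the low seven bits. The function name `papyrus_two_byte` is made up for this illustration.

```python
def papyrus_two_byte(code_point: int) -> bytes:
    """Encode a code point as two bytes, following the pattern in the table.

    Assumption: only code points 0x80..0x3FFF use this two-byte form.
    """
    if not 0x80 <= code_point <= 0x3FFF:
        raise ValueError("outside the two-byte range assumed here")
    lead = 0x80 | (code_point >> 7)   # upper bits, flagged with the high bit
    trail = code_point & 0x7F         # lower seven bits
    return bytes([lead, trail])


# A few rows from the table above
for name, cp in [("Aogonek", 0x0104), ("Oacute", 0x00D3), ("zdot", 0x017C)]:
    print(name, papyrus_two_byte(cp).hex(" ").upper())
```

Under that assumption the three sample rows come out as 82 04, 81 53 and 82 7C, matching the bytes found in the file.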

With a little puzzling it follows that Papyrus uses a UTF-like encoding. UTF stands for Unicode Transformation Format. Unicode / ISO 10646 was originally based on a 16-bit code, as opposed to 7-bit ASCII or the extended 8-bit ASCII, giving room for 65536 different characters. That should be more than enough for all 'living' languages. Alas, nobody is satisfied that easily: traffic signs, Quechua knots, musical notation, company logos and so on all deserve a place in Unicode space, and then 16 bits are no longer sufficient. And although a 16-bit [or 2-byte] code merely doubles the storage needed for characters [never mind how much space the other multimedia garbage swallows], for the Anglo-Saxon world, which really needs only 7 bits, it is simply a gross waste of space.

The Papyrus UTF

Unicode (hex)	Papyrus UTF (hex)	Unicode (decimal)

00 - 7F	00 - 7F = US-ASCII	000 - 127
80 - 9F	[address ]
A0 - FF	81 00 - 81 7F	128 - 255
100 - 17F	82 00 - 82 7F	256 - 383
180 - 1FF	83 00 - 83 7F	384 - 511
200 - 27F	84 00 - 84 7F	512 - 639
280 - 2FF	85 00 - 85 7F	640 - 767
300 - 37F	86 00 - 86 7F	768 - 895
380 - 3FF	87 00 - 87 7F	896 - 1023
400 - 47F	88 00 - 88 7F	1024 - 1151
etc.
	FF 00 - FF 7F	16256 - 16383
3 bytes:
	81 80 00 - 81 FF 7F	16384 - 32767
	82 80 00 - 82 FF 7F	32768 - 49151
	83 80 00 - 83 FF 7F	49152 - 65535
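
The whole table is consistent with a big-endian base-128 scheme in which every byte except the last has its high bit set, the same shape as the variable-length quantity used in MIDI files. The sketch below rests on that reading, which is an inference from the observed bytes and not a published specification; `papyrus_encode` is a hypothetical name, and the upper limit of 0xFFFF is taken from the table.

```python
def papyrus_encode(code_point: int) -> bytes:
    """Encode a code point in 1 to 3 bytes, seven payload bits per byte.

    Assumption: all bytes but the last carry the high bit (0x80),
    and the range ends at 0xFFFF as the table suggests.
    """
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("the table covers 0..65535 only")
    groups = [code_point & 0x7F]          # final byte: high bit clear
    code_point >>= 7
    while code_point:
        groups.append(0x80 | (code_point & 0x7F))  # leading bytes: high bit set
        code_point >>= 7
    return bytes(reversed(groups))


# Range boundaries taken from the table
for cp in (0x7F, 0x80, 16383, 16384, 32768, 65535):
    print(cp, papyrus_encode(cp).hex(" ").upper())
```

With this reading, 16383 gives FF 7F, 16384 gives 81 80 00 and 65535 gives 83 FF 7F, which agrees with the range limits listed above.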

Papyrus uses a UTF that divides the Unicode / ISO 10646 space into chunks of 128 [not 64].

After '65535' it simply stops. How did I find the representation of the higher Unicode numbers? Quite easily, by writing an HTML file containing &#16384; and similar numeric character references. Papyrus converts this HTML file into a Papyrus document and displays the appropriate characters from the Unicode set; any character missing from the available font shows up as an empty square. When the text is saved in Papyrus format everything is kept, as can be seen by opening the *.PAP file in a hexadecimal editor.

As could be expected, values above '65535' generally yield nothing, but unexpectedly a few arbitrary higher values did produce a 3-byte code: '98395' becomes 82 80 15 and '99999' becomes 82 8D 1F! Since these 3-byte codes already represent values below '65536', there appears to be a small flaw in the R.O.M. logicware algorithm.
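
To see which values those stray 3-byte codes stand for, the bytes can be decoded again. The sketch below assumes seven payload bits per byte, with the high bit marking every byte except the last; `papyrus_decode_one` is a hypothetical helper name, and the idea that Papyrus keeps only the low 16 bits of an oversized number is a guess, not something confirmed by R.O.M. logicware.

```python
def papyrus_decode_one(data: bytes) -> int:
    """Decode one character: accumulate seven bits per byte until a byte < 0x80."""
    value = 0
    for byte in data:
        value = (value << 7) | (byte & 0x7F)
        if byte < 0x80:          # high bit clear: this was the final byte
            return value
    raise ValueError("sequence ends without a final byte below 0x80")


print(papyrus_decode_one(bytes([0x82, 0x8D, 0x1F])))  # bytes stored for 99999
print(99999 % 65536)                                  # low 16 bits of 99999
print(papyrus_decode_one(bytes([0x82, 0x80, 0x15])))  # bytes stored for 98395
print(98395 % 65536)                                  # low 16 bits of 98395
```

For 99999 both lines give 34463, so a plain 16-bit truncation would explain that case. For 98395 the bytes decode to 32789 while the low 16 bits are 32859, so that pair does not fit the same explanation; one of the two reported figures may contain a slip, or the behaviour is something this sketch does not capture.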