[DUG] Upgrading to XE - Unicode strings questions
Jolyon Smith
jsmith at deltics.co.nz
Wed Nov 24 09:39:01 NZDT 2010
John, the problem is that in Unicode "single character" is meaningless unless you have performed some pre-processing to GIVE that term some meaning. There are some standard forms for such processing, called "Normalisations".
The problem is that a single "character" to your eyes, e.g. an accented "a", could be represented in a Unicode string in at least two ways:
1. A single codepoint representing that accented "a"
2. TWO codepoints - the first representing "a" and the
second a diacritic codepoint for the accent
> Iterating over a string is for the purpose of doing something with each
> individual character
That's fine, but in Unicode what you have is a string not of characters but of codepoints. The concept of a "character" is not synonymous with "codepoint" in Unicode in the same way that it is with ASCII or even ANSI.
So you have compounded complications:
a. Depending on encoding, a single codepoint (a value of up to
21 bits, U+0000 to U+10FFFF) may be encoded in 1, 2, or more
bytes. Each byte may represent a whole codepoint or only part
of a codepoint encoding.
b. Each codepoint may represent a whole character or only
PART of a character encoding.
Complication 'a' can be avoided by adopting UTF-32 encoding - 4 bytes for EVERY codepoint. That is hugely wasteful in terms of memory/storage for most applications. UTF-16 - the encoding used by Delphi and indeed natively by Windows itself - is a compromise. It is less efficient than ANSI for ASCII text, but more efficient than UTF-32 for ANSI character sets represented in the BMP.
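To put rough numbers on that trade-off, here is a minimal sketch that measures one short string in each encoding. It assumes Delphi 2009 or later (where string is a UTF-16 UnicodeString). The program name and the sample word are only illustrations, and the UTF-32 figure is computed by hand rather than by an RTL call.

```pascal
program EncodingSizes;
{$APPTYPE CONSOLE}

var
  S: string;  // UnicodeString (UTF-16) in Delphi 2009 and later
begin
  // 'fiance' followed by the precomposed e-acute, U+00E9: 7 codepoints, all in the BMP
  S := 'fiance' + WideChar($00E9);

  // UTF-8: six ASCII letters at 1 byte each, plus 2 bytes for U+00E9
  Writeln('UTF-8 : ', Length(UTF8Encode(S)), ' bytes');

  // UTF-16: every BMP codepoint takes one 2-byte code unit
  Writeln('UTF-16: ', Length(S) * SizeOf(Char), ' bytes');

  // UTF-32: always 4 bytes per codepoint. Length(S) equals the codepoint
  // count here only because this string contains no surrogate pairs.
  Writeln('UTF-32: ', Length(S) * 4, ' bytes');
end.
```

For this string the three figures come out at 8, 14 and 28 bytes respectively.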
For applications working entirely in the BMP, UTF-16 is also relatively easy to process - every 16-bit value is a whole codepoint, and for strings NORMALISED to a composed form, each codepoint IS a character wherever a precomposed codepoint exists for it. But for non-normalised data that is still not necessarily the case.
> could I build a string like this?
> setlength(String1,7);
> string1[1] := 'f';
> string1[2] := 'i';
> string1[3] := 'a';
> string1[4] := 'n';
> string1[5] := 'c';
> string1[6] := 'e';
> string1[7] := 'e'; //I would want the full e acute here
Yes, you can.
But you might also *receive*, from another source, a string that is apparently the same at the visual representation level but different at the data level, where:
string1[1] = 'f';
string1[2] = 'i';
string1[3] = 'a';
string1[4] = 'n';
string1[5] = 'c';
string1[6] = 'e';
string1[7] = 'e'; // Normal 'e' character, i.e. identical to string1[6]
string1[8] = U+0301; // Combining acute diacritic
When displayed on screen this string will appear identical to your string, but it is represented in the data in a different way.
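A minimal sketch of those two strings side by side, assuming Delphi 2009 or later. The variable names are mine, and WideChar($00E9) / WideChar($0301) are used only to avoid depending on the source file's encoding.

```pascal
program TwoForms;
{$APPTYPE CONSOLE}

var
  Composed, Decomposed: string;
begin
  // Your string: the last element is the single e-acute codepoint, U+00E9
  Composed := 'fiance' + WideChar($00E9);

  // The "received" string: a plain 'e' followed by the combining acute, U+0301
  Decomposed := 'fiancee' + WideChar($0301);

  Writeln(Length(Composed));       // 7
  Writeln(Length(Decomposed));     // 8

  // The = operator compares the stored 16-bit values, not what is drawn on screen
  Writeln(Composed = Decomposed);  // FALSE
end.
```

To have the two compare as equal you would first need to bring both to the same normalisation form.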
> hence I want to be able to go
> for i :=1 to length(string1) do
> begin
> ..
> end
> Now everything Jolyon are saying and Cary also implies that this is
> not going to work. This looks to be a real nuisance!
I don't know what gave you that impression from what I said.
Yes, Unicode is/can be a real nuisance - *properly* supporting it is a lot more work than people think - but what you want to do here can be done.
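As a rough sketch of one way to do it (not the only way, and far from complete): loop over the string exactly as you wanted, but skip any element that merely continues the previous character. Both helper functions below are hypothetical names of my own. The combining check covers only the U+0300..U+036F block, which is an assumption made to keep the example short; other combining ranges exist.

```pascal
program IterateChars;
{$APPTYPE CONSOLE}

// Hypothetical helper: True for the Combining Diacritical Marks block only
// (U+0300..U+036F). Other combining ranges are deliberately ignored here.
function IsCombiningMark(C: Char): Boolean;
begin
  Result := (Ord(C) >= $0300) and (Ord(C) <= $036F);
end;

// Hypothetical helper: True for the second half of a surrogate pair.
function IsLowSurrogate(C: Char): Boolean;
begin
  Result := (Ord(C) >= $DC00) and (Ord(C) <= $DFFF);
end;

var
  S: string;
  I, Visible: Integer;
begin
  S := 'fiancee' + WideChar($0301);  // decomposed form: 8 elements
  Visible := 0;
  for I := 1 to Length(S) do
  begin
    // An element that only continues the previous character
    // does not start a new visual character.
    if IsCombiningMark(S[I]) or IsLowSurrogate(S[I]) then
      Continue;
    Inc(Visible);
  end;
  Writeln(Length(S));  // 8 elements in the data
  Writeln(Visible);    // 7 visual characters
end.
```

The loop itself is unchanged from what you wrote. The extra work is deciding, per element, whether it starts a new character.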
> Now I think the e acute could be one unicode character (as there is likely
> to be a representation using one character, one code point and one code
> unit) or as one character, two code units, 2*2 bytes - a surrogate pair -
> where eg one supplies the e and one the acute.
NO!!! This is NOT what a surrogate pair is.
A surrogate pair is encountered ONLY in UTF-16, and is found when you have a codepoint that is not in the BMP, i.e. a value > 65535 that cannot be encoded in a single 16-bit value. These are typically CJKV characters (Chinese/Japanese/Korean/Vietnamese), sometimes called Han or Kanji character sets.
The first 16-bit value (the high surrogate, in the range D800 to DBFF) indicates a "page" outside the BMP. The following 16-bit value (the low surrogate, DC00 to DFFF) then identifies an entry in that "page". To obtain the codepoint that the PAIR of VALUES represents, you have to apply a transform, combining the page selector with the page entry. But what you get is a single codepoint. (You don't have to do this yourself - there are routines to do it for you, but you have to invoke them as appropriate.)
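For the curious, the transform is simple arithmetic. The sketch below spells it out by hand. SurrogatesToCodePoint is a hypothetical helper name of my own, not an RTL routine, and the sample value U+20000 is just a convenient non-BMP codepoint.

```pascal
program SurrogateDecode;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: combines a high and a low surrogate into the single
// codepoint they encode. Assumes Hi is in D800..DBFF and Lo is in DC00..DFFF;
// no validation is done in this sketch.
function SurrogatesToCodePoint(Hi, Lo: Char): Integer;
begin
  Result := $10000 + ((Ord(Hi) - $D800) shl 10) + (Ord(Lo) - $DC00);
end;

var
  S: string;
begin
  // U+20000 lies outside the BMP and is stored in UTF-16 as D840 DC00
  S := WideChar($D840) + WideChar($DC00);

  Writeln(Length(S));                                       // 2 (16-bit values)
  Writeln(IntToHex(SurrogatesToCodePoint(S[1], S[2]), 5));  // 20000 (ONE codepoint)
end.
```

In real code you would normally call the library routines rather than do this yourself. I believe the Character unit in Delphi 2009 and later has helpers of this kind (TCharacter.ConvertToUtf32, for example), but check the documentation for your version.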
A Surrogate Pair is a representation of a single codepoint, NOT a relationship between TWO codepoints.
When you have a visual character encoded as a codepoint + a following, combining codepoint, that is simply TWO Unicode codepoints that are combined to form one VISUAL "character". That is NOT a surrogate pair however. It is merely two codepoints that have to be combined.