[DUG] Upgrading to XE - Unicode strings questions

Wed Nov 24 09:39:01 NZDT 2010

John, the problem is that in Unicode "single character" is meaningless unless you have performed some pre-processing to GIVE that term some meaning.  There are some standard forms for such processing, called "Normalisations".

The problem is that a single "character" to your eyes, e.g. an accented "a", could be represented in a Unicode string in at least two ways:

  1.  A single codepoint representing that accented "a"

  2.  TWO codepoints - the first representing "a" and the
      second a diacritic codepoint for the accent

> Iterating over a string is for the purpose of doing something with each
> individual character

That's fine, but in Unicode what you have is a string not of characters but of codepoints.  The concept of a "character" is not synonymous with "codepoint" in Unicode in the same way that it is with ASCII or even ANSI.

So you have compounded complications:

a.  Depending on encoding, a single codepoint (a value in the
     range U+0000 to U+10FFFF) may be encoded in 1, 2, or more
     bytes.  Each byte may represent a whole codepoint or only
     part of a codepoint encoding.

b.  Each codepoint may represent a whole character or only 
     PART of a character encoding.

Complication 'a' can be avoided by adopting UTF-32 encoding - 4 bytes for EVERY codepoint.  That is hugely wasteful in terms of memory/storage for most applications.  UTF-16 - the encoding used by Delphi and indeed natively by Windows itself - is a compromise.  It is less efficient than ANSI for ASCII, but more efficient than UTF-32 for the ANSI character sets represented in the BMP.
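
To put some numbers on that trade-off, here is a minimal sketch that works out how many bytes one codepoint needs in each encoding.  The function names are my own, purely for illustration, not RTL routines, and the sizes follow from the encoding rules rather than from anything Delphi-specific.

```pascal
program EncodedSizes;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Bytes needed to encode one codepoint in UTF-8 (hypothetical helper).
function Utf8ByteCount(CodePoint: Cardinal): Integer;
begin
  if CodePoint < $80 then
    Result := 1
  else if CodePoint < $800 then
    Result := 2
  else if CodePoint < $10000 then
    Result := 3
  else
    Result := 4;
end;

// Bytes needed in UTF-16: one 16-bit code unit inside the BMP,
// two code units (a surrogate pair) outside it.
function Utf16ByteCount(CodePoint: Cardinal): Integer;
begin
  if CodePoint < $10000 then
    Result := 2
  else
    Result := 4;
end;

// UTF-32 uses four bytes for every codepoint, whatever its value.
function Utf32ByteCount(CodePoint: Cardinal): Integer;
begin
  Result := 4;
end;

procedure Show(const Name: string; CodePoint: Cardinal);
begin
  Writeln(Name, ': UTF-8 = ', Utf8ByteCount(CodePoint),
    ', UTF-16 = ', Utf16ByteCount(CodePoint),
    ', UTF-32 = ', Utf32ByteCount(CodePoint), ' bytes');
end;

begin
  Show('U+0061 (a)', $0061);
  Show('U+00E9 (e acute)', $00E9);
  Show('U+20AC (euro sign)', $20AC);
  Show('U+1D11E (musical G clef, outside the BMP)', $1D11E);
end.
```

Only the last of those needs more than one 16-bit value in UTF-16, which is why BMP-only applications find UTF-16 comparatively easy to live with.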

For applications working entirely in the BMP, UTF-16 is also relatively easy to process - for strings NORMALISED to a composed form, each codepoint is usually a character (in the BMP), although combinations that have no precomposed codepoint remain as sequences even then.  For non-normalised data you cannot rely on it at all.

> could I build a string like this?

> setlength(String1,7);
> string1[1] := 'f';
> string1[2] := 'i';
> string1[3] := 'a';
> string1[4] := 'n';
> string1[5] := 'c';
> string1[6] := 'e';
> string1[7] := 'e';            //I would want the full e acute here

Yes, you can.

But you might also *receive*, from another source, a string that looks the same on screen but is different at the data level, where:

 string1[1] = 'f';
 string1[2] = 'i';
 string1[3] = 'a';
 string1[4] = 'n';
 string1[5] = 'c';
 string1[6] = 'e';
 string1[7] = 'e';            // Normal 'e' character, i.e. identical to string1[6]
 string1[8] = U+0301;         // Combining acute diacritic

When displayed on screen this string will appear identical to your string, but it is represented in the data in a different way.
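
A small sketch of the two spellings, for anyone who wants to see it in the debugger.  BuildPrecomposed and BuildDecomposed are names I made up for this example; the only assumption is a Unicode Delphi (2009 or later), where UnicodeString holds UTF-16 code units.

```pascal
program TwoSpellings;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// 'fiance' followed by U+00E9, the e acute as a single codepoint.
function BuildPrecomposed: UnicodeString;
begin
  Result := 'fiance';
  Result := Result + WideChar($00E9);
end;

// 'fiance' followed by a plain 'e' and U+0301, the combining acute.
function BuildDecomposed: UnicodeString;
begin
  Result := 'fiancee';
  Result := Result + WideChar($0301);
end;

var
  A, B: UnicodeString;
begin
  A := BuildPrecomposed;
  B := BuildDecomposed;
  Writeln('Precomposed length: ', Length(A));  // 7 code units
  Writeln('Decomposed length:  ', Length(B));  // 8 code units
  // The = operator compares the stored code units, not the rendered text.
  if A = B then
    Writeln('Strings compare equal')
  else
    Writeln('Strings compare NOT equal');
end.
```

The lengths differ and the plain comparison fails, even though both strings are drawn the same.  Comparing them "as a user would" needs normalisation first (or a comparison routine that performs it for you).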

> hence I want to be able to go

>    for i :=1 to length(string1) do
>    begin
> ..
>    end

> Now everything Jolyon  are saying and Cary also implies that this is
> not going to work.   This looks to be a real nuisance!

I don't know what gave you that impression from what I said.

Yes, Unicode is/can be a real nuisance - *properly* supporting it is a lot more work than people think - but what you want to do here can be done.

> Now I think the e acute could be one unicode character (as there is likely 
> to be a representation using one character, one code point and one code 
> unit) or as one character, two code units, 2*2 bytes - a surrogate pair - 
> where eg one supplies the e and one the acute.   

NO!!!  This is NOT what a surrogate pair is.

A surrogate pair is encountered ONLY in UTF-16, and is found when you have a codepoint that is not in the BMP, i.e. a value > 65535 that cannot be encoded in a single 16-bit value.  These are typically the rarer CJKV characters (Chinese/Japanese/Korean/Vietnamese ideographs, sometimes called Han or Kanji characters), along with historic scripts and various symbols.  The commonly used CJKV characters are themselves in the BMP.

The first 16-bit value (the high surrogate, in the range U+D800 to U+DBFF) indicates a "page" outside the BMP.  The following 16-bit value (the low surrogate, U+DC00 to U+DFFF) then identifies an entry in that "page".  To obtain the codepoint that the PAIR of VALUES represents, you have to apply a transform, combining the page selector with the page entry.  But what you get is a single codepoint.  (You don't have to write this yourself - there are routines to do it for you, but you have to invoke them as appropriate.)

A Surrogate Pair is a representation of a single codepoint, NOT a relationship between TWO codepoints.
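
For the curious, the transform can be sketched as plain arithmetic.  The two routines below are hand-rolled illustrations with names of my own choosing, not the RTL routines, and they do no validation of their inputs; in real code you would normally call whatever your Delphi version's RTL provides instead.

```pascal
program DecodeSurrogates;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Combine a high (lead) and a low (trail) surrogate into ONE codepoint.
// Assumes Lead is in $D800..$DBFF and Trail is in $DC00..$DFFF.
function SurrogatesToCodePoint(Lead, Trail: Word): Cardinal;
begin
  Result := $10000
    + ((Cardinal(Lead) - $D800) shl 10)
    + (Cardinal(Trail) - $DC00);
end;

// Split a codepoint above $FFFF into its two UTF-16 surrogates.
// Assumes CodePoint is in $10000..$10FFFF.
procedure CodePointToSurrogates(CodePoint: Cardinal; out Lead, Trail: Word);
var
  Offset: Cardinal;
begin
  Offset := CodePoint - $10000;
  Lead := Word($D800 + (Offset shr 10));    // top 10 bits select the "page"
  Trail := Word($DC00 + (Offset and $3FF)); // low 10 bits select the entry
end;

var
  Lead, Trail: Word;
begin
  CodePointToSurrogates($1D11E, Lead, Trail);
  Writeln(IntToHex(Integer(Lead), 4), ' ', IntToHex(Integer(Trail), 4));
  Writeln(IntToHex(Integer(SurrogatesToCodePoint(Lead, Trail)), 5));
end.
```

For U+1D11E this gives the pair D834 DD1E, and feeding that pair back through the first function returns 1D11E: two 16-bit values in the string, one codepoint.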

When you have a visual character encoded as a codepoint + a following, combining codepoint, that is simply TWO Unicode codepoints that are combined to form one VISUAL "character".  That is NOT a surrogate pair however.  It is merely two codepoints that have to be combined.
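
To show both effects side by side in the kind of loop you were asking about, here is a deliberately simplified sketch.  All the names are mine.  It ASSUMES that the only combining codepoints you will meet are those in the Combining Diacritical Marks block (U+0300 to U+036F); proper segmentation into user-perceived characters has many more rules than this, so treat it as an illustration only.

```pascal
program CountVisual;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

function IsLeadSurrogate(C: WideChar): Boolean;
begin
  Result := (Ord(C) >= $D800) and (Ord(C) <= $DBFF);
end;

function IsTrailSurrogate(C: WideChar): Boolean;
begin
  Result := (Ord(C) >= $DC00) and (Ord(C) <= $DFFF);
end;

// ASSUMPTION: only U+0300..U+036F is treated as combining.
function IsCombiningMark(C: WideChar): Boolean;
begin
  Result := (Ord(C) >= $0300) and (Ord(C) <= $036F);
end;

// Counts "visual" characters: a surrogate pair counts once, and a
// combining mark is folded into the character that precedes it.
function CountVisualCharacters(const S: UnicodeString): Integer;
var
  I: Integer;
begin
  Result := 0;
  I := 1;
  while I <= Length(S) do
  begin
    if not (IsCombiningMark(S[I]) and (Result > 0)) then
      Inc(Result);
    // Step over both halves of a well-formed surrogate pair together.
    if IsLeadSurrogate(S[I]) and (I < Length(S))
      and IsTrailSurrogate(S[I + 1]) then
      Inc(I, 2)
    else
      Inc(I);
  end;
end;

var
  S: UnicodeString;
begin
  S := 'fiancee';
  S := S + WideChar($0301);                   // plain e + combining acute
  S := S + WideChar($D834) + WideChar($DD1E); // U+1D11E as a surrogate pair
  Writeln('Code units:        ', Length(S));
  Writeln('Visual characters: ', CountVisualCharacters(S));
end.
```

Length reports 10 here while the loop counts 8: the combining acute and the second half of the surrogate pair account for the difference, and for two entirely different reasons.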