[DUG] Upgrading to XE - Unicode strings questions

Tue Nov 23 23:32:38 NZDT 2010

Iterating over a string is for the purpose of doing something with each 
individual character, whether it is an 'A' or an 'A' with a ^ (caret) on 
top of it.   When I said the number of bytes in a character varies I did not 
mean the number of bytes in a Char - I meant that the total number of 
bytes in one resulting character or letter might vary.   For instance the 
word fiancee  (with an acute on the last e) has 7 characters, the last of 
which might be 2 code units.

When I iterate over a string I ideally want to get one character in the word 
each time:

Could I build a string like this?

setlength(String1,7);
string1[1] := 'f';
string1[2] := 'i';
string1[3] := 'a';
string1[4] := 'n';
string1[5] := 'c';
string1[6] := 'e';
string1[7] := 'e';            //I would want the full e acute here

Hence I want to be able to go

    for i := 1 to length(string1) do
    begin
        thisChar := string1[i];        //get each character one at a time
        listbox1.items.add('i=' + inttostr(i) +
            '  character at position i = ' + thisChar);
    end;

I would be expecting to see 7 characters, 7 lines in the list box, and 
length=7,  with the last being e acute.
Now everything Jolyon is saying, and what Cary also implies, is that this is 
not going to work.   This looks to be a real nuisance!
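
To make the nuisance concrete, here is a minimal sketch of how a loop could step through a string one displayed character at a time instead of one Char at a time. It is not code from the thread or from the Delphi RTL: the helper names are made up for illustration, the string is assumed to be well-formed UTF-16, and only the basic combining marks block (U+0300..U+036F) is recognised, which is a simplifying assumption.

```delphi
// Hypothetical helpers, for illustration only; not part of the RTL.

// True for the second code unit of a surrogate pair (U+DC00..U+DFFF).
function IsLowSurrogateUnit(C: WideChar): Boolean;
begin
  Result := (Ord(C) >= $DC00) and (Ord(C) <= $DFFF);
end;

// True only for the Combining Diacritical Marks block (U+0300..U+036F).
// Simplifying assumption: other combining ranges are ignored.
function IsBasicCombiningMark(C: WideChar): Boolean;
begin
  Result := (Ord(C) >= $0300) and (Ord(C) <= $036F);
end;

// Number of code units making up the displayed character that starts at
// S[Index]. Assumes 1 <= Index <= Length(S).
function DisplayCharLen(const S: UnicodeString; Index: Integer): Integer;
var
  I: Integer;
begin
  I := Index + 1;
  // Absorb any following low surrogate or combining mark into this character.
  while (I <= Length(S)) and
        (IsLowSurrogateUnit(S[I]) or IsBasicCombiningMark(S[I])) do
    Inc(I);
  Result := I - Index;
end;
```

A loop would then start at i := 1, take n := DisplayCharLen(string1, i), add Copy(string1, i, n) to the list box, and advance i by n, rather than advancing by one Char each time.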

Now I think the e acute could be one Unicode character (as there is likely 
to be a representation using one character, one code point and one code 
unit) or one character made of two code units, 2*2 bytes - a surrogate pair - 
where e.g. one supplies the e and one the acute.   So it looks like what I 
see might vary according to how the e acute is encoded in the string?
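
The two encodings being asked about can be written out explicitly. The following is a minimal sketch, not code from the thread: the function names are made up for illustration, and it assumes Delphi 2009 or later, where UnicodeString holds UTF-16 code units. Note that an e followed by a combining acute (U+0301) is a combining sequence; a surrogate pair is the separate mechanism UTF-16 uses for code points above U+FFFF.

```delphi
// Hypothetical helper names, for illustration only.

// 'fiance' followed by U+00E9 (precomposed e acute): one code unit.
function FianceePrecomposed: UnicodeString;
begin
  Result := 'fiance';
  Result := Result + WideChar($00E9);
end;

// 'fiance' followed by 'e' and U+0301 (combining acute): two code units.
function FianceeDecomposed: UnicodeString;
begin
  Result := 'fiancee';
  Result := Result + WideChar($0301);
end;
```

Length(FianceePrecomposed) is 7 and Length(FianceeDecomposed) is 8, even though both are meant to display as the same seven letters, so a loop from 1 to Length sees a different number of Chars depending on which form the string holds.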

As I read further this gets murkier, as some of the things Cary Jensen says 
are not the same as what you say even if you say it emphatically!

This is why I am thinking we have to clearly understand Unicode, and the 
Windows implementation of it... and I don't really yet.

Here is what Cary Jensen says about a similar example with 7 characters, one 
of which is a surrogate pair:

"
Although there are 7 characters in the printed string, the UnicodeString
contains 8 code units, as returned by the Length function. Inspection of the
6th and 7th elements of the UnicodeString reveal the high and low surrogate
values, each of which are code units. And, though the size of the
UnicodeString is 16 bytes, ElementToCharLen accurately returns that there
were a total of 7 code points in the string.

While these answers suffice for surrogate pairs, unfortunately, things are
not exactly the same when it comes to composite characters. Specifically,
when a UnicodeString contains at least one composite character, that
composite character may occupy two or more code units, though only one
actual character will appear in the displayed string. Furthermore,
ElementToCharLen is designed specifically to handle surrogate pairs, and not
composite characters.

Actually, composite characters introduce an issue of string normalization,
which is not currently handled by Delphi's RTL (runtime library). When I
asked Seppy Bloom about this, he replied that Microsoft has recently added
normalization APIs (application programming interfaces) to some of the
latest versions of Windows®, including Windows® Vista, Windows® Server 2008,
and Windows® 7.

Seppy was also kind enough to offer a code sample of how you might count the
number of characters in a UnicodeString that includes at least one composite
character. I am including this code here for your benefit, but I must offer
these cautions. First, this code has not been thoroughly tested, and has not
been certified. If you use it, you do so at your own risk. Second, be aware
that this code will not work on pre-Windows XP installations, and will only
work with Windows XP if you have installed the Microsoft Internationalized
Domain Names (IDN) Mitigation APIs 1.1."

http://www.embarcadero.com/images/dm/technical-papers/delphi-unicode-migration.pdf
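
The difference between code units, code points and bytes in that passage can be sketched without any RTL support. This is an illustration only, not the code from the paper: CountCodePoints and ByteSize are hypothetical names, and the sketch assumes well-formed UTF-16 (every low surrogate is preceded by a high surrogate).

```delphi
// Hypothetical helpers, for illustration only.

// Counts code points by skipping the second half of each surrogate pair.
// Composite (combining) sequences still count as more than one code point.
function CountCodePoints(const S: UnicodeString): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    // Low surrogates (U+DC00..U+DFFF) never start a code point.
    if (Ord(S[I]) < $DC00) or (Ord(S[I]) > $DFFF) then
      Inc(Result);
end;

// Size of the character data in bytes: two bytes per UTF-16 code unit.
function ByteSize(const S: UnicodeString): Integer;
begin
  Result := Length(S) * SizeOf(WideChar);
end;
```

For a string such as 'Hello', then the surrogate pair U+D840 U+DC00, then '!', Length gives 8, CountCodePoints gives 7 and ByteSize gives 16, the same pattern of numbers as in the quoted passage. For an e followed by U+0301, CountCodePoints still gives 2, which is the composite-character problem the quote goes on to describe.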

Elsewhere he implies that Delphi can handle normalised strings for 
comparisons if one is careful, as in

var
  s1, s2: String;
begin
  ListBox1.Items.Clear;
  //o + combining diaeresis (U+0308): combining characters, not surrogate pairs
  s1 := 'Hell'#$006F + #$0308' W'#$006F + #$0308'rld';
  s2 := 'Hellö Wörld';
  ListBox1.Items.Add(s1);
  ListBox1.Items.Add(s2);
  ListBox1.Items.Add(BoolToStr(s1 = s2, True));
  ListBox1.Items.Add(BoolToStr(AnsiCompareStr(s1, s2) = 0, True));
end;

The contents of ListBox1 are shown in the following figure.

Hellö Wörld
Hellö Wörld
False
True
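
For the plain = comparison to succeed, both strings would first have to be brought to the same normalization form. Below is a rough sketch of how that might look using the Windows normalization API mentioned in the quote. It is not the sample from the paper: ToNFC is a hypothetical name, the sketch assumes Windows Vista or later (where Normaliz.dll exports NormalizeString), and error handling is reduced to returning the input unchanged.

```delphi
// Windows API import; assumes Normaliz.dll is present (Vista or later).
function NormalizeString(NormForm: Integer; lpSrcString: PWideChar;
  cwSrcLength: Integer; lpDstString: PWideChar;
  cwDstLength: Integer): Integer; stdcall; external 'Normaliz.dll';

// Hypothetical helper: returns S in normalization form C (precomposed),
// or S unchanged if the API reports an error.
function ToNFC(const S: UnicodeString): UnicodeString;
const
  NormalizationC = 1;  // value of NormalizationC in the Windows NORM_FORM enum
var
  Needed, Written: Integer;
begin
  Result := S;
  if S = '' then
    Exit;
  // First call with no buffer asks for an estimate of the required length.
  Needed := NormalizeString(NormalizationC, PWideChar(S), Length(S), nil, 0);
  if Needed <= 0 then
    Exit;
  SetLength(Result, Needed);
  // Second call fills the buffer and returns the actual length written.
  Written := NormalizeString(NormalizationC, PWideChar(S), Length(S),
    PWideChar(Result), Needed);
  if Written > 0 then
    SetLength(Result, Written)
  else
    Result := S;
end;
```

With this in place, ToNFC(s1) = ToNFC(s2) should compare the precomposed forms of both strings, so the o plus U+0308 sequences in s1 would be expected to match the ö in s2.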

Now I am not sure if the above example will show properly in email - because 
email text is generally limited to the ASCII characters, and lists like this 
usually also restrict posts to text rather than HTML emails.   So as a 
related exercise I am curious whether the above example prints OK on the 
list... In case it doesn't arrive like that on the list, the words hello and 
world should have an umlaut  (..) over each o.

John

As I understand it iterating over a string with Chars does get around the 
problem of surrogate pairs

It depends what you mean by “get around the problem”.