[DUG] Upgrading to XE - Unicode strings questions

Wed Nov 24 04:27:07 NZDT 2010

John,

I think you are confusing the canonical & normalized versions of the same Unicode string (in the example s1 is the canonical, decomposed form; s2 is the normalized, precomposed form) with the effect of local codepage conversion.

The Windows-1252 codepage (Western European, a close relative of ISO 8859-1) has support for characters like "ö" (code #246) and "é" (code #233). Converting to ansistring/ansichar on your system will take care of the canonical Unicode representation and hence return true if you compare those strings. Please note that this only works because your system is set to a Latin-based codepage ... do the same on a Japanese version of Windows and you'll get a very different result, as there is no support for "ö" in ansistring under the Japanese codepage! Because your system is Latin, your first testcase/example of building the word "fiancee" should actually work without problems - Jolyon/Cary are probably wrong if they indeed implied that this wouldn't work.

The "ö" can be written as a compound #$006F + #$0308 in canonical (decomposed) format ... and as #$00F6 in the "normalized" (precomposed) format. For most normal applications it just doesn't really matter either way, because a user who is inputting text under their local codepage will always do it the same way, and hence the chances of you encountering a mix of canonical and normalized versions will be close to zero. You only ever get issues if you cross codepage boundaries (for example if you have users in different countries storing data in a database - which is why international databases often use UTF-8 to store data instead of their native character sets). Most of the better databases (for example Oracle) have built-in support for sorting and handling the canonical format and do the conversion automatically for you ... for someone writing desktop applications it usually just isn't an issue either way.
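
To make the two spellings of "ö" concrete, here is a minimal console sketch. It assumes Delphi 2009 or later on Windows (so that string is a UnicodeString indexed from 1); the program name and the CodeUnitsAsHex helper are made up for illustration and are not part of the RTL.

```pascal
program ComposedVsDecomposed;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: lists the UTF-16 code units of S as hex values
// separated by spaces.
function CodeUnitsAsHex(const S: string): string;
var
  I: Integer;
begin
  Result := '';
  for I := 1 to Length(S) do
  begin
    if I > 1 then
      Result := Result + ' ';
    Result := Result + IntToHex(Ord(S[I]), 4);
  end;
end;

var
  Decomposed, Precomposed: string;
begin
  Decomposed := 'o' + #$0308;   // letter o followed by a combining diaeresis
  Precomposed := #$00F6;        // single code point for o with diaeresis

  Writeln(CodeUnitsAsHex(Decomposed));                  // 006F 0308
  Writeln(CodeUnitsAsHex(Precomposed));                 // 00F6
  Writeln(Length(Decomposed), ' ', Length(Precomposed)); // 2 1

  // The = operator compares code units, so the two forms differ.
  Writeln(BoolToStr(Decomposed = Precomposed, True));   // False

  // Locale-aware comparison; the figure quoted from Cary's paper
  // reports True for the equivalent "Hello World" strings.
  Writeln(BoolToStr(AnsiCompareStr(Decomposed, Precomposed) = 0, True));
end.
```

Both strings display as the same single letter, yet they differ in length and in a plain = comparison, which is the mix that only shows up when text from different sources meets.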

Kind Regards,
Stefan Mueller 
_______________________
R&D Manager
ORCL Toolbox LLP, Japan
http://www.orcl-toolbox.com 

-----Original Message-----
From: delphi-bounces at delphi.org.nz [mailto:delphi-bounces at delphi.org.nz] On Behalf Of John Bird
Sent: Tuesday, November 23, 2010 7:33 PM
To: NZ Borland Developers Group - Delphi List
Subject: Re: [DUG] Upgrading to XE - Unicode strings questions

Iterating over a string is for the purpose of doing something with each individual character......whether it is an 'A' or an 'A' with a ^ (caret) on top of it. When I said the number of bytes in a character varies I did not mean the number of bytes in a Char - I meant that the total number of bytes in one resulting character or letter might vary. For instance the word fiancee (with an acute on the last e) has 7 characters, the last of which might be 2 code units.

When I iterate over a string I ideally want to get one character in the word each time:

Could I build a string like this?

setlength(String1,7);
string1[1] := 'f';
string1[2] := 'i';
string1[3] := 'a';
string1[4] := 'n';
string1[5] := 'c';
string1[6] := 'e';
string1[7] := 'e';            //I would want the full e acute here

hence I want to be able to go

    for i := 1 to length(string1) do
    begin
            thisChar := string1[i];        //get each character one at a time
            listbox1.items.add('i=' + inttostr(i) + '  character at position i = ' + thisChar);
    end;

I would be expecting to see 7 characters, 7 lines in the list box, and length=7, with the last being e acute.
Now everything Jolyon is saying, and what Cary also implies, is that this is not going to work. This looks to be a real nuisance!

Now I think the e acute could be one unicode character (as there is likely to be a representation using one character, one code point and one code
unit) or as one character, two code units, 2*2 bytes - a surrogate pair - 
where eg one supplies the e and one the acute.   So it looks like what I see 
might vary according to how the e acute is encoded in the string?

As I read further this gets murkier, as some of the things Cary Jensen says are not the same as what you say even if you say it emphatically!

This is why I am thinking we have to understand clearly Unicode, and the Windows implementation of it.....and I don't really yet.

Here is what Cary Jensen says about a similar example with 7 characters, one of which is a surrogate pair:

"
Although there are 7 characters in the printed string, the UnicodeString contains 8 code units, as returned by the Length function. Inspection of the 6th and 7th elements of the UnicodeString reveal the high and low surrogate values, each of which are code units.
And, though the size of the UnicodeString is 16 bytes, ElementToCharLen accurately returns that there were a total of 7 code points in the string.
While these answers suffice for surrogate pairs, unfortunately, things are not exactly the same when it comes to composite characters. Specifically, when a UnicodeString contains at least one composite character, that composite character may occupy two or more code units, though only one actual character will appear in the displayed string. Furthermore, ElementToCharLen is designed specifically to handle surrogate pairs, and not composite characters.
Actually, composite characters introduce an issue of string normalization, which is not currently handled by Delphi's RTL (runtime library). When I asked Seppy Bloom about this, he replied that Microsoft has recently added normalization APIs (application programming interfaces) to some of the latest versions of Windows®, including Windows® Vista, Windows® Server 2008, and Windows® 7.

Seppy was also kind enough to offer a code sample of how you might count the number of characters in a UnicodeString that includes at least one composite character. I am including this code here for your benefit, but I must offer these cautions. First, this code has not been thoroughly tested, and has not been certified. If you use it, you do so at your own risk. Second, be aware that this code will not work on pre-Windows XP installations, and will only work with Windows XP if you have installed the Microsoft Internationalized Domain Names (IDN) Mitigation APIs 1.1."

http://www.embarcadero.com/images/dm/technical-papers/delphi-unicode-migration.pdf
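
The distinction the quoted passage draws between surrogate pairs and composite characters can be sketched roughly as follows. This is only an illustration, assuming Delphi 2009 or later on Windows; CountCodePoints is a hypothetical helper written out by hand rather than an RTL routine, and it is not the code sample from the paper.

```pascal
program SurrogateDemo;
{$APPTYPE CONSOLE}

// Hypothetical helper: counts code points in a UTF-16 string by treating a
// high surrogate immediately followed by a low surrogate as one code point.
// A combining sequence such as e + #$0301 still counts as two code points.
function CountCodePoints(const S: string): Integer;
var
  I: Integer;
begin
  Result := 0;
  I := 1;
  while I <= Length(S) do
  begin
    if (Ord(S[I]) >= $D800) and (Ord(S[I]) <= $DBFF) and (I < Length(S)) and
       (Ord(S[I + 1]) >= $DC00) and (Ord(S[I + 1]) <= $DFFF) then
      Inc(I, 2)   // well-formed surrogate pair: two code units, one code point
    else
      Inc(I);     // BMP character (or a lone surrogate): one code unit
    Inc(Result);
  end;
end;

var
  WithSurrogates, WithCombining: string;
begin
  // U+1D11E (musical G clef) needs the surrogate pair D834 DD1E in UTF-16.
  WithSurrogates := 'a' + #$D834#$DD1E + 'b';
  // "fiancee" with a combining acute accent after the last e.
  WithCombining := 'fiancee' + #$0301;

  Writeln(Length(WithSurrogates), ' ', CountCodePoints(WithSurrogates)); // 4 3
  Writeln(Length(WithCombining), ' ', CountCodePoints(WithCombining));   // 8 8
end.
```

Counting code points corrects the total for the surrogate pair, but the second string still reports 8 although only 7 letters are displayed. That gap is the normalization issue the passage describes.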

Elsewhere he implies that Delphi can handle normalised strings for comparisons if one is careful, as in

var
  s1, s2: String;
begin
  ListBox1.Items.Clear;
  s1 := 'Hell'#$006F + #$0308' W'#$006F + #$0308'rld';   //built from o + combining diaeresis (not surrogate pairs)
  s2 := 'Hellö Wörld';
  ListBox1.Items.Add(s1);
  ListBox1.Items.Add(s2);
  ListBox1.Items.Add(BoolToStr(s1 = s2, True));
  ListBox1.Items.Add(BoolToStr(AnsiCompareStr(s1, s2) = 0, True));
end;

The contents of ListBox1 are shown in the following figure.

Hellö Wörld
Hellö Wörld
False
True
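
If a plain = comparison is wanted, one option is to bring both strings into the same normalization form first. The sketch below is untested and rests on several assumptions: Delphi 2009 or later, a Windows version that ships Normaliz.dll (see the cautions quoted above), and a hand-written import of the Windows NormalizeString function, since the RTL is assumed not to declare it. ToNFC is a hypothetical helper name.

```pascal
program NormalizeDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

const
  NormalizationC = 1;  // NORM_FORM value for Normalization Form C (composed)

// Hand-written import of the Windows API function (an assumption: it is
// not taken from an RTL unit).
function NormalizeString(NormForm: Integer; lpSrcString: PWideChar;
  cwSrcLength: Integer; lpDstString: PWideChar;
  cwDstLength: Integer): Integer; stdcall; external 'Normaliz.dll';

// Hypothetical helper: returns S converted to Normalization Form C.
function ToNFC(const S: string): string;
var
  Needed, Written: Integer;
begin
  Result := '';
  if S = '' then
    Exit;
  // With a nil destination the API returns an estimated buffer length.
  Needed := NormalizeString(NormalizationC, PWideChar(S), Length(S), nil, 0);
  if Needed <= 0 then
    RaiseLastOSError;
  SetLength(Result, Needed);
  Written := NormalizeString(NormalizationC, PWideChar(S), Length(S),
    PWideChar(Result), Needed);
  if Written <= 0 then
    RaiseLastOSError;
  SetLength(Result, Written);  // trim the estimate to the actual length
end;

var
  s1, s2: string;
begin
  s1 := 'Hell'#$006F + #$0308' W'#$006F + #$0308'rld';  // decomposed
  s2 := 'Hell'#$00F6' W'#$00F6'rld';                    // precomposed
  Writeln(BoolToStr(s1 = s2, True));                // False: code units differ
  Writeln(BoolToStr(ToNFC(s1) = ToNFC(s2), True));  // expected True once both are in NFC
end.
```

Normalizing once, when text enters the program, would keep later comparisons and Length calls consistent. Whether that is worth the dependency on Normaliz.dll is a judgment call for a desktop application.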

Now I am not sure if the above example will show properly in email - because email text is generally limited to the ASCII characters, and lists like this usually also restrict to plain text and not HTML emails. So as a related exercise I am curious whether the above example prints OK on the list......the words hello and world should have an umlaut (two dots) over each o, in case it doesn't arrive like that on the list.

John

As I understand it iterating over a string with Chars does get around the problem of surrogate pairs

It depends what you mean by “get around the problem”.
