[DUG] Upgrading to XE - Unicode strings questions
John Bird
johnkbird at paradise.net.nz
Tue Nov 23 23:32:38 NZDT 2010
Iterating over a string is for the purpose of doing something with each
individual character......whether it is an 'A' or an 'A' with a ^ (caret) on
top of it. When I said the number of bytes in a character varies I was not
meaning the number of bytes in a Char - I was meaning that the total number
of bytes in one resulting character or letter might vary. For instance the
word fiancee (with an acute on the last e) has 7 characters, the last of
which might be 2 code units.
When I iterate over a string I ideally want to get one character in the word
each time:
Could I build a string like this?
setlength(String1,7);
string1[1] := 'f';
string1[2] := 'i';
string1[3] := 'a';
string1[4] := 'n';
string1[5] := 'c';
string1[6] := 'e';
string1[7] := 'e'; //I would want the full e acute here
hence I want to be able to go
for i := 1 to Length(string1) do
begin
  thisChar := string1[i]; //get each character one at a time
  listbox1.items.add('i=' + inttostr(i) + ' character at position i = ' + thisChar);
end;
I would be expecting to see 7 characters, 7 lines in the list box, and
length=7, with the last being e acute.
Now everything Jolyon is saying, and Cary too, implies that this is not
going to work. This looks to be a real nuisance!
Now I think the e acute could be one unicode character (as there is likely
to be a representation using one character, one code point and one code
unit) or one character made of two code units, 2*2 bytes, where eg one
supplies the e and one the acute. (As far as I can tell that second form is
a composite character built with a combining mark, not a surrogate pair,
which is two code units making up a single code point.) So it looks like
what I see might vary according to how the e acute is encoded in the string?
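To make the two encodings concrete, here is a minimal console sketch. The program name and the DumpCodeUnits helper are made up for illustration, and it assumes a Unicode Delphi (2009 or later) where string is UnicodeString. It builds the word both ways and lists the UTF-16 code units, so Length should come out as 7 for the precomposed form and 8 for the decomposed one.

```delphi
program EAcuteForms;
{$APPTYPE CONSOLE}
uses
  SysUtils;

// Hypothetical helper: writes the length of S and the hex value of each
// UTF-16 code unit it contains.
procedure DumpCodeUnits(const Caption, S: string);
var
  i: Integer;
begin
  Write(Caption, ': Length=', Length(S), ' units=');
  for i := 1 to Length(S) do
    Write(IntToHex(Ord(S[i]), 4), ' ');
  Writeln;
end;

begin
  // Precomposed: U+00E9 is e acute as one code point and one code unit.
  DumpCodeUnits('precomposed', 'fiance'#$00E9);
  // Decomposed: a plain e followed by U+0301 COMBINING ACUTE ACCENT,
  // so the one visible letter takes two code units.
  DumpCodeUnits('decomposed', 'fiancee'#$0301);
end.
```

With the decomposed form, a plain for loop over string1[i] would hand back the e and the accent as two separate Chars.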
As I read further this gets murkier, as some of the things Cary Jensen says
are not the same as what you say even if you say it emphatically!
This is why I am thinking we have to understand clearly Unicode, and the
Windows implementation of it.....and I don't really yet.
Here is what Cary Jensen says about a similar example with 7 characters, one
of which is a surrogate pair:
"
Although there are 7 characters in the printed string, the UnicodeString
contains 8 code units, as returned by the Length function. Inspection of
the 6th and 7th elements of the UnicodeString reveal the high and low
surrogate values, each of which are code units. And, though the size of the
UnicodeString is 16 bytes, ElementToCharLen accurately returns that there
were a total of 7 code points in the string.

While these answers suffice for surrogate pairs, unfortunately, things are
not exactly the same when it comes to composite characters. Specifically,
when a UnicodeString contains at least one composite character, that
composite character may occupy two or more code units, though only one
actual character will appear in the displayed string. Furthermore,
ElementToCharLen is designed specifically to handle surrogate pairs, and
not composite characters.

Actually, composite characters introduce an issue of string normalization,
which is not currently handled by Delphi's RTL (runtime library). When I
asked Seppy Bloom about this, he replied that Microsoft has recently added
normalization APIs (application programming interfaces) to some of the
latest versions of Windows®, including Windows® Vista, Windows® Server
2008, and Windows® 7.

Seppy was also kind enough to offer a code sample of how you might count
the number of characters in a UnicodeString that includes at least one
composite character. I am including this code here for your benefit, but I
must offer these cautions. First, this code has not been thoroughly tested,
and has not been certified. If you use it, you do so at your own risk.
Second, be aware that this code will not work on pre-Windows XP
installations, and will only work with Windows XP if you have installed the
Microsoft Internationalized Domain Names (IDN) Mitigation APIs 1.1."
http://www.embarcadero.com/images/dm/technical-papers/delphi-unicode-migration.pdf
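The surrogate pair case in the quote can be tried with a small console sketch like the one below. The test string, the program name and the CountCodePoints helper are my own assumptions and are not taken from the paper; the example uses U+1D11E, which UTF-16 encodes as the pair $D834 $DD1E. It assumes a Unicode Delphi (2009 or later). It counts code points by hand and also calls ElementToCharLen the way the quote describes, so the two results can be compared.

```delphi
program SurrogateCount;
{$APPTYPE CONSOLE}
uses
  SysUtils;

// Hypothetical helper: counts code points by stepping over a whole
// surrogate pair in one go. It knows nothing about composite characters.
function CountCodePoints(const S: string): Integer;
var
  i: Integer;
begin
  Result := 0;
  i := 1;
  while i <= Length(S) do
  begin
    // A high surrogate ($D800..$DBFF) followed by a low surrogate
    // ($DC00..$DFFF) encodes a single code point.
    if (Ord(S[i]) >= $D800) and (Ord(S[i]) <= $DBFF) and
       (i < Length(S)) and
       (Ord(S[i + 1]) >= $DC00) and (Ord(S[i + 1]) <= $DFFF) then
      Inc(i, 2)
    else
      Inc(i);
    Inc(Result);
  end;
end;

var
  S: string;
begin
  // 'Hello', then one surrogate pair in elements 6 and 7, then '!'.
  S := 'Hello'#$D834#$DD1E'!';
  Writeln('Length (code units): ', Length(S));
  Writeln('Size in bytes: ', Length(S) * SizeOf(Char));
  Writeln('Code points (manual count): ', CountCodePoints(S));
  Writeln('ElementToCharLen: ', ElementToCharLen(S, Length(S)));
end.
```

For this string Length is 8 (16 bytes) and the manual count is 7, matching the figures in the quote. Neither count helps with an e followed by a combining acute, since those are two separate code points.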
Elsewhere he implies that Delphi can handle normalised strings for
comparisons if one is careful, as in
var
  s1, s2: String;
begin
  ListBox1.Items.Clear;
  //s1 is built with combining diaereses (composite characters, not surrogate pairs)
  s1 := 'Hell'#$006F + #$0308' W'#$006F + #$0308'rld';
  s2 := 'Hellö Wörld';
  ListBox1.Items.Add(s1);
  ListBox1.Items.Add(s2);
  ListBox1.Items.Add(BoolToStr(s1 = s2, True));
  ListBox1.Items.Add(BoolToStr(AnsiCompareStr(s1, s2) = 0, True));
end;
The contents of ListBox1 are shown in the following figure.
Hellö Wörld
Hellö Wörld
False
True
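The False from s1 = s2 above comes from the two strings holding different code units for the same visible text. A rough, untested sketch of the normalization route mentioned in the quote follows. It assumes Windows Vista or later, where NormalizeString is exported from Normaliz.dll, and a Unicode Delphi. The names WinNormalizeString and ToNFC are hypothetical wrappers of my own, and this is not the code sample from the paper.

```delphi
program NormalizeDemo;
{$APPTYPE CONSOLE}
uses
  SysUtils;

const
  NormalizationC = 1; // NORM_FORM value for canonical composition (NFC)

// Local import of the Windows API; the Pascal-side name is an assumption
// chosen to avoid clashing with any declaration the RTL may already have.
function WinNormalizeString(NormForm: Integer; lpSrcString: PWideChar;
  cwSrcLength: Integer; lpDstString: PWideChar;
  cwDstLength: Integer): Integer; stdcall;
  external 'Normaliz.dll' name 'NormalizeString';

// Hypothetical wrapper: returns S in composed form, or S unchanged if the
// API call fails.
function ToNFC(const S: string): string;
var
  Buf: string;
  Len: Integer;
begin
  Result := S;
  if S = '' then
    Exit;
  // With no output buffer the call returns an estimated buffer length.
  Len := WinNormalizeString(NormalizationC, PWideChar(S), Length(S), nil, 0);
  if Len <= 0 then
    Exit;
  SetLength(Buf, Len);
  // The second call fills the buffer and returns the actual length.
  Len := WinNormalizeString(NormalizationC, PWideChar(S), Length(S),
    PWideChar(Buf), Len);
  if Len <= 0 then
    Exit;
  SetLength(Buf, Len);
  Result := Buf;
end;

var
  s1, s2: string;
begin
  s1 := 'Hell'#$006F#$0308' W'#$006F#$0308'rld'; // o + combining diaeresis
  s2 := 'Hell'#$00F6' W'#$00F6'rld';             // precomposed o umlaut
  Writeln(BoolToStr(s1 = s2, True));
  Writeln(BoolToStr(ToNFC(s1) = ToNFC(s2), True));
end.
```

If the API behaves as documented, the second comparison should come out True because both strings end up in the same composed form, whereas AnsiCompareStr gets there through locale-aware comparison instead.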
Now I am not sure if the above example will show properly in email - because
email text is generally limited to the ASCII characters, and lists like this
usually also restrict posts to plain text rather than HTML. So as a related
exercise I am curious whether the above example prints OK on the
list......in case it doesn't arrive like that, the words hello and world
should have an umlaut (..) over each o.
John
As I understand it iterating over a string with Chars does get around the
problem of surrogate pairs
It depends what you mean by “get around the problem”.