[DUG] Upgrading to XE - Unicode strings questions

Tue Nov 23 16:34:01 NZDT 2010

> As I understand it, iterating over a string with Chars does get around the
> problem of surrogate pairs

It depends what you mean by "get around the problem".

for c in s do WorkWith( c );

Will iterate once for each c (WIDECHAR) in s.  Some of those c's may be
halves of surrogate pairs, but you will get only one half of a pair at a
time.  So if your WorkWith() routine can simply ignore surrogate pairs then
yes, you got around the problem.   But if WorkWith() needs to work on
discrete codepoints beyond the BMP then you have some extra work to do
before you can call WorkWith(), and you must call it with a UTF32 parameter,
NOT a UTF16 WideChar (unless WorkWith() has some way of keeping track of
calls made to it and combining surrogates for itself, which is unlikely
I think).
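
As a rough sketch of that "extra work" (not code from this thread): the loop
below walks the WideChars and, whenever it meets a hi surrogate followed by a
lo surrogate, combines the two into one UTF32 value before calling WorkWith().
The names IsHi, IsLo, CombineSurrogates and ForEachCodePoint, the Cardinal
parameter and the body of WorkWith() are all made up for illustration, and an
unpaired surrogate is simply passed through unchanged.

```pascal
program SurrogateWalk;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils;

// Hypothetical helpers: test a WideChar against the surrogate ranges.
function IsHi(c: WideChar): Boolean;
begin
  Result := (Ord(c) >= $D800) and (Ord(c) <= $DBFF);
end;

function IsLo(c: WideChar): Boolean;
begin
  Result := (Ord(c) >= $DC00) and (Ord(c) <= $DFFF);
end;

// Hypothetical helper: combine the two halves of a surrogate pair
// into the UTF32 codepoint they represent.
function CombineSurrogates(hiChar, loChar: WideChar): Cardinal;
begin
  Result := $10000 + ((Cardinal(Ord(hiChar)) - $D800) shl 10)
                   + (Cardinal(Ord(loChar)) - $DC00);
end;

// Stand-in for the WorkWith() discussed above, taking a whole codepoint.
procedure WorkWith(cp: Cardinal);
begin
  WriteLn('U+', IntToHex(Int64(cp), 4));
end;

procedure ForEachCodePoint(const s: UnicodeString);
var
  i: Integer;
begin
  i := 1;
  while i <= Length(s) do
  begin
    if IsHi(s[i]) and (i < Length(s)) and IsLo(s[i + 1]) then
    begin
      // Valid pair: one call for both WideChars.
      WorkWith(CombineSurrogates(s[i], s[i + 1]));
      Inc(i, 2);
    end
    else
    begin
      // BMP char, or an unpaired surrogate passed through as-is.
      WorkWith(Ord(s[i]));
      Inc(i);
    end;
  end;
end;

begin
  // 'A', then U+1D11E stored as the pair D834 DD1E, then 'B'.
  ForEachCodePoint('A'#$D834#$DD1E'B');
  // prints U+0041, U+1D11E, U+0042
end.
```

Note that WorkWith() is called 3 times here although the string holds 4
WideChars.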

But crucially, for c in s is absolutely no different from:

for i := 1 to Length(s) do WorkWith( s[i] );

They do exactly the same thing - namely iterate over each widechar in the
string.
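
If you want to convince yourself of that, here is a small sketch (the
function names DumpForIn and DumpIndexed are invented for this example) that
dumps each WideChar as hex using both loop forms.  Both produce the same
result, with the two halves of the surrogate pair showing up as separate
items.

```pascal
program LoopsDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils;

// Hex dump using the for..in form.
function DumpForIn(const s: UnicodeString): string;
var
  c: WideChar;
begin
  Result := '';
  for c in s do
    Result := Result + IntToHex(Ord(c), 4) + ' ';
end;

// Hex dump using the indexed form.
function DumpIndexed(const s: UnicodeString): string;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(s) do
    Result := Result + IntToHex(Ord(s[i]), 4) + ' ';
end;

begin
  // 'A' followed by U+1D11E stored as the surrogate pair D834 DD1E.
  WriteLn(DumpForIn('A'#$D834#$DD1E));    // 0041 D834 DD1E
  WriteLn(DumpIndexed('A'#$D834#$DD1E));  // 0041 D834 DD1E
end.
```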

> as any character you are currently on might be either 1, 2 or more bytes if
> it contains surrogate pairs, but just one unicode character

This makes no sense.  *Every* character (WIDECHAR) that you "are on" will be
2 bytes.  No more.  No less.   The number of the bytes shall be 2, and 2
shall be the number.  What those 2 bytes represent may be either a complete
Unicode codepoint (in the BMP) or one half (the hi or the lo char) of a
surrogate pair, whose two halves must be combined to derive the codepoint
they represent.
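
To make the three cases concrete, a minimal sketch (Describe and ShowChars
are made-up names) that prints the size of a WideChar and then classifies
each WideChar of a string by the range its value falls in:

```pascal
program ClassifyDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils;

// Hypothetical helper: say what a single WideChar represents.
function Describe(c: WideChar): string;
begin
  case Ord(c) of
    $D800..$DBFF: Result := 'hi surrogate';
    $DC00..$DFFF: Result := 'lo surrogate';
  else
    Result := 'BMP codepoint';
  end;
end;

procedure ShowChars(const s: UnicodeString);
var
  c: WideChar;
begin
  for c in s do
    WriteLn(IntToHex(Ord(c), 4), ' ', Describe(c));
end;

begin
  WriteLn(SizeOf(WideChar));   // 2, always
  ShowChars('A'#$D834#$DD1E);
  // 0041 BMP codepoint
  // D834 hi surrogate
  // DD1E lo surrogate
end.
```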

> what do you use instead of length to get the number of characters
> in the string in general?

Length(s) returns the number of WIDEChars, i.e. the number of "n" for which
s[n] is valid.

> length is not the number of characters, it's the number of
> code-points (including surrogate pairs counted separately) if I
> understand correctly.

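Nope - you understand incorrectly.  :)  Length counts WideChars (UTF16 code
units), not codepoints: a codepoint beyond the BMP takes 2 WideChars and so
adds 2 to Length.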
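
If what you actually want is the number of codepoints, you have to count
them yourself.  A minimal sketch, assuming you are happy to count an
unpaired surrogate as one item (CountCodePoints is a made-up name, not an
RTL routine):

```pascal
program CountDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

// Hypothetical helper: count codepoints rather than WideChars.
function CountCodePoints(const s: UnicodeString): Integer;
var
  i: Integer;
begin
  Result := 0;
  i := 1;
  while i <= Length(s) do
  begin
    // A hi surrogate followed by a lo surrogate is ONE codepoint
    // occupying TWO WideChars, so step over both.
    if (Ord(s[i]) >= $D800) and (Ord(s[i]) <= $DBFF) and (i < Length(s))
       and (Ord(s[i + 1]) >= $DC00) and (Ord(s[i + 1]) <= $DFFF) then
      Inc(i, 2)
    else
      Inc(i);
    Inc(Result);
  end;
end;

procedure Demo;
var
  s: UnicodeString;
begin
  // 'A', then U+1D11E stored as the pair D834 DD1E, then 'B'.
  s := 'A'#$D834#$DD1E'B';
  WriteLn(Length(s));           // 4 WideChars
  WriteLn(CountCodePoints(s));  // 3 codepoints
end;

begin
  Demo;
end.
```

Bear in mind that a codepoint is still not necessarily what a user would call
a "character" (combining marks, for example), which this sketch does not
attempt to handle.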

> Separate issue - I understand that if one wants to iterate over the bytes of
> a string then one uses byte rather than char, and then one does have
> to investigate each byte to see if it is part of a surrogate pair.

No, checking for surrogate pairs is what you have to do with the WideChars
in a string.  You use bytes if you don't care about the characters at all
and simply want to work with the raw byte data.  Unlikely in the context of
the questions you are asking here, I would add.
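
For completeness, a sketch of what working with the raw bytes looks like
(HexBytes is an invented name).  It walks Length(s) * SizeOf(WideChar) bytes
through a PByte and pays no attention to what the bytes mean.  The byte order
shown in the comment assumes a little-endian machine, which is what Delphi XE
on Windows gives you.

```pascal
program RawBytes;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils;

// Hypothetical helper: hex dump of the raw bytes behind a string.
function HexBytes(const s: UnicodeString): string;
var
  p: PByte;
  i: Integer;
begin
  Result := '';
  if s = '' then
    Exit;
  p := PByte(PWideChar(s));
  // Two bytes per WideChar, whatever each WideChar represents.
  for i := 0 to Length(s) * SizeOf(WideChar) - 1 do
  begin
    Result := Result + IntToHex(p^, 2) + ' ';
    Inc(p);
  end;
end;

begin
  // Little-endian: the low byte of each WideChar comes first.
  WriteLn(HexBytes('A'#$D834));  // 41 00 34 D8
end.
```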
