[DUG] Upgrading to XE - Unicode strings questions
John Bird
johnkbird at paradise.net.nz
Tue Nov 23 16:12:11 NZDT 2010
As I understand it, iterating over a string with Chars does get around the problem of surrogate pairs, as any character you are currently on might be 1, 2 or more bytes if it contains surrogate pairs, but is still just one Unicode character. So if one is after iterating over the characters in the string, your code should be perfect.
My question is if you are not using for C in String1 do and want to use
for i:=1 to length(string1) do
what do you use instead of Length to get the number of characters in the string in general? Length is not the number of characters, it's the number of 16-bit code units (with the two halves of a surrogate pair counted separately), if I understand correctly.
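One possible answer, offered as a minimal sketch rather than an RTL facility: since Length returns the number of UTF-16 code units, a code-point count can be obtained by not counting the second half (the low surrogate, $DC00..$DFFF) of each pair. The name CodePointCount is hypothetical, and the sketch assumes Delphi 2009 or later (string = UnicodeString, 1-based indexing) and well-formed UTF-16 input.

```delphi
program CountCodePoints;
{$APPTYPE CONSOLE}

// Hypothetical helper: counts code points by skipping every low
// surrogate ($DC00..$DFFF), i.e. the second half of a surrogate pair.
// Assumes S is well-formed UTF-16.
function CodePointCount(const S: string): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) < $DC00) or (Ord(S[I]) > $DFFF) then
      Inc(Result);
end;

var
  S: string;
begin
  // U+1D11E (musical G clef) is stored as the pair $D834 $DD1E
  S := 'A'#$D834#$DD1E'B';
  Writeln(Length(S));         // 4 code units
  Writeln(CodePointCount(S)); // 3 code points
end.
```

Note that this counts code points, which is still not quite "characters" as a user sees them: a base letter followed by a combining accent is two code points.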
Separate issue: I understand that if one wants to iterate over the bytes of a string then one uses Byte rather than Char, and then one does have to investigate each byte to see if it is part of a surrogate pair. There look to be routines for this, but I am guessing most won't need to do this. Fortunately!
Also, I think getting what we used to call the ASCII value of a character, or creating a character, still works the same. In fact, for the English alphabet the codes are the same, I understand? Can someone confirm? (i.e. the character might use 2 bytes if encoded in a Unicode string, but the value stored for 'A' is still 41 hex or 65 decimal.) Which means I think that one can do
var
  code1, code2: Integer;
  char1: AnsiChar;
  char2: Char;
begin
  char1 := 'A';
  char2 := 'A'; // Unicode char, 2 bytes
  code1 := Ord(char1);
  code2 := Ord(char2);
end;
In this case I think code1 = code2? Can anyone confirm this? Of course, once one goes away from English/Latin (ISO 8859) characters this is no longer going to be true.
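A small console sketch of that check, assuming Delphi 2009 or later where Char is a 2-byte WideChar. The variable names are only illustrative.

```delphi
program OrdCompare;
{$APPTYPE CONSOLE}

var
  Char1: AnsiChar;
  Char2: Char;
begin
  Char1 := 'A';
  Char2 := 'A';
  Writeln(Ord(Char1) = Ord(Char2));           // TRUE, both are 65 ($41)
  Writeln(SizeOf(Char1), ' ', SizeOf(Char2)); // 1 2

  // Outside the 8-bit range there is no single-byte equivalent to compare:
  Char2 := #$20AC;     // euro sign, U+20AC
  Writeln(Ord(Char2)); // 8364
end.
```

The first 128 Unicode code points match ASCII, and the first 256 match ISO 8859-1, so the two Ord values agree across that range. They can differ for Windows-1252, which places other characters in $80..$9F (the euro sign is $80 there but U+20AC in Unicode).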
John
Doh! Thanks Jolyon for clearing up that misunderstanding on my part. I was aware of the surrogate pair issue but I wrongly assumed that this might have been taken care of by the iterator implementation. I guess not.
Thanks again!
Cheers,
Colin
On 23 November 2010 13:06, Jolyon Smith <jsmith at deltics.co.nz> wrote:
Colin, the for C in loop and the for i := 1 to Length() loops are functionally identical! The only difference is that the “for in” version incurs the slight overhead of the enumerator framework invoked by the compiler and runtime magic to support that syntax.
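A quick way to see this for oneself, as a sketch assuming Delphi 2009 or later: run both loop forms over a string that contains one surrogate pair and compare what each visits.

```delphi
program CompareLoops;
{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  S, ViaForIn, ViaIndex: string;
  C: Char;
  I: Integer;
begin
  // U+1D11E (musical G clef) is stored as the pair $D834 $DD1E
  S := 'A'#$D834#$DD1E'B';

  ViaForIn := '';
  for C in S do
    ViaForIn := ViaForIn + IntToHex(Ord(C), 4) + ' ';

  ViaIndex := '';
  for I := 1 to Length(S) do
    ViaIndex := ViaIndex + IntToHex(Ord(S[I]), 4) + ' ';

  Writeln(ViaForIn);            // 0041 D834 DD1E 0042
  Writeln(ViaForIn = ViaIndex); // TRUE
end.
```

Both loops see four 16-bit values, with the two surrogate halves delivered separately.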
But in neither case will the loop itself help detect/respond to surrogate pairs (a single “WideChar” is potentially only ½ the data required to form a complete “character”). The only way to reduce an iterator over a string to a simple char-wise loop, whether explicit or using enumerators, is to first convert to UTF32, the facilities for which in the Delphi RTL are <cough> rudimentary, to put it politely. Non-existent may be nearer the mark.
The precise mechanics of the loop construct used is not material to that problem.
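For anyone who does need a true character-wise walk, here is a minimal hand-rolled sketch of decoding UTF-16 to 32-bit code points on the fly. NextCodePoint is a hypothetical helper, not an RTL routine. It assumes Delphi 2009 or later and passes an unpaired surrogate through unchanged.

```delphi
program WalkCodePoints;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: returns the code point that starts at S[Index]
// and advances Index past it (by 2 for a surrogate pair, otherwise 1).
function NextCodePoint(const S: string; var Index: Integer): Integer;
var
  HighUnit, LowUnit: Integer;
begin
  HighUnit := Ord(S[Index]);
  Inc(Index);
  Result := HighUnit;
  if (HighUnit >= $D800) and (HighUnit <= $DBFF) and (Index <= Length(S)) then
  begin
    LowUnit := Ord(S[Index]);
    if (LowUnit >= $DC00) and (LowUnit <= $DFFF) then
    begin
      // Combine the two 10-bit halves and add the $10000 offset
      Result := $10000 + ((HighUnit - $D800) shl 10) + (LowUnit - $DC00);
      Inc(Index);
    end;
  end;
end;

var
  S: string;
  I: Integer;
begin
  S := 'A'#$D834#$DD1E'B';
  I := 1;
  while I <= Length(S) do
    Writeln(IntToHex(NextCodePoint(S, I), 6)); // 000041, 01D11E, 000042
end.
```

The loop body runs three times for a string whose Length is 4, which is the difference the indexed and for..in loops cannot show.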
However, just as before Unicode when most people didn’t care and just wrote code that assumed ANSI==ASCII, these days people won’t care and will write code that assumes that Unicode==BMP (Basic Multilingual Plane), ignoring surrogate pairs just as they used to ignore extended ASCII and ANSI characters.
And for most people, that will probably actually work.
J
From: delphi-bounces at delphi.org.nz [mailto:delphi-bounces at delphi.org.nz] On Behalf Of Colin Johnsun
Sent: Tuesday, 23 November 2010 14:31
To: NZ Borland Developers Group - Delphi List
Subject: Re: [DUG] Upgrading to XE - Unicode strings questions
I won't answer everything but just on this one question:
On 23 November 2010 11:04, John Bird <johnkbird at paradise.net.nz> wrote:
Extra question:
It looks like code like
for i := 1 to Length(string1) do
begin
  DoSomethingWithOneChar(string1[i]);
end;
cannot be used reliably. The problem is that length(string1) looks like
it cannot be safely used, as a Unicode character may take 2 code units,
and length(string1) highlights that there is a difference between the number
of Unicode characters in a string and the number of code units. Still
figuring out what is the best practice here, as I have quite a lot of string
routines. Should be OK as long as the Unicode text actually is ASCII.
you can use something like this:
var
  C: Char;
...
for C in String1 do
begin
  DoSomethingWithOneChar(C);
end;
In this case you don't need to know the index of each character, you just get the char using the for..in..do loop.
_______________________________________________
NZ Borland Developers Group - Delphi mailing list
Post: delphi at delphi.org.nz
Admin: http://delphi.org.nz/mailman/listinfo/delphi
Unsubscribe: send an email to delphi-request at delphi.org.nz with Subject: unsubscribe