[DUG] Upgrading to XE - Unicode strings questions

Tue Nov 23 16:12:11 NZDT 2010

As I understand it, iterating over a string with Chars does get around the problem of surrogate pairs: the character you are currently on might occupy 1, 2 or more bytes if it involves a surrogate pair, but it is still just one Unicode character.   So if you are after iterating over the characters in the string, your code should be perfect.

My question is: if you are not using   for C in String1 do   and want to use
for i:=1 to length(string1) do

what do you use instead of length to get the number of characters in the string in general?  length is not the number of characters; it is the number of 16-bit code units (the two halves of a surrogate pair are counted separately), if I understand correctly.
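
One possible way to get a character count that treats a surrogate pair as a single character is sketched below. It is only an illustration, not something posted in this thread: the names IsHighSurrogate, IsLowSurrogate and CodePointCount are made up for the example (the RTL may offer equivalent helpers), the tests are plain range checks on the UTF-16 surrogate ranges, and the sample string uses U+1D11E (a musical clef) because it lies outside the BMP. It counts code points only, so combining marks are still counted separately.

```pascal
program CountCodePoints;

{$APPTYPE CONSOLE}

// Hypothetical helpers: plain range checks on the UTF-16 surrogate ranges.
function IsHighSurrogate(C: WideChar): Boolean;
begin
  Result := (Ord(C) >= $D800) and (Ord(C) <= $DBFF);
end;

function IsLowSurrogate(C: WideChar): Boolean;
begin
  Result := (Ord(C) >= $DC00) and (Ord(C) <= $DFFF);
end;

// Counts code points: a well-formed surrogate pair counts once,
// every other code unit (including an unpaired surrogate) counts once.
function CodePointCount(const S: UnicodeString): Integer;
var
  I: Integer;
begin
  Result := 0;
  I := 1;
  while I <= Length(S) do
  begin
    if IsHighSurrogate(S[I]) and (I < Length(S)) and IsLowSurrogate(S[I + 1]) then
      Inc(I, 2)   // skip both halves of the pair
    else
      Inc(I);
    Inc(Result);
  end;
end;

var
  S: UnicodeString;
begin
  // 'A', then U+1D11E as the pair D834 DD1E, then 'B'
  S := 'A' + WideChar($D834) + WideChar($DD1E) + 'B';
  WriteLn(Length(S));          // 4 code units
  WriteLn(CodePointCount(S));  // 3 code points
end.
```

For text that stays inside the BMP the two numbers are always equal, so Length remains usable there.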

Separate issue - I understand that if one wants to iterate over the bytes of a string then one uses byte rather than char, and then one does have to investigate each byte to see if it is part of a surrogate pair.  There look to be routines for this – however I am guessing most won’t be needing to do this. Fortunately!

Also - I think getting what we used to call the ASCII value of a character, or creating a character from a code, still works the same. In fact, for the English alphabet the codes are the same, I understand?  Can someone confirm?   (I.e. the character might use 2 bytes when stored in a Unicode string, but the value stored for 'A' is still 41 hex or 65 decimal.)   Which means I think that one can do

var
  code1, code2: integer;
  char1: ansichar;
  char2: char;

    char1 := 'A';
    char2 := 'A';            // Unicode char, 2 bytes
    code1 := ord(char1);
    code2 := ord(char2);

In this case I think code1 = code2?  Can anyone confirm this?   Of course, once one goes beyond English/Latin (ISO 8859) characters this is no longer going to be true.

John

Doh! Thanks Jolyon for clearing up that misunderstanding on my part. I was aware of the surrogate pair issue but I wrongly assumed that this might have been taken care of by the iterator implementation. I guess not.

Thanks again!
Cheers,
Colin

On 23 November 2010 13:06, Jolyon Smith <jsmith at deltics.co.nz> wrote:

  Colin, the for C in loop and the for i := 1 to Length() loops are functionally identical!  The only difference is that the “for in” version incurs the slight overhead of the enumerator framework invoked by the compiler and runtime magic to support that syntax.

  But in neither case will the loop itself help detect/respond to surrogate pairs (a single “WideChar” is potentially only ½ the data required to form a complete “character”).  The only way to reduce an iterator over a string to a simple char-wise loop, whether explicit or using enumerators, is to first convert to UTF32, the facilities for which in the Delphi RTL are <cough> rudimentary, to put it politely.  Non-existent may be nearer the mark.
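
To make the "convert to UTF32 first" idea concrete, here is a minimal hand-rolled sketch. It is an assumption-laden illustration rather than an RTL facility: TCodePoints and DecodeUtf16 are names invented for the example, code points are held in plain Integers, and unpaired surrogates are simply passed through unchanged.

```pascal
program DecodeUtf16Demo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  TCodePoints = array of Integer;  // hypothetical type: one entry per code point

// Expands a UTF-16 string into one Integer per Unicode code point.
function DecodeUtf16(const S: UnicodeString): TCodePoints;
var
  I, Count, Value, Trail: Integer;
begin
  SetLength(Result, Length(S));  // upper bound; trimmed at the end
  Count := 0;
  I := 1;
  while I <= Length(S) do
  begin
    Value := Ord(S[I]);
    Inc(I);
    // A high surrogate followed by a low surrogate forms one code point.
    if (Value >= $D800) and (Value <= $DBFF) and (I <= Length(S)) then
    begin
      Trail := Ord(S[I]);
      if (Trail >= $DC00) and (Trail <= $DFFF) then
      begin
        Value := $10000 + ((Value - $D800) shl 10) + (Trail - $DC00);
        Inc(I);
      end;
    end;
    Result[Count] := Value;
    Inc(Count);
  end;
  SetLength(Result, Count);
end;

var
  S: UnicodeString;
  CodePoint: Integer;
begin
  S := 'A' + WideChar($D834) + WideChar($DD1E) + 'B';
  // One iteration per code point: 0041, 1D11E, 0042
  for CodePoint in DecodeUtf16(S) do
    WriteLn(IntToHex(CodePoint, 4));
end.
```

Once the string is in this form, either loop style visits exactly one element per code point.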

  The precise mechanics of the loop construct used are not material to that problem.
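
A small experiment along these lines shows both loop styles visiting the same 16-bit code units. This is a sketch only: UnitsByIndex and UnitsByEnumerator are names made up for the example, and the test string (an 'A' followed by U+1D11E as a surrogate pair) is an arbitrary choice.

```pascal
program LoopEquivalence;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hex dump of the code units visited by an indexed loop.
function UnitsByIndex(const S: UnicodeString): string;
var
  I: Integer;
begin
  Result := '';
  for I := 1 to Length(S) do
    Result := Result + IntToHex(Ord(S[I]), 4) + ' ';
end;

// Hex dump of the code units visited by a for..in loop.
function UnitsByEnumerator(const S: UnicodeString): string;
var
  C: WideChar;
begin
  Result := '';
  for C in S do
    Result := Result + IntToHex(Ord(C), 4) + ' ';
end;

var
  S: UnicodeString;
begin
  S := 'A' + WideChar($D834) + WideChar($DD1E);
  WriteLn(UnitsByIndex(S));       // 0041 D834 DD1E
  WriteLn(UnitsByEnumerator(S));  // 0041 D834 DD1E
end.
```

Both loops hand back the two halves of the pair as separate WideChar values.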

  However, just as before Unicode when most people didn’t care and just wrote code that assumed ANSI==ASCII, these days people won’t care and will write code that assumes that Unicode==BMP (Basic Multilingual Plane), ignoring surrogate pairs just as they used to ignore extended ASCII and ANSI characters.

  And for most people, that will probably actually work.
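
For code that does rely on the BMP-only assumption, one option is to check it explicitly at the point where text comes in. The sketch below is just one way this might look: ContainsSurrogates is a hypothetical helper, not an RTL routine, and it only tests for code units in the surrogate range $D800..$DFFF.

```pascal
program BmpGuard;

{$APPTYPE CONSOLE}

// True if S contains any UTF-16 surrogate code unit ($D800..$DFFF),
// i.e. if the text is not confined to the Basic Multilingual Plane.
function ContainsSurrogates(const S: UnicodeString): Boolean;
var
  I: Integer;
begin
  Result := False;
  for I := 1 to Length(S) do
    if (Ord(S[I]) >= $D800) and (Ord(S[I]) <= $DFFF) then
    begin
      Result := True;
      Exit;
    end;
end;

var
  Clef: UnicodeString;
begin
  Clef := WideChar($D834) + WideChar($DD1E);  // U+1D11E as a surrogate pair
  WriteLn(ContainsSurrogates('plain ASCII text'));  // FALSE
  WriteLn(ContainsSurrogates(Clef));                // TRUE
end.
```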

  J

  From: delphi-bounces at delphi.org.nz [mailto:delphi-bounces at delphi.org.nz] On Behalf Of Colin Johnsun
  Sent: Tuesday, 23 November 2010 14:31
  To: NZ Borland Developers Group - Delphi List

  Subject: Re: [DUG] Upgrading to XE - Unicode strings questions

  I won't answer everything but just on this one question:

  On 23 November 2010 11:04, John Bird <johnkbird at paradise.net.nz> wrote:

  Extra question:

  It looks like code like

     for i:=1 to length(string1) do
     begin
             DoSomethingWithOneChar(string1[i]);
     end;

  cannot be used reliably.   The problem is that length(string1) looks like
  it cannot be safely used: a Unicode character may take up 2 code units,
  and length(string1) highlights that there is a difference between the
  number of Unicode characters in a string and the number of code units.
  I am still figuring out what the best practice is here, as I have quite a
  lot of string routines.   It should be OK as long as the Unicode text
  actually is ASCII.

  you can use something like this:

  var
    C: Char;
  ...
    for C in String1 do
    begin
      DoSomethingWithOneChar(C);
    end;

  In this case you don't need to know the index of each character, you just get the char using the for..in..do loop.

  _______________________________________________
  NZ Borland Developers Group - Delphi mailing list
  Post: delphi at delphi.org.nz
  Admin: http://delphi.org.nz/mailman/listinfo/delphi
  Unsubscribe: send an email to delphi-request at delphi.org.nz with Subject: unsubscribe
