<HTML><HEAD></HEAD>
<BODY dir=ltr>
<DIV dir=ltr>
<DIV style="FONT-FAMILY: 'Arial'; COLOR: #000000; FONT-SIZE: 10pt">
<DIV>As I understand it, iterating over a string with Chars does get around the
problem of surrogate pairs: whichever character you are currently on might
occupy 1, 2 or more bytes if it involves surrogate pairs, but it is still just
one Unicode character. So if all one wants is to iterate over the characters in
the string, your code should be perfect.</DIV>
<DIV> </DIV>
<DIV>My question is if you are not using <FONT face=Calibri><FONT
style="FONT-SIZE: 12pt"> for C in String1 do and want to use
</FONT></FONT></DIV>
<DIV><FONT face=Calibri><FONT style="FONT-SIZE: 12pt">for i:=1 to
length(string1) do</FONT></FONT></DIV>
<DIV><FONT size=3 face=Calibri></FONT> </DIV>
<DIV><FONT size=3 face=Calibri>what do you use instead of Length to get
the number of characters in the string in general? Length is not the
number of characters; it's the number of code units (with the two halves of a
surrogate pair counted separately), if I understand correctly.</FONT></DIV>
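<DIV><FONT size=3 face=Calibri>One possible answer, as a minimal sketch
assuming Delphi 2009 or later (where string is a UTF-16 UnicodeString) and
well-formed input: count every element except the trailing half of a surrogate
pair. The names IsTrailSurrogate and CountCodePoints are made up for this
example and are not RTL routines. Note that this counts code points, which can
still differ from what a user perceives as characters (combining marks, for
instance).</FONT></DIV>
<PRE>
program CountDemo;
{$APPTYPE CONSOLE}

// Hypothetical helper: True for the second (trailing) half of a surrogate pair.
function IsTrailSurrogate(C: Char): Boolean;
begin
  case Ord(C) of
    $DC00..$DFFF: Result := True;
  else
    Result := False;
  end;
end;

// Hypothetical helper: counts code points by skipping trailing surrogates.
function CountCodePoints(const S: string): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if not IsTrailSurrogate(S[I]) then
      Inc(Result);
end;

var
  S: string;
begin
  // U+1F600 lies outside the BMP and is stored as the pair D83D DE00.
  S := 'a' + #$D83D + #$DE00 + 'b';
  Writeln(Length(S));          // 4 UTF-16 elements
  Writeln(CountCodePoints(S)); // 3 code points
end.
</PRE>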
<DIV><FONT size=3 face=Calibri></FONT> </DIV>
<DIV><FONT size=3 face=Calibri>Separate issue - I understand that if one wants
to iterate over the bytes of a string then one uses byte rather than char, and
then one does have to investigate each byte to see if it is part of a surrogate
pair. There look to be routines for this – however I am guessing most
won’t be needing to do this. Fortunately!</FONT></DIV>
<DIV><FONT size=3 face=Calibri></FONT> </DIV>
<DIV><FONT size=3 face=Calibri></FONT> </DIV>
<DIV><FONT size=3 face=Calibri>Also – I think getting what we used to call
the ASCII value of a character, or creating a character from a code, still
works the same; in fact for the English alphabet the codes are the same, I
understand? Can someone confirm? (I.e. the character might use 2
bytes when stored in a Unicode string, but the value stored for 'A' is still 41
hex or 65 decimal.) Which means I think that one can do</FONT></DIV>
<DIV><FONT size=3 face=Calibri></FONT> </DIV>
<DIV><FONT size=3 face=Calibri></FONT> </DIV>
<DIV><FONT size=3 face=Calibri>code1,code2:integer;</FONT></DIV>
<DIV><FONT size=3 face=Calibri>char1:ansichar;</FONT></DIV>
<DIV><FONT size=3 face=Calibri>char2:char;</FONT></DIV>
<DIV><FONT size=3 face=Calibri></FONT> </DIV>
<DIV> <FONT size=3 face=Calibri>char1:='A';</FONT></DIV>
<DIV> <FONT size=3
face=Calibri>char2:='A';
//unicode char 2 bytes</FONT></DIV>
<DIV> <FONT size=3
face=Calibri>code1:=ord(char1);</FONT></DIV>
<DIV> <FONT size=3
face=Calibri>code2:=ord(char2);</FONT></DIV>
<DIV><FONT size=3 face=Calibri></FONT> </DIV>
<DIV><FONT size=3 face=Calibri>In this case I think code1 = code2. Can anyone
confirm this? Of course, once one moves beyond the English/Latin (ISO 8859)
characters this is no longer going to be true.</FONT></DIV>
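<DIV><FONT size=3 face=Calibri>A small console sketch to check this, assuming
Delphi 2009 or later, where Char is a 2-byte WideChar. For the ASCII range
(0..127) the two values agree; above that, the AnsiChar value depends on the
active ANSI code page, so agreement is not guaranteed.</FONT></DIV>
<PRE>
program OrdDemo;
{$APPTYPE CONSOLE}

var
  code1, code2: Integer;
  char1: AnsiChar;
  char2: Char; // WideChar (2 bytes) in Delphi 2009 and later

begin
  char1 := 'A';
  char2 := 'A';
  code1 := Ord(char1);
  code2 := Ord(char2);
  Writeln(code1, ' ', code2);                 // 65 65
  Writeln(SizeOf(char1), ' ', SizeOf(char2)); // 1 2
end.
</PRE>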
<DIV><BR></DIV>
<DIV> </DIV>
<DIV style="FONT-FAMILY: 'Arial'; COLOR: #000000; FONT-SIZE: 10pt">John</DIV>
<DIV style="FONT-FAMILY: 'Arial'; COLOR: #000000; FONT-SIZE: 10pt">
<DIV
style="FONT-STYLE: normal; DISPLAY: inline; FONT-FAMILY: 'Calibri'; COLOR: #000000; FONT-SIZE: small; FONT-WEIGHT: normal; TEXT-DECORATION: none"><FONT
size=2 face=Arial></FONT></DIV> </DIV>
<DIV
style="FONT-STYLE: normal; DISPLAY: inline; FONT-FAMILY: 'Calibri'; COLOR: #000000; FONT-SIZE: small; FONT-WEIGHT: normal; TEXT-DECORATION: none">Doh!
Thanks Jolyon for clearing up that misunderstanding on my part. I was aware of
the surrogate pair issue but I wrongly assumed that it would be taken care of
by the iterator implementation. I guess not.
<DIV> </DIV>
<DIV>Thanks again!</DIV>
<DIV>Cheers,</DIV>
<DIV>Colin</DIV>
<DIV>
<DIV> </DIV>
<DIV class=gmail_quote>On 23 November 2010 13:06, Jolyon Smith <SPAN
dir=ltr>&lt;<A
href="mailto:jsmith@deltics.co.nz">jsmith@deltics.co.nz</A>&gt;</SPAN>
wrote:<BR>
<BLOCKQUOTE
style="BORDER-LEFT: #ccc 1px solid; MARGIN: 0px 0px 0px 0.8ex; PADDING-LEFT: 1ex"
class=gmail_quote>
<DIV lang=EN-NZ vlink="purple" link="blue">
<DIV>
<P class=MsoNormal><SPAN style="COLOR: #1f497d; FONT-SIZE: 11pt">Colin, the
for C in loop and the for i := 1 to Length() loops are functionally
identical! The only difference is that the “for in” version incurs the
slight overhead of the enumerator framework invoked by the compiler and
runtime magic to support that syntax.</SPAN></P>
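<P class=MsoNormal><SPAN style="COLOR: #1f497d; FONT-SIZE: 11pt">A minimal
sketch of that equivalence, assuming Delphi 2009 or later: both loops visit the
same four UTF-16 elements of a string that holds three code points, so neither
treats the surrogate pair as one character.</SPAN></P>
<PRE>
program LoopDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  S: string;
  C: Char;
  I: Integer;

begin
  // U+1F600 lies outside the BMP and is stored as the pair D83D DE00.
  S := 'a' + #$D83D + #$DE00 + 'b';

  for C in S do
    Write(IntToHex(Ord(C), 4), ' ');
  Writeln; // 0061 D83D DE00 0062

  for I := 1 to Length(S) do
    Write(IntToHex(Ord(S[I]), 4), ' ');
  Writeln; // 0061 D83D DE00 0062
end.
</PRE>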
<P class=MsoNormal><SPAN
style="COLOR: #1f497d; FONT-SIZE: 11pt"></SPAN> </P>
<P class=MsoNormal><SPAN style="COLOR: #1f497d; FONT-SIZE: 11pt">But in
neither case will the loop itself help detect/respond to surrogate pairs (a
single “WideChar” is potentially only ½ the data required to form a complete
“<U>character</U>”). The only way to reduce an iterator over a string to
a simple char-wise loop, whether explicit or using enumerators, is to first
convert to UTF32, the facilities for which in the Delphi RTL are &lt;cough&gt;
rudimentary, to put it politely. Non-existent may be nearer the
mark.</SPAN></P>
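<P class=MsoNormal><SPAN style="COLOR: #1f497d; FONT-SIZE: 11pt">As a
hand-rolled illustration only, assuming Delphi 2009 or later, a string can be
walked one code point at a time by combining surrogate pairs manually.
NextCodePoint is a hypothetical helper written for this sketch, not an RTL
routine; an unpaired surrogate is simply returned unchanged.</SPAN></P>
<PRE>
program CodePointWalk;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: returns the code point that starts at S[Index] and
// advances Index past it (by one or two UTF-16 elements).
function NextCodePoint(const S: string; var Index: Integer): Cardinal;
var
  Lead, Trail: Cardinal;
begin
  Lead := Ord(S[Index]);
  Inc(Index);
  Result := Lead;
  if (Lead and $FC00) = $D800 then        // leading (high) surrogate
    if Length(S) >= Index then
    begin
      Trail := Ord(S[Index]);
      if (Trail and $FC00) = $DC00 then   // trailing (low) surrogate
      begin
        Result := $10000 + ((Lead - $D800) shl 10) + (Trail - $DC00);
        Inc(Index);
      end;
    end;
end;

var
  S: string;
  I: Integer;
  CP: Cardinal;

begin
  S := 'a' + #$D83D + #$DE00 + 'b';
  I := 1;
  while Length(S) >= I do
  begin
    CP := NextCodePoint(S, I);
    Writeln(IntToHex(Int64(CP), 6)); // 000061, then 01F600, then 000062
  end;
end.
</PRE>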
<P class=MsoNormal><SPAN
style="COLOR: #1f497d; FONT-SIZE: 11pt"></SPAN> </P>
<P class=MsoNormal><SPAN style="COLOR: #1f497d; FONT-SIZE: 11pt">The precise
mechanics of the loop construct used are not material to that
problem.</SPAN></P>
<P class=MsoNormal><SPAN
style="COLOR: #1f497d; FONT-SIZE: 11pt"></SPAN> </P>
<P class=MsoNormal><SPAN
style="COLOR: #1f497d; FONT-SIZE: 11pt"></SPAN> </P>
<P class=MsoNormal><SPAN style="COLOR: #1f497d; FONT-SIZE: 11pt">However, just
as before Unicode when most people didn’t care and just wrote code that
assumed ANSI==ASCII, these days people won’t care and will write code that
assumes that Unicode==BMP (Basic Multilingual Plane), ignoring surrogate pairs
just as they used to ignore extended ASCII and ANSI characters.</SPAN></P>
<P class=MsoNormal><SPAN
style="COLOR: #1f497d; FONT-SIZE: 11pt"></SPAN> </P>
<P class=MsoNormal><SPAN style="COLOR: #1f497d; FONT-SIZE: 11pt">And for most
people, that will probably actually work.</SPAN></P>
<P class=MsoNormal><SPAN
style="COLOR: #1f497d; FONT-SIZE: 11pt"></SPAN> </P>
<P class=MsoNormal><SPAN
style="COLOR: #1f497d; FONT-SIZE: 11pt">:-)</SPAN></P>
<P class=MsoNormal><SPAN
style="COLOR: #1f497d; FONT-SIZE: 11pt"></SPAN> </P>
<P class=MsoNormal><SPAN
style="COLOR: #1f497d; FONT-SIZE: 11pt"></SPAN> </P>
<DIV
style="BORDER-BOTTOM: medium none; BORDER-LEFT: medium none; PADDING-BOTTOM: 0cm; PADDING-LEFT: 0cm; PADDING-RIGHT: 0cm; BORDER-TOP: #b5c4df 1pt solid; BORDER-RIGHT: medium none; PADDING-TOP: 3pt">
<P class=MsoNormal><B><SPAN style="FONT-SIZE: 10pt"
lang=EN-US>From:</SPAN></B><SPAN style="FONT-SIZE: 10pt" lang=EN-US> <A
href="mailto:delphi-bounces@delphi.org.nz"
target=_blank>delphi-bounces@delphi.org.nz</A> [mailto:<A
href="mailto:delphi-bounces@delphi.org.nz"
target=_blank>delphi-bounces@delphi.org.nz</A>] <B>On Behalf Of </B>Colin
Johnsun<BR><B>Sent:</B> Tuesday, 23 November 2010 14:31<BR><B>To:</B> NZ
Borland Developers Group - Delphi List</SPAN></P>
<DIV class=im><BR><B>Subject:</B> Re: [DUG] Upgrading to XE - Unicode strings
questions</DIV></DIV>
<P class=MsoNormal> </P>
<P style="MARGIN-BOTTOM: 12pt" class=MsoNormal>I won't answer everything but
just on this one question:</P>
<DIV>
<DIV></DIV>
<DIV class=h5>
<DIV>
<P class=MsoNormal>On 23 November 2010 11:04, John Bird &lt;<A
href="mailto:johnkbird@paradise.net.nz"
target=_blank>johnkbird@paradise.net.nz</A>&gt; wrote:</P>
<P class=MsoNormal>Extra question:<BR><BR>It looks like code
like<BR><BR> for i:=1 to length(string1) do<BR>
begin<BR>
DoSomethingWithOneChar(string1[i]);<BR> end;<BR><BR>cannot be used
reliably. The problems are that length(string1) looks like<BR>it
cannot be safely used - as unicode characters may include 2 codepoints<BR>and
length(string1) highlights that there is a difference between the number<BR>of
unicode characters in a string and the number of codepoints.
Still<BR>figuring out what is the best practice here, as I have quite a lot of
string<BR>routines. Should be be OK as long as the unicode text
actually is ASCII.</P>
<DIV>
<P class=MsoNormal> </P></DIV>
<DIV>
<P class=MsoNormal> </P></DIV>
<DIV>
<P class=MsoNormal>you can use something like this:</P></DIV>
<DIV>
<P class=MsoNormal> </P></DIV>
<DIV>
<P class=MsoNormal>var</P></DIV>
<DIV>
<P class=MsoNormal> C: Char;</P></DIV>
<DIV>
<P class=MsoNormal>...</P></DIV>
<DIV>
<P class=MsoNormal> for C in String1 do</P></DIV>
<DIV>
<P class=MsoNormal> begin</P></DIV>
<DIV>
<P class=MsoNormal> DoSomethingWithOneChar(C);</P></DIV>
<DIV>
<P class=MsoNormal> end;</P></DIV>
<DIV>
<P class=MsoNormal> </P></DIV>
<DIV>
<P class=MsoNormal>In this case you don't need to know the index of each
character, you just get the char using the for..in..do loop.</P></DIV>
<DIV>
<P class=MsoNormal> </P></DIV>
<DIV>
<P class=MsoNormal> </P></DIV>
<DIV>
<P
class=MsoNormal> </P></DIV></DIV></DIV></DIV></DIV></DIV><BR>_______________________________________________<BR>NZ
Borland Developers Group - Delphi mailing list<BR>Post: <A
href="mailto:delphi@delphi.org.nz">delphi@delphi.org.nz</A><BR>Admin: <A
href="http://delphi.org.nz/mailman/listinfo/delphi"
target=_blank>http://delphi.org.nz/mailman/listinfo/delphi</A><BR>Unsubscribe:
send an email to <A
href="mailto:delphi-request@delphi.org.nz">delphi-request@delphi.org.nz</A>
with Subject: unsubscribe<BR></BLOCKQUOTE></DIV>
<DIV> </DIV></DIV>
</DIV></DIV></DIV></BODY></HTML>