<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40" xmlns:ns0="urn:schemas-microsoft-com:office:smarttags"><head><meta http-equiv=Content-Type content="text/html; charset=iso-8859-1"><meta name=Generator content="Microsoft Word 14 (filtered medium)"><style><!--
/* Font Definitions */
@font-face
        {font-family:Wingdings;
        panose-1:5 0 0 0 0 0 0 0 0 0;}
@font-face
        {font-family:"MS Mincho";
        panose-1:2 2 6 9 4 2 5 8 3 4;}
@font-face
        {font-family:"Arial Unicode MS";
        panose-1:2 11 6 4 2 2 2 2 2 4;}
@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
        {font-family:Tahoma;
        panose-1:2 11 6 4 3 5 4 4 2 4;}
@font-face
        {font-family:"\@MS Mincho";
        panose-1:2 2 6 9 4 2 5 8 3 4;}
@font-face
        {font-family:"\@Arial Unicode MS";
        panose-1:2 11 6 4 2 2 2 2 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0in;
        margin-bottom:.0001pt;
        font-size:12.0pt;
        font-family:"Times New Roman","serif";}
a:link, span.MsoHyperlink
        {mso-style-priority:99;
        color:blue;
        text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
        {mso-style-priority:99;
        color:purple;
        text-decoration:underline;}
p.MsoAcetate, li.MsoAcetate, div.MsoAcetate
        {mso-style-priority:99;
        mso-style-link:"Balloon Text Char";
        margin:0in;
        margin-bottom:.0001pt;
        font-size:8.0pt;
        font-family:"Tahoma","sans-serif";}
span.EmailStyle17
        {mso-style-type:personal;
        font-family:"Calibri","sans-serif";
        color:#1F497D;}
span.EmailStyle18
        {mso-style-type:personal-reply;
        font-family:"Arial","sans-serif";
        color:#1F497D;
        font-weight:normal;
        font-style:normal;}
span.BalloonTextChar
        {mso-style-name:"Balloon Text Char";
        mso-style-priority:99;
        mso-style-link:"Balloon Text";
        font-family:"Tahoma","sans-serif";}
.MsoChpDefault
        {mso-style-type:export-only;
        font-size:10.0pt;}
@page WordSection1
        {size:8.5in 11.0in;
        margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
        {page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]--></head><body lang=EN-US link=blue vlink=purple><div class=WordSection1><p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:#1F497D'>You are absolutely right: if you need to know the “real” number of characters, then UTF-32 is the way to go. I use the JEDI library for some advanced things; it includes a Unicode library that supports UTF-32/UCS-4 properly, together with helper functions that actually work correctly for things like changing uppercase/lowercase on those characters.<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:#1F497D'><o:p> </o:p></span></p><p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:#1F497D'>But for most people, the scripts/languages supported in the Basic Multilingual Plane (plane 0, i.e. the characters that fit into the first 64k range and hence have no problem being represented as UTF-16/UCS-2) will do just fine. Codepoints above the 64k range rarely turn up in real-world text; they are special cases, and for most applications it isn’t worth the trouble/effort to handle them. 
<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:#1F497D'><o:p> </o:p></span></p><div><p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:navy'><br></span><span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>Kind Regards,<br>Stefan Mueller</span><span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:#585757'> <br></span><span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:#666699'>_______________________<br>R&D Manager<br>ORCL Toolbox LLP, </span><span style='font-size:10.0pt;font-family:"Arial","sans-serif"'><ns0:place><ns0:country-region><ns0:country-region><ns0:place><span style='color:#666699'>Japan</span></ns0:place></ns0:country-region></ns0:country-region></ns0:place><span style='color:#666699'><br></span><span style='color:blue'><a href="http://www.orcl-toolbox.com/" title="blocked::http://www.orcl-toolbox.com/">http://www.orcl-toolbox.com</a></span><span style='color:#585757'> </span></span><span style='font-size:10.0pt;font-family:"Calibri","sans-serif";color:black'><o:p></o:p></span></p><p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:navy'> </span><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p></o:p></span></p></div><p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:#1F497D'><o:p> </o:p></span></p><div><div style='border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in'><p class=MsoNormal><b><span style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'>From:</span></b><span style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'> delphi-bounces@delphi.org.nz [mailto:delphi-bounces@delphi.org.nz] <b>On Behalf Of </b>Jolyon Smith<br><b>Sent:</b> Tuesday, November 23, 2010 11:07 AM<br><b>To:</b> 'NZ Borland Developers Group - Delphi List'<br><b>Subject:</b> Re: [DUG] Upgrading 
to XE - Unicode strings questions<o:p></o:p></span></p></div></div><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal><span lang=EN-NZ style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>Colin, the “for C in” loop and the “for i := 1 to Length()” loop are functionally identical! The only difference is that the “for in” version incurs the slight overhead of the enumerator framework invoked by the compiler and runtime magic to support that syntax.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-NZ style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-NZ style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>But in neither case will the loop itself help detect/respond to surrogate pairs (a single “WideChar” is potentially only ½ the data required to form a complete “<u>character</u>”). The only way to reduce iteration over a string to a simple char-wise loop, whether explicit or using enumerators, is to first convert to UTF-32, the facilities for which in the Delphi RTL are &lt;cough&gt; rudimentary, to put it politely. 
Non-existent may be nearer the mark.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-NZ style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-NZ style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>The precise mechanics of the loop construct used are not material to that problem.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-NZ style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-NZ style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-NZ style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>However, just as before Unicode, when most people didn’t care and just wrote code that assumed ANSI==ASCII, these days people won’t care and will write code that assumes Unicode==BMP (Basic Multilingual Plane), ignoring surrogate pairs just as they used to ignore extended ASCII and ANSI characters.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-NZ style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-NZ style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>And for most people, that will probably actually work.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-NZ style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-NZ style='font-size:11.0pt;font-family:Wingdings;color:#1F497D'>J</span><span lang=EN-NZ style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p></o:p></span></p><p class=MsoNormal><span lang=EN-NZ style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-NZ 
style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p> </o:p></span></p><div style='border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in'><p class=MsoNormal><b><span style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'>From:</span></b><span style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'> delphi-bounces@delphi.org.nz [mailto:delphi-bounces@delphi.org.nz] <b>On Behalf Of </b>Colin Johnsun<br><b>Sent:</b> Tuesday, 23 November 2010 14:31<br><b>To:</b> NZ Borland Developers Group - Delphi List<br><b>Subject:</b> Re: [DUG] Upgrading to XE - Unicode strings questions<o:p></o:p></span></p></div><p class=MsoNormal><span lang=EN-NZ><o:p> </o:p></span></p><p class=MsoNormal style='margin-bottom:12.0pt'><span lang=EN-NZ>I won't answer everything, just this one question:<o:p></o:p></span></p><div><p class=MsoNormal><span lang=EN-NZ>On 23 November 2010 11:04, John Bird &lt;<a href="mailto:johnkbird@paradise.net.nz">johnkbird@paradise.net.nz</a>&gt; wrote:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-NZ>Extra question:<br><br>It looks like code like<br><br>&nbsp;&nbsp;for i := 1 to Length(string1) do<br>&nbsp;&nbsp;begin<br>&nbsp;&nbsp;&nbsp;&nbsp;DoSomethingWithOneChar(string1[i]);<br>&nbsp;&nbsp;end;<br><br>cannot be used reliably. The problem is that Length(string1) looks like it cannot be safely used, as a Unicode character may take 2 code units, and Length(string1) highlights that there is a difference between the number of Unicode characters in a string and the number of code units. Still figuring out what the best practice is here, as I have quite a lot of string routines. 
Should be OK as long as the Unicode text is actually ASCII.<o:p></o:p></span></p><div><p class=MsoNormal><span lang=EN-NZ><o:p> </o:p></span></p></div><div><p class=MsoNormal><span lang=EN-NZ><o:p> </o:p></span></p></div><div><p class=MsoNormal><span lang=EN-NZ>You can use something like this:<o:p></o:p></span></p></div><div><p class=MsoNormal><span lang=EN-NZ><o:p> </o:p></span></p></div><div><p class=MsoNormal><span lang=EN-NZ>var<o:p></o:p></span></p></div><div><p class=MsoNormal><span lang=EN-NZ>&nbsp;&nbsp;C: Char;<o:p></o:p></span></p></div><div><p class=MsoNormal><span lang=EN-NZ>...<o:p></o:p></span></p></div><div><p class=MsoNormal><span lang=EN-NZ>&nbsp;&nbsp;for C in String1 do<o:p></o:p></span></p></div><div><p class=MsoNormal><span lang=EN-NZ>&nbsp;&nbsp;begin<o:p></o:p></span></p></div><div><p class=MsoNormal><span lang=EN-NZ>&nbsp;&nbsp;&nbsp;&nbsp;DoSomethingWithOneChar(C);<o:p></o:p></span></p></div><div><p class=MsoNormal><span lang=EN-NZ>&nbsp;&nbsp;end;<o:p></o:p></span></p></div><div><p class=MsoNormal><span lang=EN-NZ><o:p> </o:p></span></p></div><div><p class=MsoNormal><span lang=EN-NZ>In this case you don't need to know the index of each character; you just get the char using the for..in..do loop.<o:p></o:p></span></p></div><div><p class=MsoNormal><span lang=EN-NZ><o:p> </o:p></span></p></div><div><p class=MsoNormal><span lang=EN-NZ><o:p> </o:p></span></p></div><div><p class=MsoNormal><span lang=EN-NZ> <o:p></o:p></span></p></div></div></div></body></html>