<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=us-ascii"><meta name=Generator content="Microsoft Word 12 (filtered medium)"><style><!--
/* Font Definitions */
@font-face
        {font-family:Wingdings;
        panose-1:5 0 0 0 0 0 0 0 0 0;}
@font-face
        {font-family:"Cambria Math";
        panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0cm;
        margin-bottom:.0001pt;
        font-size:12.0pt;
        font-family:"Times New Roman","serif";}
a:link, span.MsoHyperlink
        {mso-style-priority:99;
        color:blue;
        text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
        {mso-style-priority:99;
        color:purple;
        text-decoration:underline;}
p
        {mso-style-priority:99;
        mso-margin-top-alt:auto;
        margin-right:0cm;
        mso-margin-bottom-alt:auto;
        margin-left:0cm;
        font-size:12.0pt;
        font-family:"Times New Roman","serif";}
p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph
        {mso-style-priority:34;
        margin-top:0cm;
        margin-right:0cm;
        margin-bottom:0cm;
        margin-left:36.0pt;
        margin-bottom:.0001pt;
        font-size:12.0pt;
        font-family:"Times New Roman","serif";}
span.EmailStyle18
        {mso-style-type:personal-reply;
        font-family:"Calibri","sans-serif";
        color:#1F497D;}
.MsoChpDefault
        {mso-style-type:export-only;
        font-size:10.0pt;}
@page WordSection1
        {size:612.0pt 792.0pt;
        margin:72.0pt 72.0pt 72.0pt 72.0pt;}
div.WordSection1
        {page:WordSection1;}
/* List Definitions */
@list l0
        {mso-list-id:386227144;
        mso-list-type:hybrid;
        mso-list-template-ids:-842378380 -49277090 336134147 336134149 336134145 336134147 336134149 336134145 336134147 336134149;}
@list l0:level1
        {mso-level-start-at:0;
        mso-level-number-format:bullet;
        mso-level-text:\F0D8;
        mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-18.0pt;
        font-family:Wingdings;
        mso-fareast-font-family:"Times New Roman";
        mso-bidi-font-family:Arial;
        color:#1F497D;}
ol
        {margin-bottom:0cm;}
ul
        {margin-bottom:0cm;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]--></head><body lang=EN-NZ link=blue vlink=purple><div class=WordSection1><div><div><div><p class=MsoListParagraph><span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>As I understand it iterating over a string with Chars does get around the problem of surrogate pairs<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>It depends what you mean by &#8220;get around the problem&#8221;.<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p>&nbsp;</o:p></span></p><p class=MsoNormal style='text-indent:36.0pt'><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>for c in string do WorkWith( c );<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>Will iterate once for each c (WIDECHAR) in s.&nbsp; Some of those c&#8217;s may be in surrogate pairs, but you will get only 1 of each half of each pair at a time.&nbsp; So if your WorkWith() routine simply ignores surrogate pairs then yes, you got around the problem.&nbsp; &nbsp;But if WorkWith() needs to work on discrete codepoints beyond the BMP then you have some extra work to do before you can call WorkWith(), and you must call it with a UTF32 parameter, NOT a UTF16 WideChar (unless WorkWith() has some way of keeping track of calls made to it, and doing the job of combining surrogates for itself &#8211; which is unlikely I think).<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>But crucially, <i>for c in s</i> is absolutely <b><u>no different</u></b> from:<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p>&nbsp;</o:p></span></p><p class=MsoNormal style='text-indent:36.0pt'><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>for i := 1 to Length(s) do WorkWith( s[i] );<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>They do exactly the same thing &#8211; namely iterate over each widechar in the string.<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p>&nbsp;</o:p></span></p><p class=MsoNormal style='text-indent:36.0pt'><span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>as any character you are currently on might be either 1,2 or more bytes if it contains </span><span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:#1F497D'><o:p></o:p></span></p><p class=MsoNormal style='text-indent:36.0pt'><span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>surrogate pairs, but just one unicode character</span><span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:#1F497D'><o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>This makes no sense.&nbsp; *<b>Every</b>* character (WIDECHAR) that you &#8220;are on&#8221; will be 2 bytes.&nbsp; No more.&nbsp; No Less.&nbsp; &nbsp;The number of the bytes shall be 2, and 2 shall be the number.&nbsp; What those 2 bytes <i>represent</i> may be either a complete Unicode codepoint (in the BMP) or one of either a hi/lo char in a surrogate pair, which must be combined to derive the codepoint they represent.<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p>&nbsp;</o:p></span></p></div><div><p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>&nbsp;</span><span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:#1F497D'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span style='font-family:"Calibri","sans-serif";color:black'>what do&nbsp; you use instead of length to get the number of characters in the string in general?</span><span style='font-family:"Calibri","sans-serif";color:#1F497D'><o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>Length(s) returns the number of WIDEChars. The number of &#8220;n&#8221; for which s[n] is valid.<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span style='font-family:"Calibri","sans-serif";color:#1F497D'><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span style='font-family:"Calibri","sans-serif";color:#1F497D'>&nbsp;&nbsp;&nbsp; </span><span style='font-family:"Calibri","sans-serif";color:black'>&nbsp; length is not the number of characters, its the umber of </span><span style='font-family:"Calibri","sans-serif";color:#1F497D'><o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style='font-family:"Calibri","sans-serif";color:black'>code-points (including surrogate pairs counted separately)&nbsp; if I </span><span style='font-family:"Calibri","sans-serif";color:#1F497D'><o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style='font-family:"Calibri","sans-serif";color:black'>understand correctly.</span><span style='font-family:"Calibri","sans-serif";color:#1F497D'><o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>Nope &#8211; you understand <i>in</i>correctly.&nbsp; </span><span style='font-size:11.0pt;font-family:Wingdings;color:#1F497D'>J</span><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p>&nbsp;</o:p></span></p></div><div><p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif";color:black'>&nbsp;<o:p></o:p></span></p></div><div><p class=MsoNormal style='text-indent:36.0pt'><span style='font-family:"Calibri","sans-serif";color:black'>Separate issue - I understand that if one wants to iterate over the bytes of a string </span><span style='font-family:"Calibri","sans-serif";color:#1F497D'><o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span style='font-family:"Calibri","sans-serif";color:black'>then one uses byte rather than char, and then one does have to investigate each byte </span><span style='font-family:"Calibri","sans-serif";color:#1F497D'><o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span style='font-family:"Calibri","sans-serif";color:black'>to see if it is part of a surrogate pair.</span><span style='font-family:"Calibri","sans-serif";color:#1F497D'><o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>No, this is what you have to do with WideChars in a string.&nbsp; You use bytes if you don&#8217;t care about the characters at all and simply want to work with the raw byte data.&nbsp; Unlikely in the context of the questions you are asking here, I would add.<o:p></o:p></span></p></div></div></div></div></body></html>