[DUG] Upgrading to XE - Unicode strings questions
Todd
todd.martin.nz at gmail.com
Wed Nov 24 11:27:07 NZDT 2010
Hi John
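You can find out whether a Unicode string lies entirely inside the BMP by
converting it to UTF-32 and checking that the new string is twice the
byte length of the original (UTF-16) string. A character outside the BMP
takes four bytes in both encodings, so it breaks the 2:1 ratio.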
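
A minimal sketch of that check, written in Python rather than Delphi purely for illustration. The function name is_bmp_only is made up for this example, and the "-le" codecs are used so that no byte order mark skews the byte counts:

```python
def is_bmp_only(text: str) -> bool:
    """Return True if every code point in text is inside the BMP (U+0000..U+FFFF).

    is_bmp_only is a hypothetical helper name, not part of any library.
    """
    # UTF-16 uses 2 bytes per BMP code point and 4 bytes per surrogate pair.
    utf16 = text.encode("utf-16-le")
    # UTF-32 always uses 4 bytes per code point.
    utf32 = text.encode("utf-32-le")
    # The 2:1 byte ratio only holds when no surrogate pairs are present.
    return len(utf32) == 2 * len(utf16)


def is_bmp_only_direct(text: str) -> bool:
    """Same answer without converting: inspect each code point directly."""
    return all(ord(ch) <= 0xFFFF for ch in text)


if __name__ == "__main__":
    for sample in ("hello \u00f6", "clef \U0001D11E"):
        print(repr(sample), is_bmp_only(sample), is_bmp_only_direct(sample))
```

The direct scan gives the same result and avoids allocating a second copy of the string, so the conversion trick is mainly useful when the UTF-32 form is needed anyway.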
> A user could specifically choose to enter that character in either form - this is unlikely, yes. Or, two users using the same codepage could choose to enter the character differently.
>
> Or if your data is coming from two separate external sources.
>
> The *only* way to be sure is to normalise before processing.
>
Agreed. That will eliminate any issues with composite codepoints.
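
To illustrate, here is a small Python sketch (not Delphi) of normalising before comparing. The choice of NFC and the helper name same_text are assumptions for the example; any one normalisation form would do, as long as it is applied consistently to both sides:

```python
import unicodedata


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after bringing both to the same normalisation form.

    same_text is a hypothetical helper name used only for this sketch.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


if __name__ == "__main__":
    decomposed = "o\u0308"   # 'o' followed by a combining diaeresis
    precomposed = "\u00f6"   # single precomposed code point for the same character

    print(decomposed == precomposed)           # raw comparison sees two different sequences
    print(same_text(decomposed, precomposed))  # normalised comparison treats them as equal
```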
>> You only ever get issues if you cross codepage boundaries
>> (like for example if you have users in different countries
>> storing data in a database - which is why international
>> databases often use UTF-8 to store data instead of their
>> native charactersets).
>>
> This makes no sense at all to me.
>
> "ö" can be encoded as #$006F + #$0308 **OR** as #$00F6, even in UTF-8. Whether you encode using UTF-8, UTF-16 or UTF-32, a single accented character codepoint and a character followed by a diacritic are still two distinct "character" sequences.
>
True. I think the point is that UTF-8 is lossless and usually the most
compact encoding for predominantly Latin-script text, regardless of
whether the codepoints are composite or not.
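
A rough way to see both points, sketched in Python rather than Delphi: the two forms of "ö" stay different byte sequences in every encoding, and which encoding comes out smallest depends on the text. The helper name encoded_sizes and the sample strings are made up for the example:

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte length of text in UTF-8, UTF-16 and UTF-32 (no BOM).

    encoded_sizes is a hypothetical helper name used only for this sketch.
    """
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


if __name__ == "__main__":
    samples = {
        "precomposed o-umlaut": "\u00f6",   # one code point
        "decomposed o-umlaut": "o\u0308",   # base letter plus combining mark
        "Japanese sample": "\u65e5\u672c\u8a9e",
    }
    for label, text in samples.items():
        print(label, encoded_sizes(text))
```

For the precomposed form this gives 2, 2 and 4 bytes, and for the decomposed form 3, 4 and 8 bytes, so the encoding never hides the difference. The three-character Japanese sample takes 9 bytes in UTF-8 but only 6 in UTF-16, which is why the size advantage of UTF-8 depends on the script being stored.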
Todd.