[DUG] Upgrading to XE - Unicode strings questions

Wed Nov 24 11:27:07 NZDT 2010

Hi John

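You can find out whether a Unicode string lies entirely inside the BMP
by converting it to UTF-32 and checking that the new string is exactly
twice the byte length of the original (UTF-16) string. Any character
outside the BMP takes four bytes in both encodings, so it makes the
UTF-32 string shorter than double.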
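As a rough illustration of that length check, here is a minimal sketch in Python rather than Delphi. The function name is_bmp_only is made up for this example, and the "-le" codecs are used only so that no byte order mark is counted in the lengths.

```python
def is_bmp_only(text: str) -> bool:
    """Return True if every character in text lies inside the BMP."""
    # The "-le" codec variants do not prepend a BOM, so the byte counts
    # reflect only the characters themselves.
    utf16 = text.encode("utf-16-le")  # 2 bytes per BMP char, 4 per non-BMP char
    utf32 = text.encode("utf-32-le")  # always 4 bytes per character
    # All-BMP text doubles in size; a surrogate pair does not grow at all.
    return len(utf32) == 2 * len(utf16)


if __name__ == "__main__":
    print(is_bmp_only("h\u00f6he"))       # BMP characters only
    print(is_bmp_only("a\U0001D11E"))     # U+1D11E is outside the BMP
```

In Delphi the same idea would apply to the byte lengths of the UTF-16 string and its UTF-32 conversion; the exact conversion routine is left open here.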
> A user could specifically choose to enter that character in either form - this is unlikely, yes.  Or, two users using the same codepage could choose to enter the character differently.
>
> Or if your data is coming from two separate external sources.
>
> The *only* way to be sure is to normalise before processing.
>    
Agreed. That will eliminate any issues with composite codepoints.
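To make the normalisation step concrete, here is a small Python sketch (not Delphi) using the standard unicodedata module. The choice of NFC and the helper name same_text are assumptions made for this example; NFD would work just as well as long as both sides use the same form.

```python
import unicodedata

# The two spellings of "ö" from this thread.
decomposed = "\u006F\u0308"   # "o" followed by COMBINING DIAERESIS
precomposed = "\u00F6"        # LATIN SMALL LETTER O WITH DIAERESIS


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalising both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    print(decomposed == precomposed)           # raw comparison
    print(same_text(decomposed, precomposed))  # comparison after normalising
```

On Windows, the Win32 NormalizeString function offers the same kind of conversion and should be callable from Delphi, though I have not checked how XE declares it.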
>> You only ever get issues if you cross codepage boundaries
>> (like for example if you have users in different countries
>> storing data in a database - which is why international
>> databases often use UTF-8 to store data instead of their
>> native charactersets).
>>      
> This makes no sense at all to me.
>
> "ö" encoded as #$006F + #$0308 **OR** #$00f6 even in UTF-8.  Whether you encode using UTF-8, UTF-16 or UTF-32, a single accented character codepoint vs a character followed by a diacritic are still two distinct "character" sequences.
>    
True. I think the point is that UTF-8 is lossless and usually the most
compact encoding for mostly-Latin text (UTF-16 can be smaller for, say,
CJK text), regardless of whether the codepoints are composite or not.
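For what it is worth, a quick Python sketch (again not Delphi) shows how the size depends on the text, and that the composed and decomposed forms stay distinct byte sequences in UTF-8. The helper name encoded_sizes is invented for this example.

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded byte length of text in each UTF form (no BOM)."""
    return {name: len(text.encode(name))
            for name in ("utf-8", "utf-16-le", "utf-32-le")}


if __name__ == "__main__":
    print(encoded_sizes("hello"))               # ASCII-only text
    print(encoded_sizes("\u65E5\u672C\u8A9E"))  # three CJK characters
    # The two spellings of "ö" remain different after encoding.
    print("\u00F6".encode("utf-8"))
    print("o\u0308".encode("utf-8"))
```

So the encoding decides only how many bytes each codepoint takes; it does nothing about which codepoints were chosen to spell the character.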

Todd.