[DUG] Upgrading to XE - Unicode strings questions

Wed Nov 24 09:14:10 NZDT 2010

> I think you are confusing Canonical & Normalized versions 
> of the same Unicode string (in the example s1 is canonical, 
> s2 is normalized) and the effect of local codepage conversion.

Yep, and for the record I think this is a big problem with the way Embarcadero implemented Unicode.

By pursuing the "Unicode is a no-brainer" approach (facilitating easy migration for ASCII apps) they have obscured the fact that Unicode is far from simple, or at least that doing it right is.

Danny Thorpe opined years ago that it made a lot of sense to do 64-bit and Unicode in one go as a big-bang breaking change, leaving the 32-bit, ANSI VCL product behind as a legacy platform.  Danny Thorpe always was a clever guy!  ;)

> The "ö" can be written as a compound #$006F + #$0308 in 
> canonical format ... and as #$00f6 in the "normalized" 
> format. For most normal applications it just doesn't really 
> matter either way because a user that is inputting text under 
> his local codepage will always do it the same way

A user could specifically choose to enter that character in either form - this is unlikely, yes.  Or, two users using the same codepage could choose to enter the character differently.

Or your data could be coming from two separate external sources that don't use the same form.

The *only* way to be sure is to normalise before processing.
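
To make that concrete, here is a minimal sketch. It is in Python rather than Delphi, purely because Python's standard library ships a normaliser (unicodedata.normalize) that anyone can run as-is. The helper name same_text and the choice of NFC as the target form are my own assumptions for illustration; in a Delphi app you would call whatever normalisation routine your platform provides.

```python
import unicodedata

# The two spellings of "ö" discussed in this thread.
decomposed = "o\u0308"   # U+006F followed by combining diaeresis U+0308
precomposed = "\u00f6"   # single code point U+00F6


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after bringing both to the same normalisation form.

    'same_text' is a hypothetical helper name; NFC is an assumed default.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# A plain comparison sees two different code point sequences.
raw_equal = decomposed == precomposed

# Normalising both sides first makes the comparison behave as users expect.
normalised_equal = same_text(decomposed, precomposed)
```

Here raw_equal comes out False and normalised_equal comes out True, which is the whole point: the comparison is only reliable once both inputs have been through the same normalisation.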

> You only ever get issues if you cross codepage boundaries 
> (like for example if you have users in different countries 
> storing data in a database - which is why international 
> databases often use UTF-8 to store data instead of their 
> native charactersets).

This makes no sense at all to me.

"ö" can be encoded as #$006F + #$0308 **OR** as #$00f6, even in UTF-8.  Whether you encode using UTF-8, UTF-16 or UTF-32, a single accented character codepoint and a base character followed by a combining diacritic are still two distinct "character" sequences.