[DUG] Upgrading to XE - Unicode strings questions

Wed Nov 24 16:14:15 NZDT 2010

You should be fine - you just have to ensure you normalise the strings.

You're going to have to convert from UTF-8 to UTF-16 anyway to bring them into your Delphi app for processing, so you may as well normalise them in the process.
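
A minimal sketch of that decode-then-normalise step, written in Python purely for illustration (the thread gives no code, and a Delphi version would call whatever normalisation routine the platform provides). The name `decode_and_normalise` and the sample tag values are hypothetical, and NFC is only an assumed target form; any single form applied consistently should work.

```python
import unicodedata

def decode_and_normalise(raw: bytes) -> str:
    """Decode UTF-8 tag bytes, then normalise to NFC (assumed target form)."""
    text = raw.decode("utf-8")
    return unicodedata.normalize("NFC", text)

# Two hypothetical artist tags that look identical on screen:
precomposed = "Bj\u00f6rk".encode("utf-8")   # "ö" stored as U+00F6
decomposed = "Bjo\u0308rk".encode("utf-8")   # "o" followed by U+0308

# The raw bytes differ, so a naive comparison fails.
print(precomposed == decomposed)

# After decoding and normalising, the two tags compare equal.
print(decode_and_normalise(precomposed) == decode_and_normalise(decomposed))
```

Normalising once, at the point where the data enters the application, means every later search or comparison can work on plain string equality.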

UTF-16 was chosen in Delphi because it is also the "native" encoding in Windows itself.

-----Original Message-----
From: delphi-bounces at delphi.org.nz [mailto:delphi-bounces at delphi.org.nz] On Behalf Of Ross Levis
Sent: Wednesday, 24 November 2010 16:00
To: 'NZ Borland Developers Group - Delphi List'
Subject: Re: [DUG] Upgrading to XE - Unicode strings questions

It's a shame UTF-8 wasn't made the standard in Delphi.  It's commonly used in audio file tags, for example, which I have to deal with.

My software needs to search for songs with specific artists or titles, and it sounds like I'm going to have problems where the information is visually the same but entered differently in different parts of the world, using all sorts of 3rd party software.

Ross.

-----Original Message-----
From: delphi-bounces at delphi.org.nz [mailto:delphi-bounces at delphi.org.nz] On Behalf Of Todd
Sent: Wednesday, 24 November 2010 11:27 AM
To: NZ Borland Developers Group - Delphi List
Subject: Re: [DUG] Upgrading to XE - Unicode strings questions

Hi John

You can find out whether a Unicode string lies entirely inside the BMP by 
converting it to UTF-32 and checking that the new string takes up twice 
as many bytes as the original (UTF-16) string.
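
That check can be sketched as follows, in Python for illustration only. The helper name `is_bmp_only` is hypothetical, and the comparison is made on byte counts: a BMP character takes 2 bytes in UTF-16 and 4 in UTF-32, while a character outside the BMP takes 4 bytes in both.

```python
def is_bmp_only(text: str) -> bool:
    """Return True if every character of text lies in the BMP."""
    # The "-le" codecs are used so that no byte order mark is added.
    utf16_bytes = len(text.encode("utf-16-le"))
    utf32_bytes = len(text.encode("utf-32-le"))
    # Only an all-BMP string exactly doubles in size.
    return utf32_bytes == 2 * utf16_bytes

print(is_bmp_only("Bj\u00f6rk"))       # BMP characters only
print(is_bmp_only("a\U0001D11E"))      # includes U+1D11E, outside the BMP
```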
> A user could specifically choose to enter that character in either form - this is unlikely, yes.  Or, two users using the same codepage could choose to enter the character differently.
>
> Or if your data is coming from two separate external sources.
>
> The *only* way to be sure is to normalise before processing.
>    
Agreed. That will eliminate any issues with composite codepoints.
>> You only ever get issues if you cross codepage boundaries
>> (like for example if you have users in different countries
>> storing data in a database - which is why international
>> databases often use UTF-8 to store data instead of their
>> native charactersets).
>>      
> This makes no sense at all to me.
>
> "ö" can be encoded as #$006F + #$0308 **OR** as #$00F6, even in UTF-8.  Whether you encode using UTF-8, UTF-16 or UTF-32, a single accented character codepoint and a character followed by a diacritic are still two distinct "character" sequences.
>    
True. I think the point is that UTF-8 is usually the most compact lossless 
format for mostly Latin-script text, regardless of whether the codepoints 
are composite or not.
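
To make both points concrete, here is a small illustrative sketch in Python (not code from the thread) that encodes the two forms of "ö" in each encoding and records the byte counts. The variable names are hypothetical.

```python
composed = "\u00f6"       # "ö" as a single codepoint
decomposed = "o\u0308"    # "o" followed by a combining diaeresis

sizes = {}
for name in ("utf-8", "utf-16-le", "utf-32-le"):
    a = composed.encode(name)
    b = decomposed.encode(name)
    sizes[name] = (len(a), len(b))
    # The two forms never produce the same bytes, whatever the encoding.
    print(name, len(a), len(b), a == b)

# utf-8     2 3 False
# utf-16-le 2 4 False
# utf-32-le 4 8 False
```

The encoding changes only how many bytes each codepoint occupies; it does not make the two sequences equal. That still takes normalisation.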

Todd.

_______________________________________________
NZ Borland Developers Group - Delphi mailing list
Post: delphi at delphi.org.nz
Admin: http://delphi.org.nz/mailman/listinfo/delphi
Unsubscribe: send an email to delphi-request at delphi.org.nz with Subject: unsubscribe
