[DUG] Upgrading to XE - Unicode strings questions

Tue Nov 23 14:32:45 NZDT 2010

Jolyon beat me to answering those questions, but here are my additional 2 cents:

Q1: Unicode strings store each character as a 2-byte WideChar. "Length" returns the number of characters (strictly, the number of WideChar elements), not the number of bytes allocated. Indexing the string with array syntax returns a WideChar instead of an AnsiChar. Your "DoSomethingWithOneChar" procedure will be called with a WideChar as input, but that probably won't cause any problems: WideChar is a superset of AnsiChar, so nothing is lost when going in that direction.
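
To make the Length-versus-size point concrete, here is a minimal console sketch. It assumes Delphi 2009 or later (where "string" is UnicodeString); the program name and the sample text are made up for illustration.

```pascal
program LengthVsBytes;
{$APPTYPE CONSOLE}

var
  S: string;      // UnicodeString in Delphi 2009 and later
  A: AnsiString;
begin
  S := 'Hello';
  A := 'Hello';

  // Length counts elements (WideChars or AnsiChars), not bytes.
  Writeln('Length(S) = ', Length(S));
  Writeln('Length(A) = ', Length(A));

  // The size of the character data is Length * element size.
  Writeln('Bytes in S = ', Length(S) * SizeOf(Char));
  Writeln('Bytes in A = ', Length(A) * SizeOf(AnsiChar));
end.
```

Both lengths are the same; only the byte count doubles for the Unicode string.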

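Q8: stringlist.loadfromfile will auto-detect the encoding by looking for a magic marker at the beginning of the file (the BOM, e.g. 0xEF 0xBB 0xBF for UTF8). As far as I can tell the detection relies on the BOM only, so a file without one is loaded as ANSI; pass an explicit encoding if you have UTF8 files without a BOM.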
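
A small sketch of that detection, assuming Delphi 2009 or later. The file name is hypothetical, and the program writes the file itself so that it is known to start with a UTF8 BOM before it is loaded again without an encoding argument.

```pascal
program LoadWithBom;
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

const
  FileName = 'bom_test.txt';   // hypothetical file name
  Sample   = 'caf'#$00E9;      // "cafe" with e-acute, written as a code point

var
  SL: TStringList;
begin
  SL := TStringList.Create;
  try
    SL.Add(Sample);
    SL.SaveToFile(FileName, TEncoding.UTF8);   // explicit encoding on write

    SL.Clear;
    SL.LoadFromFile(FileName);                 // no encoding: rely on detection
    Writeln('Detected correctly: ', SL[0] = Sample);

    SL.Clear;
    SL.LoadFromFile(FileName, TEncoding.UTF8); // explicit encoding on read
    Writeln('Explicit UTF8:      ', SL[0] = Sample);
  finally
    SL.Free;
  end;
end.
```

If you control both ends, being explicit on the read as well removes any dependence on the detection.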

Q11: inifiles: yes, these files will now have support for Unicode too. 

Q13: Unicode is shorthand for "character encoding of the universal character set", so it actually consists of two parts: the character set (about 109,000 characters are officially defined) and the various encoding formats that are used to represent those characters (utf8/utf16/utf32/ucs2/ucs4/etc). Windows started with UCS-2 (in Windows NT) and then switched to UTF16. UCS-2 only allowed 65,536 code points, so Microsoft had to switch to UTF-16 in newer Windows versions to support the full character set. This means that some weird and/or no longer used characters from dead/historic languages take up more than 2 bytes (the size of a widechar): they are stored as a pair of widechars. This isn't usually an issue when developing Unicode enabled applications ... unless your software needs to handle and display things like "cuneiform script" perfectly.
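
For the curious, a sketch of what "more than 2 bytes" looks like in code. It uses U+12000 (a cuneiform sign), which UTF16 stores as the surrogate pair $D808 $DC00. CountCodePoints is a hypothetical helper written for this mail, not an RTL routine, and it assumes Delphi 2009 or later.

```pascal
program SurrogateDemo;
{$APPTYPE CONSOLE}

// Hypothetical helper: counts code points by treating a high surrogate
// followed by a low surrogate as a single character.
function CountCodePoints(const S: string): Integer;
var
  I: Integer;
begin
  Result := 0;
  I := 1;
  while I <= Length(S) do
  begin
    if (S[I] >= #$D800) and (S[I] <= #$DBFF) and (I < Length(S)) and
       (S[I + 1] >= #$DC00) and (S[I + 1] <= #$DFFF) then
      Inc(I, 2)   // surrogate pair: two WideChars, one character
    else
      Inc(I);     // ordinary character: one WideChar
    Inc(Result);
  end;
end;

var
  S: string;
begin
  S := 'A' + #$D808#$DC00;   // 'A' followed by U+12000
  Writeln('WideChar elements: ', Length(S));          // 3
  Writeln('Code points:       ', CountCodePoints(S)); // 2
end.
```

For text that is known to stay within the usual living languages, the simple Length/index loop gives the same answer as this helper.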

Kind Regards,
Stefan Mueller 
_______________________
R&D Manager
ORCL Toolbox LLP, Japan
http://www.orcl-toolbox.com

From: delphi-bounces at delphi.org.nz [mailto:delphi-bounces at delphi.org.nz] On Behalf Of Jolyon Smith
Sent: Tuesday, November 23, 2010 9:40 AM
To: 'NZ Borland Developers Group - Delphi List'
Subject: Re: [DUG] Upgrading to XE - Unicode strings questions

I'm guessing my response to your previous email didn't come thru for some reason - resending:

I shall address some of your questions that I can answer quickly:

Q2 – With XE do the .pas and .dfm files become unicode text and hence cannot be read by earlier Delphi, eg D2007 any more?

I forget precisely which version of the IDE introduced the change, but the IDE has for some time supported different encodings for source/DFM files.  Certainly this was present in D2006 and it may even have been as far back as D7 or even earlier that it was introduced.

(Right click in source/dfm file and choose "File Format" from the context menu to see/change the file encoding)

Q3 – I do a lot of reading ascii data files, and writing back.   Using mainly TFilestream and stringlists.

With TFileStream you should be OK, as long as you read/write into ANSIString/ANSIChar buffers as you already surmised.

With TStringList you are forced to push your data through a Unicode/ANSI conversion when reading/writing from/to ANSI files, since the TStringList itself holds UnicodeString items.  You can do this using the new "Encoding" parameter to the relevant methods of the class to ensure you read/write the correct/expected encoding (reading should correctly detect the encoding, but when writing you will need to be explicit).

Q4 – if I do s2:=as1  does this convert ansistrings to unicode?

Q5 – if I do as1:=s2 does this convert a unicode string to ansistring?

Yes, but you will get a warning when going from Unicode to ANSI (since not all ANSI encodings will support the possible content of a Unicode string).  To avoid this, be explicit with the conversion.
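
"Being explicit" here simply means a typecast. A minimal sketch, assuming Delphi 2009 or later and reusing the variable names from the question:

```pascal
program ExplicitConversion;
{$APPTYPE CONSOLE}

var
  as1: AnsiString;
  s2: string;
begin
  as1 := 'plain text';

  // ANSI -> Unicode: nothing can be lost in this direction.
  s2 := string(as1);

  // Unicode -> ANSI: the cast tells the compiler the possible loss is
  // intended, so the implicit-conversion warning is not raised.
  as1 := AnsiString(s2);

  Writeln(s2);
  Writeln(as1);
end.
```

The cast does not make the conversion any safer: characters that the ANSI code page cannot represent are still lost. It only documents that you meant it.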

Q6 – I understand any code like

            char1:=string1[i];
            if char1 in ['a'..'z'] then
            begin
                    message:=string1[i]+' - character is lowercase';
            end

        will break.

Nope, it's fine.  But again, you will get a warning, in this case that the WIDECHAR has been reduced to a BYTE (NOTE: not converted to ANSICHAR) and a suggestion that you use CharInSet() instead.

Note however that CharInSet contains no real "magic" that makes sets work for > 255 elements - it merely provides a wrapper around code that will avoid the suggestion that you use CharInSet().  You can achieve the same effect by again simply being explicit that you know that what you are doing is intended and safe by reducing the WideChar to an ANSIChar yourself:

  if ANSICHAR(char1) in ['a'..'z'] then

To my mind this is preferable to using CharInSet() as it makes it clearer in the code what is going on (that non-ANSIChars are not expected and may not be handled as intended).  Using CharInSet() won't make any material difference to the behaviour of the code, but it would make it less apparent what is going on (i.e. that your code deals specifically with ANSI chars).

CharInSet() performs a test for the Char being (C < #$0100), but if your code is dealing with ANSI chars packaged in Unicode strings then this test is redundant, and using CharInSet() hides the intent of your code - to deal specifically with ANSI.
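
The two spellings side by side, as a minimal sketch assuming Delphi 2009 or later and a character that really is ASCII (the case discussed above):

```pascal
program CharInSetDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;   // CharInSet is declared here

var
  char1: Char;
begin
  char1 := 'q';

  // Option 1: the RTL wrapper suggested by the compiler.
  if CharInSet(char1, ['a'..'z']) then
    Writeln('CharInSet: lowercase');

  // Option 2: explicit reduction to AnsiChar, which states in the code
  // that only ANSI characters are expected here.
  if AnsiChar(char1) in ['a'..'z'] then
    Writeln('AnsiChar cast: lowercase');
end.
```

Both tests succeed for 'q'; which one you prefer is the matter of taste described above.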

That is just my preference however.  Ymmv.

Q7 – do literals like  #13#10 still mean carriage return and linefeed?  #9 means tab?

Yes.  But one thing to be aware of is that #nnn won't necessarily yield an ANSIChar(nnn).

Q8 – stringlist1.loadfromfile('Test1.txt');
        what happens if this file is ascii text being read into a stringlist which is unicode strings.

The stringlist will contain UnicodeStrings, converted from the ASCII file content that was loaded.

Q9 -   stringlist1.savetofile('Test1.txt')
         presumably this is no longer ascii text.

It won't be ASCII (but technically it never was :)), it will be ANSI unless you specify a different encoding.

Q9a - How do I save and read a stringlist to/from a file if it is to be Ansi text?

As you would have done before.  It is only if you want to save to something other than ANSI that you have to supply the Encoding parameter, for example to save as UTF8:

  strings.SaveToFile(filename, TEncoding.UTF8);
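
For the ANSI case, a round-trip sketch. It assumes Delphi 2009 or later on Windows, where TEncoding.Default is the system ANSI code page; the file name is hypothetical.

```pascal
program AnsiRoundTrip;
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

const
  AnsiFile = 'ansi_test.txt';   // hypothetical file name

var
  strings: TStringList;
begin
  strings := TStringList.Create;
  try
    strings.Add('plain ASCII line');

    // No encoding given: written as ANSI, as before.
    strings.SaveToFile(AnsiFile);

    // The same thing, spelled out.
    strings.SaveToFile(AnsiFile, TEncoding.Default);

    // Reading back, again stating the expected encoding explicitly.
    strings.Clear;
    strings.LoadFromFile(AnsiFile, TEncoding.Default);
    Writeln(strings[0]);
  finally
    strings.Free;
  end;
end.
```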

Q10 – If there are complexities in Q8 and Q9 is there a TAnsiStringlist type (for ansistrings) as well as a unicode TStringlist type?

NOPE!   A shocking omission imho.

Q12 – does Windows Notepad open unicode text files correctly?

Yep, and it's a handy tool for testing variations in encoding (in Notepad when you "Save As" you can choose the encoding: ANSI, UTF8, BE Unicode or LE Unicode; here "Unicode" = UTF16).

When you “Save As” a file that you previously opened, the default encoding selected will reflect the encoding of the file when it was opened.

Q13 - It looks like most programmers editors read and write ascii and unicode encoding.....the one I use seems to distinguish between UTF-8 and unicode as well – what is the difference?

UTF-8 *is* Unicode.

Unicode is a character set (technically it is more than that, but for the purposes of this explanation that definition will suffice).

UTF8/16/32 are different *encodings* for that character set.  For UTF16 and UTF32 there are also big-endian and little-endian variants.

As noted before, in Notepad, and possibly in other apps, the term “Unicode” denotes “UTF16”.

UTF32 is rarely encountered in the wild, which might explain why there is no TEncoding support for it (and indeed why Notepad doesn’t appear to support it).

As far as the difference between ASCII and UTF8 encoded Unicode goes:

An ASCII file can represent only characters 0..127 and each character is certain to occupy a single byte.

A UTF8 file can represent *EVERY* Unicode character, not just ASCII, but characters with codepoints > 127 will occupy 2 or more bytes.
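
A quick way to see those byte counts for yourself, as a sketch assuming Delphi 2009 or later. Utf8Size is a hypothetical helper around TEncoding.UTF8.GetByteCount, and the characters are written as code points so the example does not depend on the encoding of the source file.

```pascal
program Utf8Sizes;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: number of bytes S occupies when encoded as UTF8.
function Utf8Size(const S: string): Integer;
begin
  Result := TEncoding.UTF8.GetByteCount(S);
end;

begin
  Writeln('U+0041 (A):        ', Utf8Size('A'));            // 1 byte
  Writeln('U+00E9 (e-acute):  ', Utf8Size(#$00E9));         // 2 bytes
  Writeln('U+20AC (euro):     ', Utf8Size(#$20AC));         // 3 bytes
  Writeln('U+12000 (cuneif.): ', Utf8Size(#$D808#$DC00));   // 4 bytes
end.
```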

You may have spotted that for an ASCII file, ASCII and UTF8 encoding are physically indistinguishable at the character data level.  However, a UTF8 file that starts with a BOM (Byte Order Marker) can be told apart from an ASCII file that could be treated naively as UTF8 (or vice versa).  Note that the BOM is optional in UTF8, so a file without one may still be UTF8.

A BOM is a sequence of bytes that is prepended to a file (or stream) to indicate the Unicode encoding and identify the byte order for those encodings that have big/little endian variants.
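
The BOM bytes for the encodings that TEncoding supports can be listed with a short sketch like this one (assuming Delphi 2009 or later; BytesToHex is a hypothetical helper written for this mail):

```pascal
program ShowBoms;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: formats a byte array as space-separated hex pairs.
function BytesToHex(const B: TBytes): string;
var
  I: Integer;
begin
  Result := '';
  for I := 0 to High(B) do
    Result := Result + IntToHex(B[I], 2) + ' ';
end;

begin
  // GetPreamble returns the BOM that the encoding writes at the start of a file.
  Writeln('UTF8:     ', BytesToHex(TEncoding.UTF8.GetPreamble));             // EF BB BF
  Writeln('UTF16 LE: ', BytesToHex(TEncoding.Unicode.GetPreamble));          // FF FE
  Writeln('UTF16 BE: ', BytesToHex(TEncoding.BigEndianUnicode.GetPreamble)); // FE FF
end.
```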

I hope that all helps a little.

J

-----Original Message-----
From: delphi-bounces at delphi.org.nz [mailto:delphi-bounces at delphi.org.nz] On Behalf Of John Bird
Sent: Tuesday, 23 November 2010 13:04
To: NZ Borland Developers Group - Delphi List
Subject: Re: [DUG] Upgrading to XE - Unicode strings questions

Thanks for the references, so I can answer most of the questions now.

Here is what I understand so far, if anyone has anything to add this will be useful!

Extra question:

It looks like code like

    for i:=1 to length(string1) do
    begin
            DoSomethingWithOneChar(string1[i]);
    end;

cannot be used reliably.   The problems are that length(string1) looks like it cannot be safely used - as unicode characters may include 2 codepoints and length(string1) highlights that there is a difference between the number of unicode characters in a string and the number of codepoints.   Still figuring out what is the best practice here, as I have quite a lot of string routines.   Should be OK as long as the unicode text actually is ASCII.

Q2 – With XE do the .pas and .dfm files become unicode text and hence cannot be read by earlier Delphi, eg D2007 any more?

Answer - Is a project option from what I have read?, yes not portable if unicode.

Q3 – I do a lot of reading ascii data files, and writing back.   Using mainly TFilestream and stringlists.   Does this in general mean I will need to use file variables declared as Ansichar and AnsiString instead of Char and String?

(I would prefer to use the standard VCL where possible)

If I have variables

        as1:Ansistring;
        s2:string;

Q4 – if I do s2:=as1  does this convert ansistrings to unicode?

Answer - yes, there are performance issues to watch out for if conversion happens a lot.

Q5 – if I do as1:=s2 does this convert a unicode string to ansistring?
    (otherwise how do I do this?)

Answer - yes, there are performance issues to watch out for if conversion happens a lot.

Q6 – I understand any code like

            char1:=string1[i];
            if char1 in ['a'..'z'] then
            begin
                    message:=string1[i]+' - character is lowercase';
            end

        will break, as ansi characters are ordinal (less than 256 or 512) and set comparisons ['a'..'z']  or ['a','b','c']    can be used, this set code cannot be used for unicode characters.   What is the replacement?

Answer - There is CharInSet call and numerous extra housekeeping functions added in TCharacter.

Q7 – do literals like  #13#10 still mean carriage return and linefeed?  #9 means tab?

        if I have code like (logline string1 string2 are string)

        logline:=FormatDateTime('dd-mmm-yyyy hh:nn:ss',now) + string1 + #13#10+#9 + string2;
        ShowMessage(logline);
        Button1.hint:=logline;
        writeln(f,logline);

        these work D5-D2007   - ie a 2 line messagebox text, 2 line hint, and 2 lines written to a log file.

        is this still going to work?

        do carriage returns/tabs/other control characters have to be defined differently, eg as constants?

Answer - not figured out yet - anyone else know?

Q8 – stringlist1.loadfromfile('Test1.txt');
        what happens if this file is ascii text being read into a stringlist which is unicode strings.

Answer - Default is Ascii text for loadfromfile and savetofile, use overloaded routines for Unicode

Q9 -   stringlist1.savetofile('Test1.txt')
         presumably this is no longer ascii text.   How do I save and read a stringlist to/from a file if it is to be Ansi text?

Q10 – If there are complexities in Q8 and Q9 is there a TAnsiStringlist type (for ansistrings) as well as a unicode TStringlist type?
        (I use stringlists a lot)

Answer - unicodestring lists can save to ascii or unicode files, so TAnsiStringlist not needed.

Q11 – do inifiles become unicode too?

Answer - looks like no?  Not clear?  Anyone else know?

Q12 – does Windows Notepad open unicode text files correctly?   or can it only be used on Ansi text files?

Anyone know this?

Q13 - It looks like most programmers editors read and write ascii and unicode encoding.....the one I use seems to distinguish between UTF-8 and unicode as well – what is the difference?

Anyone know this?

John

_______________________________________________
NZ Borland Developers Group - Delphi mailing list
Post: delphi at delphi.org.nz
Admin: http://delphi.org.nz/mailman/listinfo/delphi
Unsubscribe: send an email to delphi-request at delphi.org.nz with Subject: unsubscribe

