[DUG] Upgrading to XE - Unicode strings questions

Wed Nov 24 12:29:17 NZDT 2010

> You can find out whether a unicode string is inside the BMP 
> by converting it to UTF-32

No need to go to that trouble, just test for surrogates:

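uses Character;

  // UTF-16 encodes every non-BMP character as a surrogate pair,
  // so finding a single surrogate is enough.
  hasNonBMP := False;
  for i := 1 to Length(s) do
    if IsSurrogate(s[i]) then
    begin
      hasNonBMP := True;  // s contains non-BMP characters
      Break;
    end;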

> I think the point is that UTF-8 is the most compact format without
> data loss, regardless of whether the codepoints are composite or not.

The point *seemed* to be that UTF-8 somehow avoided problems with composite characters. That is simply not the case, and I wanted to clarify that point.

As for being the most compact: if your data is primarily ASCII in nature then yes, UTF-8 is the most compact, but if it isn't then UTF-16 could easily be more compact. Most CJK characters, for example, take 3 bytes in UTF-8 but only 2 in UTF-16. It all depends on the data. There is no absolute rule in that regard.
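
To make the size comparison concrete, here is a minimal sketch that prints the encoded byte counts for an ASCII string and a short CJK string. It assumes a Unicode Delphi (2009 or later) where SysUtils provides TEncoding; the Report procedure and the sample strings are made up for illustration only.

```pascal
program CompareSizes;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: prints how many bytes s needs in each encoding.
procedure Report(const desc, s: string);
begin
  Writeln(desc, ': UTF-8 = ', TEncoding.UTF8.GetByteCount(s),
          ' bytes, UTF-16 = ', TEncoding.Unicode.GetByteCount(s), ' bytes');
end;

begin
  // Five ASCII characters: 1 byte each in UTF-8, 2 bytes each in UTF-16.
  Report('ASCII', 'Hello');
  // Three CJK characters (U+65E5 U+672C U+8A9E), written as character
  // codes so the result does not depend on the source file's encoding:
  // 3 bytes each in UTF-8, 2 bytes each in UTF-16.
  Report('CJK', #$65E5#$672C#$8A9E);
end.
```

This should report 5 vs 10 bytes for the ASCII string and 9 vs 6 bytes for the CJK one, i.e. the "winner" flips with the data.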

And of course, you pay for that compactness by incurring additional processing overhead when dealing with the strings as soon as you have any non-ASCII character involved (and *some* of that overhead is incurred just IN CASE you have such non-ASCII characters).
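
As a rough illustration of that overhead, consider something as simple as counting the code points in a UTF-8 string. The sketch below is my own example, not an RTL routine: Utf8CodePointCount is a hypothetical helper, and it assumes well-formed UTF-8 with no validation. Every byte has to be tested, even when the data turns out to be pure ASCII.

```pascal
program Utf8Count;
{$APPTYPE CONSOLE}

// Hypothetical helper: counts code points by skipping UTF-8
// continuation bytes (bit pattern 10xxxxxx).
// Assumes well-formed input; no validation is attempted.
function Utf8CodePointCount(const s: UTF8String): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := 1 to Length(s) do
    if (Ord(s[i]) and $C0) <> $80 then
      Inc(Result);
end;

var
  u: UTF8String;
begin
  // "naive" with U+00EF (i with diaeresis) as the third character,
  // which takes 2 bytes in UTF-8.
  u := UTF8Encode('na' + #$00EF + 've');
  // Length gives bytes; the helper gives code points.
  Writeln(Length(u), ' bytes, ', Utf8CodePointCount(u), ' code points');
end.
```

This should print "6 bytes, 5 code points". Note that it counts code points only; composite characters still need separate handling, in UTF-8 just as in UTF-16.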