Pre Delphi 2008-9 Unicode Do's and Dont's
Lee Jenkins
2008-07-21 16:40:11 UTC
Has anyone posted information concerning do's and dont's for Unicode support in
upcoming Delphi versions?

In recent threads concerning Delphi/Unicode, I think the topic of being prepared
for Unicode has not been addressed much, at least as far as I can see.

On one side, we have applications that have already been written whose authors
are rightfully concerned about compatibility.

On the other side, we have applications which are yet to be written and do not
have much threat of being

In the middle, we have applications which are currently being written (raises
hand) and which could benefit from some suggestions on best practices, so that
they have a chance of being ported more easily when D2008/9 is finally
released.

--
Warm Regards,

Lee
Nick Hodges (Embarcadero)
2008-07-21 17:04:22 UTC
Post by Lee Jenkins
Has anyone posted information concerning do's and dont's for Unicode
support in upcoming Delphi versions?
I'll be posting some articles on this very soon.

Short list:

Don't assume that the size of a Char is one.

Don't assume that the size of an array of Char is the same as the
Length of the string held in the array of Char.
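
A minimal sketch of the two points above, written as a small console
program. The buffer name Buf and its size are made up for illustration; the
point is only that SizeOf counts bytes while Length counts elements, so the
two stop agreeing once Char is wider than one byte.

```pascal
program CharSizeSketch;
{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  Buf: array[0..15] of Char; // hypothetical buffer, 16 elements
begin
  // 1 in an ANSI build, 2 once Char becomes WideChar.
  Writeln('SizeOf(Char) = ', SizeOf(Char));
  // SizeOf counts bytes, Length counts elements.
  Writeln('SizeOf(Buf)  = ', SizeOf(Buf));
  Writeln('Length(Buf)  = ', Length(Buf));

  // Clear by byte count, not by element count.
  FillChar(Buf, SizeOf(Buf), 0);
  // Copy by element count, leaving room for the terminator.
  StrLCopy(Buf, 'Hello', Length(Buf) - 1);
  Writeln(PChar(@Buf[0]));
end.
```

Code that passes SizeOf(Buf) where a character count is expected (or the
reverse) is the kind of thing worth searching for before the switch.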
--
Nick Hodges
Delphi Product Manager - Embarcadero
http://blogs.codegear.com/nickhodges
John Herbster
2008-07-21 17:27:46 UTC
Post by Nick Hodges (Embarcadero)
Post by Lee Jenkins
Has anyone posted information concerning do's and dont's for
Unicode support in upcoming Delphi versions?
I'll be posting some articles on this very soon.
Nick, Here are a few suggestions and clarifications.
Post by Nick Hodges (Embarcadero)
Don't assume that the size of a Char is one.
Please start with the compiler op to switch def of "Char".
Post by Nick Hodges (Embarcadero)
Don't assume that the SizeOf an array of Char is the same as the
Length of the string held in the array of Char [less one].
Show us how to iterate through a string of characters with indexes.

Show us how to iterate through a string of characters with pointers.

Show us how to load and store a string from and to TStreams.

Show us how to replace a character.

Show us how to make literal constants and assign them to strings.

Show us how to pass strings to and from DLLs.

Regards, JohnH
Nick Hodges (Embarcadero)
2008-07-21 17:44:42 UTC
Post by John Herbster
Show us how to iterate through a string of characters with indexes.
Exactly as before.
Post by John Herbster
Show us how to iterate through a string of characters with pointers.
Exactly as before -- but don't assume a character is of size 1.
Post by John Herbster
Show us how to load and store a string from and to TStreams.
Exactly as before but you can't assume that the length of a string char
is 1.
Post by John Herbster
Show us how to replace a character.
Exactly as before.
Post by John Herbster
Show us how to make literal constants and assign them to strings.
Exactly as before.
Post by John Herbster
Show us how to pass strings to and from DLLs.
Just as before, but again, don't assume that Char = 1 byte.
--
Nick Hodges
Delphi Product Manager - Embarcadero
http://blogs.codegear.com/nickhodges
John Herbster
2008-07-21 18:27:25 UTC
Post by Nick Hodges (Embarcadero)
Exactly as before -- but don't assume a character is of size 1.
Thanks Nick!
Post by Nick Hodges (Embarcadero)
Post by John Herbster
Show us how to iterate through a string of characters with pointers.
Exactly as before -- but don't assume a character is of size 1.
May I presume like this?
p := @MyString[1];
Inc(p);
where MyString: string; and p: PChar;

And how expensive are these operations during CPU execution?

TIA, JohnH
Nick Hodges (Embarcadero)
2008-07-21 18:45:44 UTC
Post by John Herbster
May I presume like this?
Inc(p);
where MyStr: string; and p: PChar;
Yes -- just like before.
Post by John Herbster
And how expensive are these operations during CPU execution?
Minimal -- it's very efficient. It's pointer math, right? ;-)
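
A minimal sketch of the pointer walk being discussed, assuming a console
program and a made-up test string. It uses PChar(MyString) rather than
@MyString[1] so that an empty string does not raise a range error, and it
stops at the first #0, so it is not suitable for strings with embedded nulls.

```pascal
program PCharWalkSketch;
{$APPTYPE CONSOLE}

var
  MyString: string;
  P: PChar;
  Count: Integer;
begin
  MyString := 'Delphi'; // illustrative value
  Count := 0;
  P := PChar(MyString);
  while P^ <> #0 do
  begin
    Inc(Count);
    Inc(P); // advances by SizeOf(Char) bytes, whatever that is in this build
  end;
  // The element count is the same in both builds; only the byte count differs.
  Writeln(Count, ' elements, ', Count * SizeOf(Char), ' bytes');
end.
```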
--
Nick Hodges
Delphi Product Manager - Embarcadero
http://blogs.codegear.com/nickhodges
Paul Scott
2008-07-22 10:31:20 UTC
On Mon, 21 Jul 2008 19:45:44 +0100, Nick Hodges (Embarcadero)
Post by Nick Hodges (Embarcadero)
Post by John Herbster
May I presume like this?
Inc(p);
where MyStr: string; and p: PChar;
Yes -- just like before.
Err... Nick,

Doesn't this depend on exactly what you mean by "a character" ?


If CG had gone for UTF-32 encoding, there would indeed have been a
one-to-one correspondence between a position in the string and what most
people would think of as "a character" - ie. a CodePoint.

But since (according to the blogs) Tiburon will be using "string=UTF-16",
then any character (CodePoint) which lies outside the BMP (ie. >64K) has
to be represented by a /pair/ of 16-bit CodeValues.

So, since your blog also says that for Tiburon "Char=WideChar=word", one
external "character" may actually need two internal "characters" to
represent it.

Now while most Delphi programs hopefully won't ever see any of these
million or so extended CodePoints, anywhere that a program allocates
memory on the basis of "ExpectedNumberOfChars x SizeOf(Char)" or, like the
code snippet above, might presume that each index position in a "string"
represents "one character" in some writing system leaves a ticking
timebomb.
--
Paul Scott
Information Management Systems
Macclesfield, UK.
Thorsten Engler [NexusDB]
2008-07-22 12:04:48 UTC
Post by Paul Scott
Doesn't this depend on exactly what you mean by "a character" ?
Exactly the same thing that an AnsiString meant by "a character". Where
you could have DBCS and MBCS.
Post by Paul Scott
But since (according to the blogs) Tiburon will be using
"string=UTF-16", then any character (CodePoint) which lies outside
the BMP (ie. >64K) has to be represented by a pair of 16-bit
CodeValues.
UTF-16 is the native type for unicode character data on the Windows
platform. Any Windows API taking or returning unicode strings uses
UTF-16. Internally ALL string data under NT based Windows versions (NT,
2000, and newer) is using UTF-16. All the ANSI APIs still offered are
just wrappers which perform the conversion into UTF-16 and call the
actual native API.

The use of any other encoding as default for unicode strings on Windows
does not make sense.
Post by Paul Scott
So, since your blog also says that for Tiburon "Char=WideChar=word",
one external "character" may actually need two internal "characters"
to represent it.
Again, this is no different at all from DBCS and MBCS ANSI encodings.

The huge difference is that in UTF-16 the value ranges for singleton,
leading and trailing units do NOT overlap. That means that in almost
all cases you can ignore the fact that UTF-16 contains surrogate pairs.

In DBCS/MBCS ANSI you always have to parse strings starting from
the beginning, because you might need to skip ahead when you hit a lead
byte: one of the trailing bytes might have a value that on its
own would represent a valid singleton (with a totally different meaning
than this trailing byte in combination with its lead byte).

In UTF-16 a trailing unit seen on its own can never ever be mistaken
for a different singleton value.
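
A minimal sketch of the non-overlapping ranges, assuming the standard UTF-16
surrogate ranges ($D800..$DBFF for leading units, $DC00..$DFFF for trailing
units). The function names are made up for illustration, and WideString is
used so the program behaves the same in an ANSI or a Unicode build.

```pascal
program SurrogateRangeSketch;
{$APPTYPE CONSOLE}

// Leading (high) surrogates occupy $D800..$DBFF.
function IsLeadUnit(C: WideChar): Boolean;
begin
  Result := (Ord(C) >= $D800) and (Ord(C) <= $DBFF);
end;

// Trailing (low) surrogates occupy $DC00..$DFFF.
function IsTrailUnit(C: WideChar): Boolean;
begin
  Result := (Ord(C) >= $DC00) and (Ord(C) <= $DFFF);
end;

var
  S: WideString;
  I: Integer;
begin
  // 'a', then U+1D11E encoded as the pair $D834 $DD1E, then 'b'.
  SetLength(S, 4);
  S[1] := 'a';
  S[2] := WideChar($D834);
  S[3] := WideChar($DD1E);
  S[4] := 'b';

  // Each unit is classified by its own value; no neighbours are inspected.
  for I := 1 to Length(S) do
    if IsLeadUnit(S[I]) then
      Writeln(I, ': leading unit')
    else if IsTrailUnit(S[I]) then
      Writeln(I, ': trailing unit')
    else
      Writeln(I, ': singleton');
end.
```

Because the classification needs only the unit itself, code can start at any
index and still know where it is, which is not possible with a DBCS trail byte.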
Post by Paul Scott
Now while most Delphi programs hopefully won't ever see any of these
million or so extended CodePoints, anywhere that a program allocates
memory on the basis of "ExpectedNumberOfChars x SizeOf(Char)" or,
like the code snippet above, might presume that each index position
in a "string" represents "one character" in some writing system
leaves a ticking timebomb.
Again, Delphi doesn't have any more or less problems with this than any
other application on the Windows platform. UTF-16 is the native Unicode
encoding on Windows. Using UTF-16 is much, much less of a problem than
trying to support ANSI for DBCS or MBCS codepages.

--
Paul Scott
2008-07-22 13:16:30 UTC
On Tue, 22 Jul 2008 13:04:48 +0100, Thorsten Engler [NexusDB]
Post by Thorsten Engler [NexusDB]
Again, Delphi doesn't have any more or less problems with this than any
other application on the Windows platform. UTF-16 is the native Unicode
encoding on Windows. Using UTF-16 is much, much less of a problem than
trying to support ANSI for DBCS or MBCS codepages.
Thorsten,

I don't disagree with anything you said - and it was probably all obvious
to those who have been struggling with multiple code pages in the past.

But for the rest of us with applications in "English", there's more to
"Unicodifying an Application" than just a recompilation - even with a
liberal sprinkling of "SizeOf(Char)"
--
Paul Scott
Information Management Systems
Macclesfield, UK.
Allen Bauer (CodeGear)
2008-07-25 21:19:33 UTC
Post by Paul Scott
On Tue, 22 Jul 2008 13:04:48 +0100, Thorsten Engler [NexusDB]
Post by Thorsten Engler [NexusDB]
Again, Delphi doesn't have any more or less problems with this than
any other application on the Windows platform. UTF-16 is the native
Unicode encoding on Windows. Using UTF-16 is much, much less of a
problem than trying to support ANSI for DBCS or MBCS codepages.
Thorsten,
I don't disagree with anything you said - and it was probably all
obvious to those who have been struggling with multiple code pages
in the past.
But for the rest of us with applications in "English", there's more
to "Unicodifying an Application" than just a recompilation - even
with a liberal sprinkling of "SizeOf(Char)"
If all your application ever expected to get was "English," just
because strings now support a much larger set of codepoints doesn't
suddenly mean that your application will magically begin to get all
kinds of characters outside the normal English characters in the ASCII
range. Only if your application were deployed to some region where
there is the possibility of non-English characters being encountered
will you run into an issue. Rebuilding an application should not alter
its existing use-cases.

So as long as your application continues to get deployed into
environments that never encounter non-English characters, things should
just continue to work.
--
Allen Bauer
CodeGear/Embarcadero
Chief Scientist
http://blogs.codegear.com/abauer
Q Correll
2008-07-25 21:50:37 UTC
Allen,

| So as long as your application continues to get deployed into
| environments that never encounter non-English characters, things should
| just continue to work.

Where can I get a pair of those rose-colored glasses? ;-)
--
Q

07/25/2008 14:50:14

XanaNews Version 1.17.5.7 [Q's Salutation mod]
Allen Bauer (CodeGear)
2008-07-25 22:17:19 UTC
Post by Q Correll
Allen,
Post by Allen Bauer (CodeGear)
So as long as your application continues to get deployed into
environments that never encounter non-English characters, things
should just continue to work.
Where can I get a pair of those rose-colored glasses? ;-)
My point was that just because you have a potentially more capable
application doesn't mean that it will magically begin to use that
capacity when faced with same use-cases and scenarios it always has. If
a user always just entered English string data into an application,
having a new version of that application that accepts a wider range of
characters will not mean that this same user will just begin to enter
(or even know how to enter) non-English data.
--
Allen Bauer
CodeGear/Embarcadero
Chief Scientist
http://blogs.codegear.com/abauer
Q Correll
2008-07-26 00:22:21 UTC
Allen,

Yes, I understand.

But my "point" was that things rarely seem to work out as simply as we
expect. <g>
--
Q <keeping fingers crossed>

07/25/2008 17:20:32

XanaNews Version 1.17.5.7 [Q's Salutation mod]
Thorsten Engler [NexusDB]
2008-07-26 02:18:37 UTC
Post by Q Correll
Allen,
Yes, I understand.
But my "point" was that things rarely seem to work out as simply as
we expect. <g>
I have to say I'm with Allen here, if your application has only been
confronted with standard english characters in the past then just
switching out that application in the same environment with one which
is unicode enabled you are still only going to be confronted with
standard english characters.



--
Q Correll
2008-07-26 03:42:47 UTC
Thorsten,

| I have to say I'm with Allen here, if your application has only been
| confronted with standard english characters in the past then just
| switching out that application in the same environment with one which
| is unicode enabled you are still only going to be confronted with
| standard english characters.

I yield to the gurus! <g>
--
Q

07/25/2008 20:42:17

XanaNews Version 1.17.5.7 [Q's Salutation mod]
Ray Porter
2008-07-26 23:05:46 UTC
Post by Thorsten Engler [NexusDB]
I have to say I'm with Allen here, if your application has only been
confronted with standard english characters in the past then just
switching out that application in the same environment with one which
is unicode enabled you are still only going to be confronted with
standard english characters.
One thing I haven't heard made clear yet is possible impact on database
reads/writes. We use Oracle 10G and will eventually move to SQL Server.
I'll ask our DBA but I'm fairly certain our database is currently configured
for ansi (Latin char set). Will I need to ansify our database reads/writes
(or at least the writes) or will the TADODataset and its descendants handle
the situation transparently?

Thanks,
Ray Porter
Thorsten Engler [NexusDB]
2008-07-26 23:19:53 UTC
Post by Ray Porter
One thing I haven't heard made clear yet is possible impact on database
reads/writes. We use Oracle 10G and will eventually move to SQL Server. I'll
ask our DBA but I'm fairly certain our database is currently configured for
ansi (Latin char set). Will I need to ansify our database reads/writes (or
at least the writes) or will the TADODataset and its descendants handle the
situation transparently?
All that should just continue to work. You just keep using TField.AsString to
read/write and the classes will take care that whatever ends up in the database
does so in the right format.


--
willr
2008-07-26 16:04:23 UTC
Post by Q Correll
Allen,
| So as long as your application continues to get deployed into
| environments that never encounter non-English characters, things should
| just continue to work.
Where can I get a pair of those rose-colored glasses? ;-)
I think they are available at most programming classes -- and "extra
rose coloured" are available at "design" classes.
--
Will R
PMC Consulting
Q Correll
2008-07-26 16:58:32 UTC
willr,

| I think they are available at most programming classes -- and "extra
| rose coloured" are available at "design" classes.

<chuckle>

Regardless of what Allen and Thorsten feel will be the case, I'd still put
money on the phenomenon of "unintended consequences" popping up. <g> It
may not be with the display of any given character set, but my gut says
something unexpected will surface. Eight million lines of code is very
difficult to keep in the box when one starts messing with it. ;-)
--
Q

07/26/2008 09:53:16

XanaNews Version 1.17.5.7 [Q's Salutation mod]
Thorsten Engler [NexusDB]
2008-07-26 17:12:24 UTC
Post by Q Correll
Regardless of what Allen and Thorsten feel will be the case, I'd
still put money on the phenomenon of "unintended consequences"
popping up. <g> It may not be with the display of any given
character set, but my gut says something unexpected will surface.
Eight million lines of code is very difficult to keep in the box when
one starts messing with it. ;-)
I never said that porting code from ANSI to Unicode will be without
any issues and gotchas.

But that is NOT what Allen was talking about.

What he and I were saying is that if your current ANSI application is
only confronted with simple, single byte characters (coming from Win
API calls, or typed in by the user and so on), then converting your
application to Unicode will NOT suddenly expose you to things like
surrogate pairs.

Only if the environment in which your application runs is changed could
it find itself confronted with surrogate pairs.

But in any single possible case where your Unicode application will be
confronted with surrogate pairs, the same application as an ANSI
application would be confronted with double or even multi byte
character sets, which are much more difficult to handle than Unicode
with surrogate pairs.


--
Q Correll
2008-07-26 17:29:13 UTC
Thorsten,

| But that is NOT what Allen was talking about.

I know, I know. <g> Allen used the noun "things." I was just having a
bit of "fun" jabbing at his high optimism. I fully realize the "things"
Allen meant were related to the display of the characters. But there may
be some, if not many, other things that may be affected by the changes.
Especially until Tiburon is significantly debugged. ;-)
--
Q

07/26/2008 10:21:25

XanaNews Version 1.17.5.7 [Q's Salutation mod]
Thorsten Engler [NexusDB]
2008-07-26 17:35:24 UTC
Post by Q Correll
I know, I know. <g> Allen used the noun "things." I was just
having a bit of "fun" jabbing at his high optimism. I fully realize
the "things" Allen meant were related to the display of the
characters. But there may be some, if not many, other things that
may be affected by the changes. Especially until Tiburon is
significantly debugged. ;-)
Read Allen's post again. He was specifically talking about what types
of characters your code might get confronted with.

Supporting Unicode is *not* the *cause* for being confronted with new
types of characters.

If, because of some external change, your application should be
confronted with characters it has never seen before, then Unicode will
help you to better cope with them.


I prefer Allen working on Tiburon rather than being jabbed at for fun ;)

--
Ivan
2008-07-26 17:38:41 UTC
To me Q Correll's original post sounded like a joke and the amount of back and forth it generated
added to the fun...
Q Correll
2008-07-26 20:29:46 UTC
Ivan,

| To me Q Correll's original post sounded like a joke

Which was my intent.
--
Q

07/26/2008 13:29:04

XanaNews Version 1.17.5.7 [Q's Salutation mod]
Q Correll
2008-07-26 20:28:46 UTC
Thorsten,

Sorry I pushed one of your hot buttons. That was NOT my intent.

I think we can let it drop now.
--
Q <Remembering the joke... "...Whew! I don't think I could take a dollar's worth of that!" ;->

07/26/2008 13:25:55

XanaNews Version 1.17.5.7 [Q's Salutation mod]
TJC Support
2008-07-26 20:41:15 UTC
Post by Q Correll
I think we can let it drop now.
Aw, c'mon Q, remember where you are. You can't get off _that_ easy! :^)

Cheers,
Van
Q Correll
2008-07-26 21:20:56 UTC
TJC,

| Aw, c'mon Q, remember where you are. You can't get off that easy! :^)

<chuckle> Ah, yaaass indeedy,... there is that.
--
Q

07/26/2008 14:19:27

XanaNews Version 1.17.5.7 [Q's Salutation mod]
Nick Hodges (Embarcadero)
2008-07-26 17:32:23 UTC
Post by Thorsten Engler [NexusDB]
I never said that porting code from ANSI to Unicode will be without
any issues and gotchas.
Or, put another way, chances are you won't even notice that you are
using Unicode strings. ;-)
--
Nick Hodges
Delphi Product Manager - Embarcadero
http://blogs.codegear.com/nickhodges
TJC Support
2008-07-26 20:54:09 UTC
Post by Nick Hodges (Embarcadero)
Or, put another way, chances are you won't even notice that you are
using Unicode strings. ;-)
Hi Nick,

I think Q and I both learned a lot of lessons from the school of hard
knocks. And while I don't expect any serious problems to crop up due to
your changes in the way things work, the phrase "there's always something"
comes to mind. :^) I expect there will probably be a few issues to crop up
in two areas. 1 is my own sloppy programming. I built my primary
application over 10 years ago and have maintained and added to it over the
years. I've learned a lot in the last 10 years, and if I started the
project over from scratch, it would be a much better piece of code, both
from the architecture/design standpoint and the code quality standpoint. So
there are probably going to be some of those places in the code where I made
bad decisions about string & character handling that'll rear up their ugly
heads. And 2, I use the old Turbopower libraries extensively, and there's
nobody maintaining those now. I did take a quick look at Systools the other
day, and it looks like they were pretty careful about declaring strings as
AnsiString. I was concerned, because a lot of the string handling routines
are written in assembler, but I think they may be okay. But if I have a
_lot_ of problems with those libraries, I'll probably end up having to
abandon them in favor of more up to date stuff that's maintained.

At any rate, I look forward to my first upgrade since D7. The good news for
me is that I don't have a schedule for switching over, so it can take as
long as it takes to work through the code to get it in shape before I switch
to D2009 for production code.

Cheers,
Van Swofford
Tybee Jet Corp.
Q Correll
2008-07-26 21:30:44 UTC
TJC, (I apologize that my simplistic name-capture doesn't come up with
"Van." <g> If you could get your newsreader client to post your signature
with the standard "--" line ahead of "Van Swofford" it might work. <g>)

| I think Q and I both learned a lot of lessons from the school of hard
| knocks.

Yep. <g>

| 1 is my own sloppy programming. I built my primary application over 10
| years ago and have maintained and added to it over the years. I've
| learned a lot in the last 10 years, and if I started the project over from
| scratch, it would be a much better piece of code, both from the
| architecture/design standpoint and the code quality standpoint. So there
| are probably going to be some of those places in the code where I made bad
| decisions about string & character handling that'll rear up their ugly
| heads.

That's also a ditto for me. ("It just grew and grew and grew..." ;-)

I do think, however, I may have more potential problems with my old
components than my own code. I also use Orpheus4, a TP product as per
your concern #2.
--
Q

07/26/2008 14:22:48

XanaNews Version 1.17.5.7 [Q's Salutation mod]
TJC Support
2008-07-26 22:51:19 UTC
Post by Q Correll
TJC, (I apologize that my simplistic name-capture doesn't come up with
"Van." <g> If you could get your newsreader client to post your signature
with the standard "--" line ahead of "Van Swofford" it might work. <g>)
I'm about to switch to Xananews as soon as I get my new machine going,
hopefully in the next week. That _should_ fix it. :^)
Post by Q Correll
That's also a ditto for me. ("It just grew and grew and grew..." ;-)
Hehehe, yep, I know that one.
Post by Q Correll
I do think, however, I may have more potential problems with my old
components than my own code. I also use Orpheus4, a TP product as per
your concern #2.
Yeah, all my UI stuff is Orpheus4, and all my string stuff is Systools. And
"security" is OnGuard. That's in quotes because OG ain't all that secure,
but it's good enough for my audience. And my DB is BTree Filer. You might
say I'm pretty thoroughly TP'd. :^)

Cheers,
Van
Q Correll
2008-07-26 23:48:34 UTC
| I'm about to switch to Xananews as soon as I get my new machine going,
| hopefully in the next week. That should fix it. :^)

TJC,

Yes, I think it will. ;-)

| You might say I'm pretty thoroughly TP'd. :^)

Yes I would.

I will be working on converting O4. Perhaps we can exchange notes when
the time comes?
--
Q

07/26/2008 16:34:18

XanaNews Version 1.18.1.11 [Leonel's & Q's Mods]
David Erbas-White
2008-07-27 00:03:35 UTC
Post by Q Correll
| I'm about to switch to Xananews as soon as I get my new machine going,
hopefully in the next week. That should fix it. :^)
TJC,
Yes, I think it will. ;-)
| You might say I'm pretty thoroughly TP'd. :^)
Yes I would.
I will be working on converting O4. Perhaps we can exchange notes when
the time comes?
I'd be interested in hearing what other components you use (if
third-party) to replace them. I'm also doing what I can to get away
from using the SysTools package...

David Erbas-White

David Erbas-White
2008-07-26 23:25:31 UTC
Post by Q Correll
I do think, however, I may have more potential problems with my old
components than my own code. I also use Orpheus4, a TP product as per
your concern #2.
Ditto on the Orpheus and SysTools, and AsyncPro is still one of the best
communications libraries around for serial transfers...

David Erbas-White
David Erbas-White
2008-07-26 17:34:24 UTC
Post by willr
Post by Q Correll
Allen,
| So as long as your application continues to get deployed into
| environments that never encounter non-English characters, things should
| just continue to work.
Where can I get a pair of those rose-colored glasses? ;-)
I think they are available at most programming classes -- and "extra
rose coloured" are available at "design" classes.
Yes, but they take them away when you actually complete your first real
'working' program...

David Erbas-White
Thorsten Engler [NexusDB]
2008-07-21 19:16:05 UTC
Post by John Herbster
May I presume like this?
Inc(p);
where MyStr: string; and p: PChar;
And how expensive are these operations during CPU execution?
The Inc(p) used to add 1, now it adds 2 to the pointer. What difference
in performance compared to AnsiString/PAnsiChar do you expect?

--
Ian Boyd
2008-07-21 20:08:34 UTC
Post by Nick Hodges (Embarcadero)
Post by John Herbster
Show us how to iterate through a string of characters with pointers.
Exactly as before -- but don't assume a character is of size 1.
p: Pointer;
p := @MyString[1];
Inc(p);

?
Nick Hodges (Embarcadero)
2008-07-21 20:34:42 UTC
Post by Ian Boyd
p: Pointer;
Inc(p);
That will behave differently, since it (appears) to be assuming the
SizeOf(Char) = SizeOf(Pointer), which is no longer true.
--
Nick Hodges
Delphi Product Manager - Embarcadero
http://blogs.codegear.com/nickhodges
Thorsten Engler [NexusDB]
2008-07-21 20:35:35 UTC
Post by Ian Boyd
p: Pointer;
Inc(p);
You can't do pointer math with untyped pointers. Never worked before.
Not going to start working suddenly. "Pointer" does not know how big
whatever it points to is.

--
Dave Nottage [TeamB]
2008-07-22 10:46:36 UTC
Post by Nick Hodges (Embarcadero)
Exactly as before -- but don't assume a character is of size 1.
Post by John Herbster
Show us how to load and store a string from and to TStreams.
Exactly as before but you can't assume that the length of a string
char is 1.
Post by John Herbster
Show us how to pass strings to and from DLLs.
Just as before, but again, don't assume that Char = 1 byte.
So.. does that mean we should not assume that Char = 1 byte? <g>
--
Dave Nottage [TeamB]
John Herbster
2008-07-22 12:31:38 UTC
Nick,

Just a few more questions to help with the book you are writing. <g>
Post by Nick Hodges (Embarcadero)
Post by John Herbster
Show us how to iterate through a string of characters with indexes.
Exactly as before.
What about surrogate pairs? Do you consider such a pair to be two
characters or one?
Post by Nick Hodges (Embarcadero)
Post by John Herbster
Show us how to load and store a string from and to TStreams.
Exactly as before but you can't assume that the length of a string
char is 1.
Does the Length() function return the number of characters or number
of words. Will Length(MyUnicodeString)*SizeOf(MyUniCodeString[1])
give the required number of bytes (not including a terminator) in a
TStream?
Post by Nick Hodges (Embarcadero)
Post by John Herbster
Show us how to replace a character.
Exactly as before.
What if we are replacing a pair with a singleton?
Post by Nick Hodges (Embarcadero)
Post by John Herbster
Show us how to make literal constants and assign them to strings.
Exactly as before.
Can we use the U+ representations in constant statements?
Where do we find or look up some of the more common U+
character definitions.

Regards, JohnH
Thorsten Engler [NexusDB]
2008-07-22 12:48:26 UTC
John Herbster wrote:

I'm not Nick, but...
Post by John Herbster
What about surrogate pairs? Do you consider such a pair to be two
characters or one?
There is no difference between surrogate pairs in UTF-16 and
leading/trailing bytes you can already encounter in ANSI if the current
codepage happens to be a DBCS/MBCS. If anything, surrogate pairs are
much easier to handle because there is no overlap between singleton,
surrogate leading and trailing units.
Post by John Herbster
Does the Length() function return the number of characters or number
of words.
Just the same as with AnsiStrings or dynamic arrays. It returns the
number of elements.
Post by John Herbster
Will Length(MyUnicodeString)*SizeOf(MyUniCodeString[1])
give the required number of bytes (not including a terminator) in a
TStream?
It should, but that code is dangerous as MyUniCodeString[1] would raise
an exception if the string is empty.
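
A minimal sketch of a length-prefixed string write and read that sidesteps
the empty-string problem, assuming a TMemoryStream for the demo. The helper
names WriteStr and ReadStr are made up for illustration, and the on-disk
layout (an Integer element count followed by raw Char data) is an assumption
of this sketch, not a standard format.

```pascal
program StreamStringSketch;
{$APPTYPE CONSOLE}

uses
  Classes;

// Writes the element count, then Length(S) * SizeOf(Char) bytes of data.
procedure WriteStr(Stream: TStream; const S: string);
var
  Len: Integer;
begin
  Len := Length(S);
  Stream.WriteBuffer(Len, SizeOf(Len));
  if Len > 0 then // S[1] is only touched when the string is not empty
    Stream.WriteBuffer(S[1], Len * SizeOf(Char));
end;

// Reads back a string written by WriteStr in a build with the same SizeOf(Char).
function ReadStr(Stream: TStream): string;
var
  Len: Integer;
begin
  Stream.ReadBuffer(Len, SizeOf(Len));
  SetLength(Result, Len);
  if Len > 0 then
    Stream.ReadBuffer(Result[1], Len * SizeOf(Char));
end;

var
  MS: TMemoryStream;
begin
  MS := TMemoryStream.Create;
  try
    WriteStr(MS, 'Hello');
    WriteStr(MS, '');
    MS.Position := 0;
    Writeln(ReadStr(MS));
    Writeln(Length(ReadStr(MS)));
  finally
    MS.Free;
  end;
end.
```

Note that data written this way by an ANSI build will not read back correctly
in a Unicode build, since the byte count per element differs; files that must
survive the upgrade need an explicit encoding.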
Post by John Herbster
What if we are replacing a pair with a singleton?
Absolutely no different from handling DBCS/MBCS in ANSI already.
Post by John Herbster
Can we use the U+ representations in constant statements?
Where do we find or look up some of the more common U+
character definitions.
You can use the usual # syntax that delphi uses to specify individual
characters by their byte value. Keep in mind that U+ is Hex and # is
decimal.
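
A minimal sketch of the # syntax with a code point taken from a U+ listing,
using U+20AC (the euro sign) as the example. Plain # takes a decimal value,
and #$ takes a hexadecimal one, which lets the U+ digits be copied directly.
The constant names are made up for illustration.

```pascal
program CharLiteralSketch;
{$APPTYPE CONSOLE}

const
  EuroDec: WideChar = #8364;  // U+20AC converted to decimal
  EuroHex: WideChar = #$20AC; // the same code unit, written in hex
var
  S: WideString;
begin
  S := 'Price: ';
  S := S + EuroHex;
  S := S + '10';
  // Both constants hold the same value.
  Writeln(Ord(EuroDec) = Ord(EuroHex));
  // The euro sign occupies a single element of the string.
  Writeln(Length(S));
end.
```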
Nick Hodges (Embarcadero)
2008-07-22 13:34:58 UTC
Post by Thorsten Engler [NexusDB]
I'm not Nick, but...
..you are doing a much better job explaining it than I could. ;-)
--
Nick Hodges
Delphi Product Manager - Embarcadero
http://blogs.codegear.com/nickhodges
John Herbster
2008-07-22 14:12:44 UTC
Post by Thorsten Engler [NexusDB]
Post by John Herbster
What about surrogate pairs? Do you consider such a pair to be two
characters or one?
There is no difference between surrogate pairs in UTF-16 and
leading/trailing bytes you can already encounter in ANSI if the current
codepage happens to be a DBCS/MBCS.
I have never used the DBCS/MBCS code page
-- I think that this may be part of my communication problem.
Post by Thorsten Engler [NexusDB]
Post by John Herbster
Does the Length() function return the number of characters
or number of words.
Just the same as with AnsiStrings or dynamic arrays.
It returns the number of elements.
Is there a standard function returning the number of "characters"?
Post by Thorsten Engler [NexusDB]
Post by John Herbster
Will Length(MyUnicodeString)*SizeOf(MyUniCodeString[1])
give the required number of bytes (not including a terminator) in a
TStream?
It should, but .. MyUniCodeString[1] would raise
an exception if the string is empty.
Of course.
Post by Thorsten Engler [NexusDB]
Post by John Herbster
What if we are replacing a pair with a singleton?
Absolutely no different from handling DBCS/MBCS in ANSI already.
Whatever DBCS/MBCS is. <g>
Post by Thorsten Engler [NexusDB]
Post by John Herbster
Can we use the U+ representations in constant statements?
You can use the usual # syntax that delphi uses to specify individual
characters by their byte value. Keep in mind that U+ is Hex and # is
decimal.
Was that a *no*?
Post by Thorsten Engler [NexusDB]
Post by John Herbster
Where do we find or look up some of the more common U+
character definitions.
I appreciate your help. Rgds, JohnH
Thorsten Engler [NexusDB]
2008-07-22 14:37:56 UTC
Post by John Herbster
Is there a standard function returning the number of "characters"?
Not that I know of. But it's largely pointless. There are hardly any
cases when you really need to concern yourself with surrogate pairs.

If you really want to count the number of codepoints in a UTF-16
string (but what for?) it's pretty straightforward. Start at the
beginning; any singleton counts as 1. Any leading unit must be followed
by a trailing unit or the string is invalid (the same is true in reverse:
any trailing unit must be preceded by a leading unit or the string is
invalid). If you find a leading/trailing pair, count it as 1.
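
For illustration, the counting rule just described could be sketched in Delphi roughly as follows. This is only a sketch under assumptions: CountCodePoints is a hypothetical helper name (not an RTL routine), and the code assumes a Unicode-enabled Delphi where string is a UTF-16 UnicodeString and Char is 16 bits wide.

```delphi
program CountCodePointsDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper, not part of the RTL.
// Returns the number of code points in a UTF-16 string,
// or -1 if the string contains an unpaired surrogate.
function CountCodePoints(const S: string): Integer;
var
  I: Integer;
  W: Word;
begin
  Result := 0;
  I := 1;
  while I <= Length(S) do
  begin
    W := Ord(S[I]);
    if (W >= $D800) and (W <= $DBFF) then
    begin
      // Leading unit: the next unit must be a trailing unit.
      if (I = Length(S)) or (Ord(S[I + 1]) < $DC00) or
         (Ord(S[I + 1]) > $DFFF) then
      begin
        Result := -1;
        Exit;
      end;
      Inc(I, 2); // the pair counts as one code point
    end
    else if (W >= $DC00) and (W <= $DFFF) then
    begin
      // Trailing unit without a preceding leading unit.
      Result := -1;
      Exit;
    end
    else
      Inc(I); // singleton
    Inc(Result);
  end;
end;

begin
  WriteLn(CountCodePoints('AB'));                      // 2
  WriteLn(CountCodePoints('A' + #$D801#$DC01 + 'B'));  // 3
  WriteLn(CountCodePoints(#$DC01 + 'A'));              // -1
end.
```

Returning -1 for malformed input is just one possible design choice; raising an exception would work equally well.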

But that doesn't tell you how many glyphs that string would have if
it's displayed. There is no 1:1 relationship between codepoints and
glyphs (the things you recognize as individual elements when
represented visually).

Thing is, in all cases where you are currently not confronted with
double byte character sets or multi byte character sets with ANSI, you
won't be confronted with surrogate pairs either.

And in the overwhelming majority of cases where you ARE confronted with
DBCS/MBCS in ANSI, you are STILL not going to be confronted with
surrogate pairs in UTF-16.
Post by John Herbster
Whatever DBCS/MBCS is. <g>
That's the thing you have to work with in any but the most trivial ANSI
codepages. Double Byte Character Sets, Multi Byte Character Sets.

Pretty much the same as with surrogate pairs, except that you can have
more than 1 trailing byte, and the possible values of trailing bytes
can overlap with the values of singletons. So if you don't always parse
the strings from the beginning and react properly to the leading bytes
to interpret the following trailing bytes differently, you might
misinterpret the trailing bytes as singletons with a totally
different meaning.
Post by John Herbster
Post by Thorsten Engler [NexusDB]
Post by John Herbster
Can we use the U+ representations in constant statements?
You can use the usual # syntax that delphi uses to specify
individual characters by their byte value. Keep in mind that U+ is
Hex and # is decimal.
Was that a no?
That was a "you can use what you've always used" in Delphi.
Post by John Herbster
Post by Thorsten Engler [NexusDB]
Post by John Herbster
Where do we find the or look up some of the more common U+
character definitions.
Start > Accessories > System Tools > Character Map. Knock yourself out
(pay close attention to the status bar).


--
John Herbster
2008-07-22 16:18:48 UTC
Permalink
Post by Thorsten Engler [NexusDB]
Post by John Herbster
Is there a standard function returning the number of "characters"?
Not that I know of. But it's largely pointless. There are hardly any
cases when you really need to concern yourself with surrogate pairs.
If you really want to count the number of codepoints in an UTF-16
...
But that doesn't tell you how many glyphs that string would have if
it's displayed. There is no 1:1 relationship between codepoints and
glyphs (the things you recognize as individual elements when
represented visually).
Oh -- More definition problems:
What is a "character", if it exists, in this brave new world?
What is a "glyph"?
What is a "codepoint"?
Post by Thorsten Engler [NexusDB]
Thing is, in all cases where you are currently not confronted with
double byte character sets or multi byte character sets with ANSI,
you won't be confronted with surrogate pairs either.
And in the overwhelming majority of cases where you ARE confronted with
DBCS/MBCS in ANSI, you are STILL not going to be confronted with
surrogate pairs in UTF-16.
Post by John Herbster
Whatever DBCS/MBCS is. <g>
That's the thing you have to work with in any but the most trivial ANSI
codepages. Double Byte Character Sets, Multi Byte Character Sets.
Thank goodness for trivial!
Post by Thorsten Engler [NexusDB]
...
Post by John Herbster
Post by Thorsten Engler [NexusDB]
Post by John Herbster
Can we use the U+ representations in constant statements?
You can use the usual # syntax that delphi uses to specify
individual characters by their byte value. Keep in mind that U+ is
Hex and # is decimal.
Was that a no?
That was a "you can use what you've always used" in Delphi.
I conclude for now, that one does not ever need to use "U+"
representations in consts.
Post by Thorsten Engler [NexusDB]
Post by John Herbster
Post by Thorsten Engler [NexusDB]
Post by John Herbster
Where do we find the or look up some of the more common U+
character definitions.
Start > Accessories > System Tools > Character Map.
(pay close attention to the status bar).
Thorsten, Thank you.

I hope that Unicode will help us to communicate better. Will
Unicode do anything for decimal points and thousand separators?
Will it help formatting messages in newsgroup readers?

Regards, JohnH
Thorsten Engler [NexusDB]
2008-07-22 16:37:00 UTC
Permalink
Post by John Herbster
What is a "character", if it exists, in this brave new world?
What is a "glyph"?
What is a "codepoint"?
To quote myself:

I would like to strongly recommend that everyone heads over to

http://www.unicode.org/versions/Unicode5.1.0/

And read the Introduction:

http://www.unicode.org/versions/Unicode5.0.0/ch01.pdf

and at least fly over the General Structure:

http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf

These 2 chapters together should answer 99% of any question someone
here might have about Unicode.
Post by John Herbster
Thank goodness for trivial!
If you have no intention of supporting anything that you didn't support
so far using just an English ANSI codepage, then Unicode is
absolutely trivial for you. As far as you are concerned, the only difference
is basically going to be that chars are now two bytes instead of one
and the high byte is basically always going to be 0 for you.

Even if you want to support any of the more complex scripts, you usually
don't have to concern yourself with any of the details of Unicode. The
user enters a string; store it the way it is. You want to display it?
Hand it to the Windows API the way it is. Simple. You want to compare 2
strings? Pass them to the CompareStringW API.

There are a few Windows APIs that you might not have used before that
can be helpful:

NormalizeString:
http://msdn.microsoft.com/en-us/library/ms776395(VS.85).aspx

FoldString:
http://msdn.microsoft.com/en-us/library/cc709430(VS.85).aspx

LCMapString/LCMapStringEx:
http://msdn.microsoft.com/en-us/library/ms776290(VS.85).aspx
http://msdn.microsoft.com/en-us/library/ms776387(VS.85).aspx

The parameters might look a bit scary, but I'm pretty sure that Tiburon
will come with some form of nicer wrappers for such things.
Post by John Herbster
I conclude for now, that one does not ever need to use "U+"
representations in consts.
Use #$ instead of U+

--
Jens Mühlenhoff
2008-07-22 16:25:28 UTC
Permalink
Post by Thorsten Engler [NexusDB]
Post by John Herbster
Post by Thorsten Engler [NexusDB]
Post by John Herbster
Can we use the U+ representations in constant statements?
You can use the usual # syntax that delphi uses to specify
individual characters by their byte value. Keep in mind that U+ is
Hex and # is decimal.
Was that a no?
That was a "you can use what you've always used" in Delphi.
To put it another way, #$1A3B (hexadecimal character) is the Delphi
representation of U+1A3B. The "U+" notation would really be superfluous here.
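
As a small illustration of the # and #$ notations, something like the following should work. It is a sketch only: the constant names are made up for this example, and it assumes a Unicode-enabled Delphi in which Char is a 16-bit WideChar.

```delphi
program CharConstDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

const
  // All names below are hypothetical, chosen for this example.
  Sample      = #$1A3B;         // U+1A3B written as a hexadecimal character constant
  LetterA_Dec = #65;            // decimal notation
  LetterA_Hex = #$41;           // the same character, hexadecimal (U+0041)
  PairText    = #$D801#$DC01;   // surrogate pair U+D801, U+DC01 as two code units

begin
  WriteLn(IntToHex(Ord(Sample), 4));   // 1A3B
  WriteLn(LetterA_Dec = LetterA_Hex);  // TRUE
  WriteLn(Length(PairText));           // 2
end.
```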
--
Regards
Jens
Chad Z. Hower aka Kudzu
2008-07-22 14:52:46 UTC
Permalink
John, the concept of what is or is not a "character" is very much
language-specific. So, when asking such questions, it is good to be
very specific about what problem you are trying to solve.
For example, in Chinese a character is a word. And believe it or not,
there are more complex scenarios in other languages...
--
Keep up to date - read the IntraWeb blog!
http://www.atozed.com/intraweb/blog/
Thorsten Engler [NexusDB]
2008-07-22 15:10:48 UTC
Permalink
Post by Chad Z. Hower aka Kudzu
For example, in Chinese a character is a word. And believe it or
not, there are more complex scenarios in other languages...
Actually, you are referring to visual representation here. That's a
glyph. It's rather common in Unicode that multiple codepoints together
contribute to the visual representation of a single glyph (but again,
in cases where you are not confronted with DBCS/MBCS when using ANSI, you
are not going to be confronted with this in Unicode either).

I would like to strongly recommend that everyone heads over to

http://www.unicode.org/versions/Unicode5.1.0/

And read the Introduction:

http://www.unicode.org/versions/Unicode5.0.0/ch01.pdf

and at least fly over the General Structure:

http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf

These 2 chapters together should answer 99% of any question someone
here might have about Unicode.


--
Chad Z. Hower aka Kudzu
2008-07-22 15:42:44 UTC
Permalink
Post by Thorsten Engler [NexusDB]
Actually, you are referring to visual representation here. That's a
glyph. It's rather common in Unicode that multiple codepoints together
contribute to the visual representation of a single glyph (but again,
in cases where you are not confronted with DBCS/MBCS when using ANSI, you
are not going to be confronted with this in Unicode either).
I know that's the case for Simplified; is it the same for Traditional?
--
Keep up to date - read the IntraWeb blog!
http://www.atozed.com/intraweb/blog/
Thorsten Engler [NexusDB]
2008-07-22 16:15:40 UTC
Permalink
Post by Chad Z. Hower aka Kudzu
I know that's the case for Simplified; is it the same for Traditional?
I wasn't specifically talking about Traditional and Simplified Chinese
there but Unicode and complex scripts in general.

There exist about 70,000 codepoints for fully composed Han Ideographs.
But there are also mechanisms for decomposing them into individual
components (up to 16 can make up a single composed Ideograph) as well
as many more Ideographs which can only be described in a decomposed
fashion using Ideographic Description Sequences.

Both Simplified and Traditional Chinese (as well as Japanese, Korean
and to some extent Vietnamese) take their visual representations from
these Han Ideographs, with many Ideographs used in most of these languages
(but not always with the same meaning).

As far as Unicode is concerned there are no specific Traditional or
Simplified characters. Most Ideographs used in Simplified Chinese are
also used in Traditional Chinese. (There are some Ideographs used in
Simplified which are, well, simplified versions of Traditional
Ideographs, but mostly the simplification is based on using a single
Ideograph to represent the meaning of different Traditional Ideographs.)

There are a lot more different Ideographs used in Traditional Chinese,
so it's more often in Traditional than in Simplified that no precomposed
codepoint exists and a single Ideograph needs to be represented with an
Ideographic Description Sequence.

Cheers,
Thorsten

--
Craig Stuntz [TeamB]
2008-07-22 14:47:11 UTC
Permalink
Post by John Herbster
Is there a standard function returning the number of "characters"?
John, the concept of what is or is not a "character" is very much
language-specific. So, when asking such questions, it is good to be
very specific about what problem you are trying to solve.
--
Craig Stuntz [TeamB] · Vertex Systems Corp. · Columbus, OH
Delphi/InterBase Weblog : http://blogs.teamb.com/craigstuntz
Please read and follow Borland's rules for the user of their
server: http://support.borland.com/entry.jspa?externalID=293
Remy Lebeau (TeamB)
2008-07-22 17:29:39 UTC
Permalink
Post by John Herbster
Whatever DBCS/MBCS is. <g>
MBCS = Multi-Byte Character Set

DBCS = Double-Byte Character Set.


Gambit
Remy Lebeau (TeamB)
2008-07-22 17:26:43 UTC
Permalink
Post by John Herbster
What about surrogate pairs? Do you consider such a pair
to be two characters or one?
When indexing through a String, they are separate *physical* characters -
they are each stored in a separate WideChar, and can thus be directly
indexed individually. To retrieve a *logical* character (aka a Unicode
codepoint), they have to be interpreted together. This is no
different than using MBCS or DBCS in AnsiString.
Post by John Herbster
Does the Length() function return the number of characters
or number of words.
The meaning of Length() has not changed. It has always returned the number
of *physical* entries, not the number of *logical* entities. That is the
same whether it is used with an AnsiString, UnicodeString, dynamic array,
etc. In the case of a UnicodeString, that means the number of physically
allocated WideChar entries, not the number of encoded Unicode codepoints.
Post by John Herbster
Will Length(MyUnicodeString)*SizeOf(MyUniCodeString[1]) give
the required number of bytes (not including a terminator) in a TStream?
Yes.
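
As a rough sketch of that byte-count calculation when writing to a TStream: the helper name WriteStringData is invented for this example. Using SizeOf(Char) instead of SizeOf(MyUnicodeString[1]) sidesteps indexing into an empty string.

```delphi
program StringBytesDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Hypothetical helper: writes the raw code units of S to the stream
// (no terminator, no BOM).
procedure WriteStringData(Stream: TStream; const S: string);
var
  ByteCount: Integer;
begin
  // Number of elements times the size of one element.
  ByteCount := Length(S) * SizeOf(Char);
  // Only touch S[1] when the string is not empty.
  if ByteCount > 0 then
    Stream.WriteBuffer(S[1], ByteCount);
end;

var
  MS: TMemoryStream;
begin
  MS := TMemoryStream.Create;
  try
    WriteStringData(MS, 'AB');
    WriteStringData(MS, '');   // writes nothing
    // Prints 4 when Char is two bytes wide, 2 in an ANSI Delphi.
    WriteLn(MS.Size);
  finally
    MS.Free;
  end;
end.
```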
Post by John Herbster
Post by John Herbster
Show us how to replace a character.
Exactly as before.
What if we are replacing a pair with a singleton?

Then you would have to pull the string apart and put it back together
manually - the same as you would have had to do with MBCS and DBCS in
AnsiString. For example:

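// Index points at the first half of a pair: drop both halves.
if IsHighSurrogate(TheString[Index]) then
  TheString := Copy(TheString, 1, Index-1) + TheSingleton +
    Copy(TheString, Index+2, MaxInt)
// Index points at the second half of a pair: drop the preceding half too.
else if IsLowSurrogate(TheString[Index]) then
  TheString := Copy(TheString, 1, Index-2) + TheSingleton +
    Copy(TheString, Index+1, MaxInt)
// Plain singleton: overwrite in place.
else
  TheString[Index] := TheSingleton;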
Post by John Herbster
Can we use the U+ representations in constant statements?
I am not sure, but I think UnicodeString constants have to be UTF-16
compliant. Unless you specify them in UCS4/UTF-32 and let the compiler
convert for you at runtime.
Post by John Herbster
Where do we find the or look up some of the more common U+ character
definitions.
The Unicode standard.


Gambit
Lucian Radulescu
2008-07-22 13:07:47 UTC
Permalink
If I have old code like:

NameRecType = packed record
Gender : Byte;
SurName : String[70];
FirName : String[20];
MidNames: String[30];
DOB : TDateTime;
end;

and I blockread/blockwrite vars of NameRecType from untyped files, do I
need to change anything or it will just compile and I don't have to
worry about it?

Lucian
Thorsten Engler [NexusDB]
2008-07-22 13:20:52 UTC
Permalink
Post by Lucian Radulescu
NameRecType = packed record
Gender : Byte;
SurName : String[70];
FirName : String[20];
MidNames: String[30];
DOB : TDateTime;
end;
and I blockread/blockwrite vars of NameRecType from untyped files, do
I need to change anything or it will just compile and I don't have to
worry about it?
The shortstring type has not changed in any way.
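
A quick way to convince yourself is to check the record size, which depends only on the short string declarations. The sketch below reuses the record from the question; each String[N] field takes N + 1 bytes (one length byte plus N single-byte characters).

```delphi
program ShortStringRecordDemo;

{$APPTYPE CONSOLE}

type
  NameRecType = packed record
    Gender  : Byte;         // 1 byte
    SurName : String[70];   // 71 bytes
    FirName : String[20];   // 21 bytes
    MidNames: String[30];   // 31 bytes
    DOB     : TDateTime;    // 8 bytes
  end;

begin
  // 1 + 71 + 21 + 31 + 8 = 132
  WriteLn(SizeOf(NameRecType));  // 132
end.
```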

--
Serge Dosyukov (Dragon Soft)
2008-07-21 18:24:40 UTC
Permalink
1) A few functions expect PAnsiChar/PWideChar instead of
AnsiChar/WideChar (Windows API)
2) When working with the Windows API, be aware of what you are passing
around (Windows messages)
3) Use Length()
4) P-strings are still #0 terminated, but instead of #00, you might see
#0000.

In Delphi 7:
var
  LC: Char;
  LC2: WideChar;
  LC3: AnsiChar;
begin
  ShowMessage(IntToStr(SizeOf(LC)) + ', ' + IntToStr(SizeOf(LC2)) + ', ' +
    IntToStr(SizeOf(LC3)));
end;

gives "1, 2, 1"

whereas now you may get

"2, 2, 1"
Post by Lee Jenkins
Has anyone posted information concerning do's and dont's for
Unicode support in upcoming Delphi versions?
John Herbster
2008-07-21 18:32:23 UTC
Permalink
Post by Serge Dosyukov (Dragon Soft)
1) A few functions expect PAnsiChar/PWideChar instead of
AnsiChar/WideChar (Windows API)
What are the type names for Unicode strings and chars?
What is the SizeOf() for a Unicode char variable?
--JohnH
Nick Hodges (Embarcadero)
2008-07-21 18:47:11 UTC
Permalink
Post by John Herbster
What are the type names for Unicode strings and chars?
string aliases to UnicodeString
PChar aliases to PWideChar
Post by John Herbster
What is the SizeOf() for a Unicode char variable?
SizeOf(Char) is now 2.
--
Nick Hodges
Delphi Product Manager - Embarcadero
http://blogs.codegear.com/nickhodges
John Herbster
2008-07-21 19:01:27 UTC
Permalink
Post by Nick Hodges (Embarcadero)
Post by John Herbster
What is the SizeOf() for a Unicode char variable?
SizeOf(Char) is now 2.
Not up to 4 bytes?
How can you encode 100,000 characters in only 2-bytes when 2^(2*8) = 65536?
See
http://en.wikipedia.org/wiki/Unicode

If you mean UTF-8, why not call it UTF-8?

--JohnH
Tim Young [Elevate Software]
2008-07-21 19:16:47 UTC
Permalink
John,

<< If you mean UTF-8, why not call it UTF-8? >>

It's UTF-16 (Word-sized characters), the same as with Windows 2000 and
later. It covers most of the character sets out there, but requires
surrogate pairs for more extensive character sets.
--
Tim Young
Elevate Software
www.elevatesoft.com
John Herbster
2008-07-21 19:20:42 UTC
Permalink
Post by Tim Young [Elevate Software]
It's UTF-16 (Word-sized characters), the same as with Windows 2000 and
Tim,

Let's try to pin some definitions down.

According to http://en.wikipedia.org/wiki/UTF-16

"UTF-16 (16-bit Unicode Transformation Format) is a variable-length
character encoding for Unicode"

If Windows and the new Delphi really do use UTF-16, how do they
handle the variable-length character encodings?

Rgds, JohnH
Thorsten Engler [NexusDB]
2008-07-21 19:33:14 UTC
Permalink
Post by John Herbster
"UTF-16 (16-bit Unicode Transformation Format) is a variable-length
character encoding for Unicode"
If Windows and the new Delphi really do use UTF-16, how do they
handle the variable-length character encodings?
In pretty much the same way that windows and delphi handle MBCS ANSI
codepages currently.

See http://en.wikipedia.org/wiki/Multi-byte_character_set

"UTF-16 was devised to break free of the 65,536-character limit of the
original Unicode (1.x) without breaking compatibility with the 16-bit
encoding. In UTF-16, singletons have the range 0000-D7FF and E000-FFFF,
lead units the range D800-DBFF and trail units the range DC00-DFFF. The
lead and trail units, called in Unicode terminology high surrogates and
low surrogates respectively, map 1024×1024 or 1,048,576 numbers, making
for a maximum of possible 1,114,112 codepoints in Unicode."

--
Remy Lebeau (TeamB)
2008-07-21 20:58:27 UTC
Permalink
Post by John Herbster
Not up to 4 bytes?
No. "Char" will now be an alias for WideChar, whereas it was an alias for
AnsiChar in previous versions. Thus SizeOf(Char) will be 2 now.
Post by John Herbster
How can you encode 100,000 characters in only 2-bytes when
2^(2*8) = 65536?
The new UnicodeString type will use UTF-16 (just like WideString does) in
order to match how Windows implements Unicode.

In UTF-16, Unicode code points (logical characters) less than $10000 can be
encoded using their original value as-is in a single WideChar. Unicode
codepoints above $10000, inclusive, have to be encoded as two WideChars
working together (known as a "surrogate pair"). The use of surrogate pairs
allows UTF-16 to support up to 2,097,152 Unicode codepoints. Anything more
than that requires UTF-32 instead. Which Tiburon will also support, via a
separate UCS4String (and UCS4Char) data type, which are 32-bit.


Gambit
Remy Lebeau (TeamB)
2008-07-21 21:23:58 UTC
Permalink
The use of surrogate pairs allows UTF-16 to support up to
2,097,152 Unicode codepoints.
Correction: UTF-16 supports 1,112,064 Unicode codepoints ($00000000 -
$0010FFFF, minus $0000D800 - $0000DFFF which are reserved).
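
The arithmetic behind these numbers, and the way a code point at or above $10000 is split into a pair, can be sketched as follows. EncodeSurrogatePair is a hypothetical helper written for this example, not an RTL routine.

```delphi
program SurrogateEncodeDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: splits a code point in $10000..$10FFFF
// into its leading and trailing surrogate.
procedure EncodeSurrogatePair(CodePoint: Cardinal; out Lead, Trail: Word);
var
  Offset: Cardinal;
begin
  Offset := CodePoint - $10000;               // 20-bit value
  Lead   := Word($D800 + (Offset shr 10));    // upper 10 bits
  Trail  := Word($DC00 + (Offset and $3FF));  // lower 10 bits
end;

var
  L, T: Word;
begin
  EncodeSurrogatePair($10401, L, T);
  WriteLn(IntToHex(L, 4), ' ', IntToHex(T, 4));  // D801 DC01

  // 1024 leading values times 1024 trailing values
  WriteLn(1024 * 1024);                          // 1048576
  // plus the 65,536 values of the 16-bit range: $000000..$10FFFF
  WriteLn($10000 + 1024 * 1024);                 // 1114112
  // minus the 2,048 surrogate values $D800..$DFFF
  WriteLn($110000 - ($DFFF - $D800 + 1));        // 1112064
end.
```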


Gambit
John Herbster
2008-07-21 21:39:30 UTC
Permalink
Remy, Thorsten, et. al,
"Char" will now be an alias for WideChar, ...
Thus SizeOf(Char) will now be 2.
Thanks for that info.
Post by John Herbster
How can you encode 100,000 characters in only 2-bytes
when 2^(2*8) = 65536?
The new UnicodeString type will use UTF-16 (just like WideString does) in
order to match how Windows implements Unicode.
In UTF-16, Unicode code points (logical characters) less than $10000 can be
encoded using their original value as-is in a single WideChar. Unicode
codepoints above $10000, inclusive, have to be encoded as two WideChars
working together (known as a "surrogate pair"). The use of surrogate pairs
allows UTF-16 to support up to 2,097,152 Unicode codepoints. Anything more
than that requires UTF-32 instead. Which Tiburon will also support, via a
separate UCS4String (and UCS4Char) data type, which are 32-bit.
Then for "surrogate pairs" which require two WideChars for their
representation, it seems to be that "exactly as before" character
indexing will require sometimes stepping over two WideChars instead
of one.

Are the individual WideChars stored big or little endian?
If little endian in Intel RAM, how are they stored in disk "text"
files and communicated over wires?

What about the surrogate pairs? Is the low or high part of the pair
at the lower address? And ditto for disk files and communications?
UTF-16 was devised to break free of the 65,536-character limit of the original Unicode (1.x) without breaking compatibility with the 16-bit encoding. In UTF-16, singletons have the range 0000-D7FF and E000-FFFF, lead units the range D800-DBFF and trail units the range DC00-DFFF. The lead and trail units, called in Unicode terminology high surrogates and low surrogates respectively, map 1024×1024 or 1,048,576 numbers, making for a maximum of possible 1,114,112 codepoints in Unicode.
Retrieved from "http://en.wikipedia.org/wiki/Variable-width_encoding"

Does that mean that UTF-16 characters are limited to 4-bytes?

TIA for the education, JohnH
Thorsten Engler [NexusDB]
2008-07-21 21:55:43 UTC
Permalink
Post by John Herbster
Then for "surrogate pairs" which require two WideChars for their
representation, it seems to be that "exactly as before" character
indexing will require sometimes stepping over two WideChars instead
of one.
UTF16 has the huge advantage that the values for singletons and leading
and trailing surrogates do not overlap:

"In UTF-16, singletons have the range 0000-D7FF and E000-FFFF, lead
units the range D800-DBFF and trail units the range DC00-DFFF".

As a result of this, for code like "split this string into individual
strings at each \" and a lot of other string processing that's
happening on a per character basis, you don't have to worry about the
surrogate pairs because the trailing unit can never be mistaken for
some other valid character.
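
A tiny sketch of why this works: scanning code unit by code unit for '\' ($005C) can never hit half of a surrogate pair, because both halves lie in $D800..$DFFF. CountPieces is a hypothetical helper for this example, and a Unicode-enabled Delphi is assumed.

```delphi
program SplitDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: counts the pieces produced by
// splitting S at each backslash.
function CountPieces(const S: string): Integer;
var
  I: Integer;
begin
  Result := 1;
  for I := 1 to Length(S) do
    // Surrogates ($D800..$DFFF) can never compare equal to '\' ($005C),
    // so no special handling of pairs is needed here.
    if S[I] = '\' then
      Inc(Result);
end;

begin
  // 'a', a surrogate pair, and 'b', separated by two backslashes
  WriteLn(CountPieces('a\' + #$D801#$DC01 + '\b'));  // 3
end.
```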
Post by John Herbster
Are the individual WideChars stored big or little endian?
In memory, usually whatever your current hardware platform prefers.
Post by John Herbster
If little endian in Intel RAM, how are they stored in disk "text"
files and communicated over wires?
That's what a BOM is for: http://en.wikipedia.org/wiki/Byte_Order_Mark

All UTF16 strings that go "over the wire" or onto disk should be
prefixed by a BOM. Either U+FFFE or U+FEFF, depending on the byte order
of the following data.
Post by John Herbster
What about the surrogate pairs? Is the low or high part of the pair
at the lower address? And ditto for disk files and communications?
The order of the surrogate pairs always remains the same, the leading
one comes before the trailing one.
Post by John Herbster
Does that mean that UTF-16 characters are limited to 4-bytes?
That's why they are called "surrogate pairs" and not "surrogate
sequences" or something like that. You either have a singleton or a
pair of a leading and trailing surrogate.


--
John Herbster
2008-07-21 22:37:45 UTC
Permalink
UTF16 has the huge advantage that the values for singletons and leading
I see!
Post by John Herbster
Are the individual WideChars stored big or little endian?
... usually whatever your current hardware platform prefers.
Am I correct that "U+" is just a prefix indicating for Unicode
representation in hexadecimal. Is U+D801DC01 a surrogate pair?
Or is it written U+D801, U+DC01?
Post by John Herbster
If little endian in Intel RAM, how are they stored in disk
"text" files and communicated over wires?
That's what a BOM is for: http://en.wikipedia.org/wiki/Byte_Order_Mark
All UTF16 strings that go "over the wire" or onto disk should be
prefixed by a BOM. Either U+FFFE or U+FEFF, depending on the
byte order of the following data.
I do not understand this. If I have MyAnsiString = 'AB' and assign
it to MyWideString in RAM on a PC with an Intel CPU, then I presume
that I have in increasing memory addresses $41, $00, $42, and $00,
or if you please U+0041 and U+0042.

Now if I sent this to a file, is this byte sequence valid?
Big-endian: $FE, $FF, $41, $00, $42, $00
And this one valid, too?
Little-endian: $FF, $FE, $00, $41, $00, $42
And if so, wouldn't the U+ representation in either case be
U+FEFF, U+0041, U+0042.

TIA, JohnH
John Herbster
2008-07-21 22:43:05 UTC
Permalink
(Correction)
UTF16 has the huge advantage that the values for singletons and leading
I see!
Post by John Herbster
Are the individual WideChars stored big or little endian?
... usually whatever your current hardware platform prefers.
Am I correct that "U+" is just a prefix indicating for Unicode
representation in hexadecimal. Is a surrogate pair written
U+D801, U+DC01?
Post by John Herbster
If little endian in Intel RAM, how are they stored in disk
"text" files and communicated over wires?
That's what a BOM is for: http://en.wikipedia.org/wiki/Byte_Order_Mark
All UTF16 strings that go "over the wire" or onto disk should be
prefixed by a BOM. Either U+FFFE or U+FEFF, depending on the
byte order of the following data.
I do not understand this. If I have MyAnsiString = 'AB' and assign
it to MyWideString in RAM on a PC with an Intel CPU, then I presume
that I have in increasing memory addresses $41, $00, $42, and $00,
or if you please U+0041 and U+0042.

Now if I sent this to a file, is this byte sequence valid?
Big-endian: $FE, $FF, $00, $41, $00, $42
And this one valid, too?
Little-endian: $FF, $FE, $41, $00, $42, $00
And if so, wouldn't the U+ representation in either case be
U+FEFF, U+0041, U+0042.

TIA, JohnH
Thorsten Engler [NexusDB]
2008-07-21 22:55:09 UTC
Permalink
Post by John Herbster
Am I correct that "U+" is just a prefix indicating for Unicode
representation in hexadecimal. Is a surrogate pair written
U+D801, U+DC01?
Yes.
Post by John Herbster
I do not understand this. If I have MyAnsiString = 'AB' and assign
it to MyWideString in RAM on a PC with an Intel CPU, then I presume
that I have in increasing memory addresses $41, $00, $42, and $00,
or if you please U+0041 and U+0042.
Yes.
Post by John Herbster
Now if I sent this to a file, is this byte sequence valid?
Big-endian: $FE, $FF, $00, $41, $00, $42
Yes.
Post by John Herbster
And this one valid, too?
Little-endian: $FF, $FE, $41, $00, $42, $00
Yes.
Post by John Herbster
And if so, wouldn't the U+ representation in either case be
U+FEFF, U+0041, U+0042.
Yes, I was mistaken. It is always U+FEFF, which can be FF FE or FE FF
depending on the endianness. I shouldn't have used U+FFFE except to say
that "The Unicode value U+FFFE is guaranteed never to be assigned as a
Unicode character; this implies that in a Unicode context the 0xFF,
0xFE byte pattern can only be interpreted as the U+FEFF character
expressed in little-endian byte order (since it could not be a U+FFFE
character expressed in big-endian byte order)."
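
A reader of such a file could check the first two bytes along these lines. This is a minimal sketch: TByteOrder and DetectUtf16ByteOrder are names invented for this example, and it deliberately ignores other encodings whose BOM starts with the same bytes.

```delphi
program BomDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  // Hypothetical type for this sketch.
  TByteOrder = (boUnknown, boLittleEndian, boBigEndian);

// Hypothetical helper: inspects the first two bytes of UTF-16 data.
function DetectUtf16ByteOrder(const Data: array of Byte): TByteOrder;
begin
  Result := boUnknown;
  if Length(Data) < 2 then
    Exit;
  if (Data[0] = $FF) and (Data[1] = $FE) then
    Result := boLittleEndian   // U+FEFF stored low byte first
  else if (Data[0] = $FE) and (Data[1] = $FF) then
    Result := boBigEndian;     // U+FEFF stored high byte first
end;

begin
  // BOM followed by 'AB' in each byte order, then data without a BOM
  WriteLn(DetectUtf16ByteOrder([$FF, $FE, $41, $00, $42, $00]) = boLittleEndian);  // TRUE
  WriteLn(DetectUtf16ByteOrder([$FE, $FF, $00, $41, $00, $42]) = boBigEndian);     // TRUE
  WriteLn(DetectUtf16ByteOrder([$41, $00]) = boUnknown);                           // TRUE
end.
```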

--
Remy Lebeau (TeamB)
2008-07-21 23:44:58 UTC
Permalink
"John Herbster" <herb-sci1_AT_sbcglobal.net> wrote in message news:488510ec$***@newsgroups.borland.com...
(Correction)
Post by John Herbster
Am I correct that "U+" is just a prefix indicating for
Unicode representation in hexadecimal.
Yes.
Post by John Herbster
If I have MyAnsiString = 'AB' and assign it to MyWideString
in RAM on a PC with an Intel CPU, then I presume that I have
in increasing memory addresses $41, $00, $42, and $00
Yes. That would be UTF-16 in Little Endian.
Post by John Herbster
Now if I sent this to a file, is this byte sequence valid?
Big-endian: $FE, $FF, $00, $41, $00, $42

Yes.
Post by John Herbster
And this one valid, too?
Little-endian: $FF, $FE, $41, $00, $42, $00
Yes.
Post by John Herbster
And if so, wouldn't the U+ representation in either case
be
U+FEFF, U+0041, U+0042.
Yes, it would.


Gambit
Ivan
2008-07-21 23:10:41 UTC
Permalink
Post by Thorsten Engler [NexusDB]
UTF16 has the huge advantage that the values for singletons and leading
Now the advantage over UTF-8 is finally becoming clear. Thanks so much Thorsten, very helpful as usual.
Pieter Zijlstra
2008-07-21 22:01:27 UTC
Permalink
Post by John Herbster
Then for "surrogate pairs" which require two WideChars for their
representation, it seems to be that "exactly as before" character
indexing will require sometimes stepping over two WideChars instead
of one.
It is the same as before, where multiple bytes were needed to display
one character in, for instance, Asian Windows versions. Most of the time
you don't care; you just read/write a number of bytes (with Unicode,
words) and leave it to the Windows API how this is displayed.
--
Pieter
Remy Lebeau (TeamB)
2008-07-21 23:35:35 UTC
Permalink
Post by John Herbster
Then for "surrogate pairs" which require two WideChars for
their representation, it seems to be that "exactly as before"
character indexing will require sometimes stepping over
two WideChars instead of one.
Potentially. But that requirement has existed since WideString was
introduced. It does not change now that UnicodeString is being added. If
you don't need to act on individual codepoints in your code, then you don't
have to worry about treating surrogates separately. Otherwise, you
generally would have to convert from UTF-16 to UTF-32 before you could work
with codepoints correctly anyway.
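
For what it's worth, a manual UTF-16 to UTF-32 conversion might look roughly like this. It is a sketch only: TCodePoints and ToCodePoints are hypothetical names for this example, a Unicode-enabled Delphi is assumed, and unpaired surrogates are simply copied through unchanged rather than reported.

```delphi
program DecodeDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  // Hypothetical type for this sketch: one element per code point.
  TCodePoints = array of Cardinal;

// Hypothetical helper: converts UTF-16 code units to code points.
function ToCodePoints(const S: string): TCodePoints;
var
  I, N: Integer;
  W, W2: Cardinal;
begin
  SetLength(Result, Length(S));  // upper bound, trimmed below
  N := 0;
  I := 1;
  while I <= Length(S) do
  begin
    W := Ord(S[I]);
    Inc(I);
    // A leading unit followed by a trailing unit forms one code point.
    if (W >= $D800) and (W <= $DBFF) and (I <= Length(S)) then
    begin
      W2 := Ord(S[I]);
      if (W2 >= $DC00) and (W2 <= $DFFF) then
      begin
        W := $10000 + ((W - $D800) shl 10) + (W2 - $DC00);
        Inc(I);
      end;
    end;
    Result[N] := W;
    Inc(N);
  end;
  SetLength(Result, N);
end;

var
  CP: TCodePoints;
begin
  CP := ToCodePoints('A' + #$D801#$DC01);
  WriteLn(Length(CP));                   // 2
  WriteLn(IntToHex(Integer(CP[1]), 5));  // 10401
end.
```

Once the data is in this form, every array element is a whole code point, so indexing and replacing no longer need any surrogate handling.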
Post by John Herbster
Are the individual WideChars stored big or little endian?
WideString and UnicodeString use Big Endian, as that is the default endian
for Intel platforms.
Post by John Herbster
If little endian in Intel RAM, how are they stored in disk "text"
files and communicated over wires?
It is the coder's responsibility to handle endian issues in those cases.
That is nothing new.
Post by John Herbster
What about the surrogate pairs? Is the low or high part of the pair
at the lower address?
The High surrogate always appears in front of the Low surrogate in the
string, but each individual surrogate in the pair is affected by the endian
used for the entire string. This is clearly described in RFC 2781.
Post by John Herbster
And ditto for disk files and communications?
That is also the coder's responsibility to handle.
Post by John Herbster
Does that mean that UTF-16 characters are limited to 4-bytes?
Unicode itself is limited to 4 bytes per codepoint (encoded using UTF-32
and/or UCS4). There is no codepoint defined above $7FFFFFFF yet.

However, UTF-16 is limited to 3-byte codepoints, since the highest codepoint
it can handle is $10FFFF.


Gambit
Thorsten Engler [NexusDB]
2008-07-21 23:56:10 UTC
Permalink
Post by Remy Lebeau (TeamB)
WideString and UnicodeString use Big Endian, as that is the default
endian for Intel platforms.
Eh. Little-endian is default on x86. Lowest byte first.
Which is why U+0041 will be $41 $00 in memory. But it doesn't really
matter much either way because in most cases you are not going to
access unicode strings byte by byte.


--
Xavier
2008-07-22 10:43:43 UTC
Permalink
Post by Remy Lebeau (TeamB)
Unicode itself is limited to 4 bytes per codepoint (encoded using UTF-32
and/or UCS4). There is no codepoint defined above $7FFFFFFF yet.
There are no code points defined above $10FFFF.
Post by Remy Lebeau (TeamB)
However, UTF-16 is limited to 3-byte codepoints, since the highest codepoint
it can handle is $10FFFF.
Planes 3 through 13 ($30000–$DFFFF, 720895 code point slots) are
currently unallocated. That's most of the Unicode space, and much of it
is likely to never be filled.
Remy Lebeau (TeamB)
2008-07-22 17:35:37 UTC
Permalink
Post by Xavier
There are no code points defined above $10FFFF.
Why do UTF-32 and UCS4 exist if all available codepoints fit within the
UTF-16 space?
Post by Xavier
Planes 3 through 13 ($30000–$DFFFF, 720895 code point slots)
are currently unallocated. That's most of the Unicode space, and
much of it is likely to never be filled.
Older UTF-8 specs I have seen defined rules for handling codepoints up to
$7FFFFFFF (encoded up to 6 bytes in UTF-8). I just looked at the latest RFC
for UTF-8 and it has dropped any mention of codepoints above $10FFFF (thus
encoding up to 4 bytes instead). Interesting.


Gambit
Xavier
2008-07-23 00:31:08 UTC
Permalink
Post by Remy Lebeau (TeamB)
Post by Xavier
There are no code points defined above $10FFFF.
Why do UTF-32 and UCS4 exist if all available codepoints fit within the
UTF-16 space?
Because they allow encoding any code point with a single code unit.
Thorsten Engler [NexusDB]
2008-07-23 00:40:36 UTC
Permalink
Codepoints, code points, code units, elements, glyphs, physical,
logical, words, bytes, surrogates and other pairs, leading, trailing,
and occasionally characters. I need a glossary.
Again, might I refer you to www.unicode.org?
They have pretty decent documentation.

--
John Herbster
2008-07-23 00:38:16 UTC
Permalink
Codepoints, code points, code units, elements, glyphs, physical,
logical, words, bytes, surrogates and other pairs, leading, trailing,
and occasionally characters. I need a glossary.
TJC Support
2008-07-23 00:58:28 UTC
Permalink
Codepoints, code points, code units, elements, glyphs, physical,
logical, words, bytes, surrogates and other pairs, leading, trailing,
and occasionally characters. I need a glossary.
I need a beer.... :^)

Cheers,
Van
Q Correll
2008-07-23 01:07:13 UTC
Permalink
TJC,

| I need a beer.... :^)

Passing a virtual XX to TJC... <clink>
--
Q

07/22/2008 18:06:49

XanaNews Version 1.17.5.7 [Q's Salutation mod]
TJC Support
2008-07-23 01:34:22 UTC
Permalink
Post by Q Correll
TJC,
| I need a beer.... :^)
Passing a virtual XX to TJC... <clink>
Ah, that's better! And after a 2 day debugging session, too. Thanks!

Van
Maël Hörz
2008-07-23 18:53:06 UTC
Permalink
Post by Remy Lebeau (TeamB)
Why do UTF-32 and UCS4 exist if all available codepoints fit within
the UTF-16 space?
Because it might be more efficient (for example, for indexing) and easier to
use, since you do not have to deal with surrogates and basically a DWORD = a
char.
Post by Remy Lebeau (TeamB)
Older UTF-8 specs I have seen defined rules for handling codepoints
up to $7FFFFFFF (encoded up to 6 bytes in UTF-8). I just looked at
the latest RFC for UTF-8 and it has dropped any mention of codepoints
above $10FFFF (thus encoding up to 4 bytes instead). Interesting.
If I remember correctly, it was about how much can be technically
encoded using UTF-8 and what part of this possible range will really be
used. And since UTF-32/16/8 should be able to encode the same Unicode
codepoints a range was chosen to be able to represent them in any UTF
encoding.
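
To make the surrogate remark concrete, here is a small sketch of how a code point above $FFFF is split into two UTF-16 code units. The function name ToSurrogates is hypothetical; the arithmetic is the standard UTF-16 rule (subtract $10000, then split the remaining 20 bits into two 10-bit halves).

```pascal
program SurrogateDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Splits a supplementary code point ($10000..$10FFFF) into its UTF-16
// surrogate pair. Returns False (and zeroes) for anything outside that range.
function ToSurrogates(CP: Cardinal; out Hi, Lo: Word): Boolean;
begin
  Hi := 0;
  Lo := 0;
  Result := (CP >= $10000) and (CP <= $10FFFF);
  if Result then
  begin
    Dec(CP, $10000);                      // leaves a 20-bit value
    Hi := Word($D800 or (CP shr 10));     // leading surrogate: top 10 bits
    Lo := Word($DC00 or (CP and $3FF));   // trailing surrogate: low 10 bits
  end;
end;

var
  Hi, Lo: Word;
begin
  if ToSurrogates($1D11E, Hi, Lo) then    // U+1D11E MUSICAL SYMBOL G CLEF
    Writeln(IntToHex(Hi, 4), ' ', IntToHex(Lo, 4));   // prints "D834 DD1E"
end.
```

With UTF-32 this step disappears, since every code point is a single 32-bit code unit; with UTF-16 a string holding this one code point has a length of two elements.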
Remy Lebeau (TeamB)
2008-07-23 19:45:50 UTC
Permalink
Post by Remy Lebeau (TeamB)
Older UTF-8 specs I have seen defined rules for handling codepoints
up to $7FFFFFFF (encoded up to 6 bytes in UTF-8).
I double-checked and I was thinking of the UTF-8 encoding originally defined
in ISO/IEC 10646. UTF-8 was later updated in RFC 3629 to limit the valid
range of codepoints to match the formal definition in the Unicode standard,
which limits codepoints to a max of $10FFFF. So realistically, UTF-8
encoding of either UCS or Unicode will never have more than 4 bytes, but
ISO/IEC 10646 did define encoding rules for extra codepoints that have since
been restricted in Unicode and are no longer valid.


Gambit
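
The 4-byte limit follows from the code point ranges in RFC 3629. A minimal sketch, assuming a hypothetical helper named Utf8ByteCount, that returns how many bytes UTF-8 needs for a given code point:

```pascal
program Utf8LengthDemo;
{$APPTYPE CONSOLE}

// Number of bytes UTF-8 (as restricted by RFC 3629) uses for one code point.
// Returns 0 for values that cannot be encoded: surrogates and anything
// above $10FFFF.
function Utf8ByteCount(CP: Cardinal): Integer;
begin
  if (CP >= $D800) and (CP <= $DFFF) then
    Result := 0
  else if CP <= $7F then
    Result := 1
  else if CP <= $7FF then
    Result := 2
  else if CP <= $FFFF then
    Result := 3
  else if CP <= $10FFFF then
    Result := 4
  else
    Result := 0;
end;

begin
  Writeln(Utf8ByteCount($41));      // 1
  Writeln(Utf8ByteCount($E9));      // 2
  Writeln(Utf8ByteCount($20AC));    // 3
  Writeln(Utf8ByteCount($1D11E));   // 4
  Writeln(Utf8ByteCount($110000));  // 0, outside the Unicode range
end.
```

The older 5- and 6-byte forms would only be reached by values above $10FFFF, which is why they dropped out of the current definition.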
Serge Dosyukov (Dragon Soft)
2008-07-21 18:58:37 UTC
Permalink
http://blogs.codegear.com/nickhodges

1) string, char
2) We still sit on top of the Windows API, so "Wide strings consist of
16-bit Unicode characters". Could be different for 64-bit processors.

But "WideChar would suddenly grow in size"

http://en.wikipedia.org/wiki/Unicode
http://delphi.about.com/od/beginners/l/aa071800a.htm
http://www.codexterity.com/delphistrings.htm

As you can see from my code sample, I was getting widechar/widestring
representation: in case of char, it is a widechar, in case of the string it
is a widestring.

Rule of thumb. Stay away from assuming specific size of the string
representation in bytes, count its length in chars instead. Then if you need
exact size, multiply it by the size of the char being stored.
Post by Serge Dosyukov (Dragon Soft)
1) A few functions expect PAnsiChar/PWideChar instead of
AnsiChar/WideChar (Windows API)
What are the type names for Unicode strings and chars?
What is the SizeOf() for a Unicode char variable?
--JohnH
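
The rule of thumb above (count in chars, multiply by the char size when bytes are needed) can be written down once and reused. A minimal sketch; ByteSize is a hypothetical helper name, not an RTL function.

```pascal
program ByteSizeDemo;
{$APPTYPE CONSOLE}

// Size in bytes of the character data of S, without assuming that a Char
// is one byte wide.
function ByteSize(const S: string): Integer;
begin
  Result := Length(S) * SizeOf(Char);
end;

begin
  // 6 where Char is a 1-byte AnsiChar (Delphi 2007 and earlier),
  // 12 where Char is a 2-byte WideChar.
  Writeln(ByteSize('Delphi'));
end.
```

Code written this way compiles unchanged whichever width Char has, which is the point of the advice.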
Chad Z. Hower aka Kudzu
2008-07-22 04:13:31 UTC
Permalink
http://www.kudzuworld.com/blogs/tech/20080722A.aspx
--
Keep up to date - read the IntraWeb blog!
http://www.atozed.com/intraweb/blog/
Post by Lee Jenkins
Has anyone posted information concerning do's and dont's for Unicode
support in upcoming Delphi versions?
In recent threads concerning Delphi/Unicode, I think the topic of being
prepared for Unicode has not been addressed so much, at least as far as I
can see.
On one side, we have applications that have already been written whose
authors are rightfully concerned about compatibility.
On the other side, we have applications which are yet to be written and do
not have much threat of being
In the middle, we have applications which are currently being written
(raises hand) which could benefit from some suggestions on best practices
to give the applications currently being written to have a chance of being
ported more easily when D2008/9 is finally released.
--
Warm Regards,
Lee
Lee Jenkins
2008-07-22 04:23:27 UTC
Permalink
Post by Chad Z. Hower aka Kudzu
http://www.kudzuworld.com/blogs/tech/20080722A.aspx
Hmmm. So how would things like the following be done?

begin
FInStream.Seek(0, soFromBeginning);
SetLength(lString, FInStream.Size);
FInStream.Read(lstring[1], FInStream.Size);
end;

Thanks,

--
Warm Regards,

Lee
Xavier
2008-07-22 11:05:49 UTC
Permalink
Post by Lee Jenkins
Post by Chad Z. Hower aka Kudzu
http://www.kudzuworld.com/blogs/tech/20080722A.aspx
Hmmm. So how would things like the following be done?
begin
FInStream.Seek(0, soFromBeginning);
SetLength(lString, FInStream.Size);
FInStream.Read(lstring[1], FInStream.Size);
end;
Urgh. I hope you understand how bad this is and that it's just a dirty
example. Issue #1 is reading a full stream into a string. RAM be damned.
Issue #2 is reading the Size property once for allocation and once for
the read; files can change size between calls to GetSize[1]. I trust I
don't need to write what happens if you try to read 1MB into a 512KB
string.

But if you must:

FInStream.Seek(0, soFromBeginning);
SetLength(lString, FInStream.Size * SizeOf(Char));
FInStream.Read(lstring[1], FInStream.Size);

Storing the size in a variable would be safer though not exactly better.

[1] for file streams it's even expensive; TStream.GetSize calls Seek
*3 times* for *each* call. And it is SLOW. When I was first learning
streams I made the mistake of referring to Size and Position often (both
of them call Seek). With their use removed the only bottleneck was the
HD throughput.
Henrick Hellström
2008-07-22 11:18:26 UTC
Permalink
Post by Lee Jenkins
FInStream.Seek(0, soFromBeginning);
SetLength(lString, FInStream.Size * SizeOf(Char));
FInStream.Read(lstring[1], FInStream.Size);
Better:

SetLength(lString, FInStream.Seek(0, soFromEnd) div SizeOf(Char));
FInStream.Seek(0, soFromBeginning);
FInStream.Read(Pointer(lString)^, Length(lString)*SizeOf(Char));

Firstly, such code should account for the possibility that the stream is
empty. Referencing lString[1] when Length(lString) = 0 is an error.
Referencing zero bytes from nil^ is however perfectly legal.

Secondly, getting your * and div operators right might help. ;)

Thirdly, adjusting the code in accordance with your Size and Position
remarks wasn't that hard. :)
Maël Hörz
2008-07-22 13:31:10 UTC
Permalink
Post by Xavier
Urgh. I hope you understand how bad this is and that it's just a dirty
example. Issue #1 is reading a full stream into a string. RAM be damned.
Issue #2 is reading the Size property once for allocation and once for
the read; files can change size between calls to GetSize[1].
When reading a file you should lock it anyway to get consistent data.
fmShareDenyWrite will ensure nothing goes wrong and the size stays the same.

BTW I never had bottlenecks due to the use of position or size
properties, maybe you read chunks which are too small in size?
Lee Jenkins
2008-07-22 14:16:45 UTC
Permalink
Post by Lee Jenkins
FInStream.Seek(0, soFromBeginning);
SetLength(lString, FInStream.Size * SizeOf(Char));
FInStream.Read(lstring[1], FInStream.Size);
Storing the size in a variable would be safer though not exactly better.
[1] for file streams it's even expensive; TStream.GetSize calls Seek
*3 times* for *each* call. And it is SLOW. When I was first learning
streams I made the mistake of referring to Size and Position often (both
of them call Seek). With their use removed the only bottleneck was the
HD throughput.
Great stuff. Thank you,

--
Warm Regards,

Lee
Chad Z. Hower aka Kudzu
2008-07-22 13:36:41 UTC
Permalink
Post by Lee Jenkins
Hmmm. So how would things like the following be done?
begin
FInStream.Seek(0, soFromBeginning);
SetLength(lString, FInStream.Size);
FInStream.Read(lstring[1], FInStream.Size);
end;
Aside from the comments posted about your general technique, you should not
do this with Unicode anyways. This assumes a read of "raw" UTF-16. With
Unicode anytime you convert from binary to string or the other way you
should always specify (or in some cases use a function that can determine
for you) the source (or destination if writing) encoding. Tiburon like .NET
will no doubt have many such functions.

Example 1

Expect that Streams that allow reading/writing using strings will have
optional parameters to specify encoding type of the binary, and if none will
default to ANSI.

Example 2

Expect that there will be conversion routines - i.e. the ability to convert
strings to binary streams/arrays using ANSI, UTF-8, and more.

The key point is that any time you go from string to binary you should never
read the memory directly anymore, but instead use a function to do the
conversion and vice versa.
--
Keep up to date - read the IntraWeb blog!
http://www.atozed.com/intraweb/blog/
Lee Jenkins
2008-07-22 14:14:57 UTC
Permalink
Post by Chad Z. Hower aka Kudzu
Post by Lee Jenkins
Hmmm. So how would things like the following be done?
begin
FInStream.Seek(0, soFromBeginning);
SetLength(lString, FInStream.Size);
FInStream.Read(lstring[1], FInStream.Size);
end;
Aside from the comments posted about your general technique, you should not
do this with Unicode anyways. This assumes a read of "raw" UTF-16. With
Unicode anytime you convert from binary to string or the other way you
should always specify (or in some cases use a function that can determine
for you) the source (or destination if writing) encoding. Tiburon like .NET
will no doubt have many such functions.
That would be nice. Like TStringBuilder...?
Post by Chad Z. Hower aka Kudzu
Example 1
Expect that Streams that allow reading/writing using strings will have
optional parameters to specify encoding type of the binary, and if none will
default to ANSI.
Example 2
Expect that there will be conversion routines - i.e. the ability to convert
strings to binary streams/arrays using ANSI, UTF-8, and more.
The key point is that any time you go from string to binary you should never
read the memory directly anymore, but instead use a function to do the
conversion and vice versa.
What methods are available to do this now? I know I can create a TStringList
for instance and use its load/save from/to stream methods, but are there
specific methods or classes to convert between these now?


--
Warm Regards,

Lee
Chad Z. Hower aka Kudzu
2008-07-22 14:57:18 UTC
Permalink
Post by Lee Jenkins
Post by Chad Z. Hower aka Kudzu
Unicode anytime you convert from binary to string or the other way you
should always specify (or in some cases use a function that can determine
for you) the source (or destination if writing) encoding. Tiburon like
.NET will no doubt have many such functions.
That would be nice. Like TStringBuilder...?
No. StringBuilder is for working with text in strings - because in .NET
strings cannot be changed. So StringBuilder is a class you can use when you
want to do string manipulations and extract a string when done.

Imagine something like:

MyBytes = ASCIIEncoding.GetBytes(MyString);

That's how .NET does it. Tiburon might do something like that, or maybe
something like:

MyBytes := UnicodeStringToASCIIBytes(MyString);

or maybe it takes a parameter that specifies what encoding to use: ASCII,
ANSI, UTF-8, UTF-32, etc.
Post by Lee Jenkins
What methods are available to do this now? I know I can create a
TStringList for instance and use its load/save from/to stream methods, but
are there specific methods or classes to convert between these now?
Now as in <= Delphi 2007 or Tiburon? Surely Tiburon has methods, but I can't
discuss them except in a generic way as I did above.

In <= Delphi 2007 there isn't Unicode so not really.... There are some
sources here and there that do some encodings from widestrings etc. Indy did
some too, but not Unicode until Tiburon.
--
Keep up to date - read the IntraWeb blog!
http://www.atozed.com/intraweb/blog/
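
For the "what is available now" question, one option that already exists in the RTL is to read the raw bytes into an explicit AnsiString and decode with UTF8Decode. This is a sketch under the assumption that the stream holds UTF-8 text; Utf8StreamToWide is a hypothetical name.

```pascal
program Utf8StreamDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Reads the whole stream as UTF-8 bytes and decodes it to a WideString.
// Assumes the stream content really is UTF-8 and fits in memory.
function Utf8StreamToWide(Stream: TStream): WideString;
var
  Raw: AnsiString;    // explicit byte string: one byte per element
  Count: Integer;
begin
  Count := Integer(Stream.Size);          // read Size once
  Stream.Position := 0;
  SetLength(Raw, Count);
  if Count > 0 then
    Stream.ReadBuffer(Pointer(Raw)^, Count);
  Result := UTF8Decode(Raw);
end;

const
  Utf8Bytes: array[0..1] of Byte = ($C3, $BC);   // U+00FC encoded as UTF-8
var
  Stream: TMemoryStream;
begin
  Stream := TMemoryStream.Create;
  try
    Stream.WriteBuffer(Utf8Bytes, SizeOf(Utf8Bytes));
    Writeln(Length(Utf8StreamToWide(Stream)));   // 1: two bytes, one character
  finally
    Stream.Free;
  end;
end.
```

Declaring the buffer as AnsiString rather than string keeps the byte count and the element count equal even if the meaning of string changes.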
Lee Jenkins
2008-07-22 17:17:11 UTC
Permalink
Post by Chad Z. Hower aka Kudzu
In <= Delphi 2007 there isn't Unicode so not really.... There are some
sources here and there that do some encodings from widestrings etc. Indy did
some too, but not Unicode until Tiburon.
Sorry Chad, should have been more specific. I was referring to saving a stream
to a string. I hadn't run across any objects to do that (other than
TStringList, etc). I was just curious.

Thanks,

--
Warm Regards,

Lee
Chad Z. Hower aka Kudzu
2008-07-22 17:27:37 UTC
Permalink
Post by Lee Jenkins
Sorry Chad, should have been more specific. I was referring to saving a
stream to a string. I hadn't run across any objects to do that (other
than TStringList, etc). I was just curious.
In such existing methods, in the absence of new parameters or new
overloads, I would expect them to translate to and from ANSI / ASCII
by default. So for English and possibly most languages with Latin-based
characters, they generally should work without changes.
--
Keep up to date - read the IntraWeb blog!
http://www.atozed.com/intraweb/blog/
Remy Lebeau (TeamB)
2008-07-22 17:40:44 UTC
Permalink
Surely Tiburon has methods, but I can't discuss them except in a generic way
as I did above.
The new AEncoding parameter of the LoadFrom...() and SaveTo...() methods, such
as in TStrings, and the new TEncoding class, have already been publicly
blogged about by CodeGear employees.


Gambit
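
A sketch of what stream-to-string code might look like with the TEncoding class mentioned above. The calls used here (TEncoding.UTF8, GetBytes, GetString, the TBytes type) are taken from the SysUtils unit as shipped in later Delphi releases, not from anything confirmed in this thread, so treat the signatures as an assumption at the time of posting. StreamToString is a hypothetical helper.

```pascal
program EncodingDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Reads the whole stream into a byte array, then decodes it with an
// explicitly named encoding instead of reinterpreting the memory.
function StreamToString(Stream: TStream; Encoding: TEncoding): string;
var
  Bytes: TBytes;
  Count: Integer;
begin
  Result := '';
  Count := Integer(Stream.Size);          // read Size once
  if Count = 0 then
    Exit;
  Stream.Position := 0;
  SetLength(Bytes, Count);
  Stream.ReadBuffer(Bytes[0], Count);
  Result := Encoding.GetString(Bytes);
end;

var
  Stream: TMemoryStream;
  Raw: TBytes;
  Text: string;
begin
  Text := 'Gr'#$00FC#$00DF'e';            // 5 characters, 2 of them non-ASCII
  Stream := TMemoryStream.Create;
  try
    Raw := TEncoding.UTF8.GetBytes(Text);
    Stream.WriteBuffer(Raw[0], Length(Raw));
    Writeln(Length(Raw));                                      // 7 bytes
    Writeln(Length(StreamToString(Stream, TEncoding.UTF8)));   // 5 characters
  finally
    Stream.Free;
  end;
end.
```

The byte count and the character count differ, which is exactly why the allocation should be sized in bytes and the decoding left to the encoding object.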
Lee Grissom
2008-07-22 23:12:02 UTC
Permalink
What Font is the VCL defaulting to? Can anyone give advice on what
fonts to use/avoid?
--
Lee