| Date: Mon, 19 Feb 2007 22:15:36 +0000
| From: "Francis Daly" <francisdaly at gmail.com>
|[ ... ]
| > for practical purposes, UTF-16 and UCS-4 (also
| > called UTF-32) also both roundtrip.
|
| By the same analogy [ difference between UCS-2 and UTF-16 ...],
| there probably is a difference between UCS-4 and UTF-32; but it
| will only kick in many bits above the 21 that Unicode uses.
| So for this discussion, and for anything we're ever likely to
| care about, they're the same.
correct, but this is now diving into politics,
and in particular, Redmond vs. RoW: UTF-16 can
only encode the initial 2^21 UCS codepoints (what
I'll call the “Unicode range”) but none larger;
anything more is technically impossible.† in contrast, UCS-4
and UTF-32 (and UTF-8) can encode everything (the
complete range of 2^31 UCS/ISO-10646 codepoints).
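
to make the "can encode everything" claim concrete, here is a minimal Python sketch of the original one-to-six byte UTF-8 layout, which has room for 31 bits. the helper name `encode_utf8_31bit` is made up for this illustration, and it skips all validity checks beyond the range test. note that the current UTF-8 definition (RFC 3629) and Python's own codec both stop at U+10FFFF, so this is a sketch of the older scheme, not of any library's behaviour.

```python
def encode_utf8_31bit(cp):
    """Encode one codepoint using the original 1-to-6 byte UTF-8 layout.

    Hypothetical helper for illustration only; it does not reject
    surrogates or apply any other modern validity checks.
    """
    if not 0 <= cp < 2**31:
        raise ValueError("codepoint outside the 31-bit UCS range")
    if cp < 0x80:
        return bytes([cp])  # plain ASCII, one byte
    # (total bytes, exclusive upper limit, lead-byte marker)
    layouts = (
        (2, 0x800, 0xC0),
        (3, 0x10000, 0xE0),
        (4, 0x200000, 0xF0),
        (5, 0x4000000, 0xF8),
        (6, 0x80000000, 0xFC),
    )
    for nbytes, limit, lead in layouts:
        if cp < limit:
            break
    tail = []
    for _ in range(nbytes - 1):
        tail.append(0x80 | (cp & 0x3F))  # low six bits per continuation byte
        cp >>= 6
    # whatever bits remain go into the lead byte
    return bytes([lead | cp] + tail[::-1])


print(encode_utf8_31bit(0x20AC).hex())      # same bytes as "€".encode("utf-8")
print(encode_utf8_31bit(0x10FFFF).hex())    # top of the Unicode range
print(encode_utf8_31bit(0x7FFFFFFF).hex())  # top of the 31-bit UCS range
```

up to U+10FFFF the result agrees with the standard codec; above that, only the five and six byte forms (which modern decoders reject) can carry the value.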
UCS-4 and UTF-32 are bit-for-bit identical.
so why the two names?
in a word, M$. M$ (at least) is (or at least was,
I'm not sure what the current status is) pushing a
definition of UTF-8 which (1) is used to encode
only the 2^21 Unicode range; and (2) must start
with a BOMb (byte-order mark, U+FEFF). (in M$'s
world, BOMb-less UTF-8 is called UTF-8N, but IIRC,
it is still used only for the 2^21 Unicode range
despite being capable of encoding the full 2^31.)
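
for anyone who wants to see the BOMb difference at byte level, here is a small Python sketch. it uses the standard library's `utf-8` and `utf-8-sig` codecs; the sample string is arbitrary, and nothing here says anything about what any particular vendor's tools actually emit.

```python
import codecs

text = "caf\u00e9"                    # arbitrary sample string
plain = text.encode("utf-8")          # no BOM
with_bom = text.encode("utf-8-sig")   # same bytes with EF BB BF in front

# the only difference is the three-byte prefix
print(with_bom[:3] == codecs.BOM_UTF8)
print(with_bom[3:] == plain)

# "utf-8-sig" strips a leading BOM when decoding; plain "utf-8"
# keeps it in the text as the character U+FEFF
print(with_bom.decode("utf-8-sig") == text)
print(with_bom.decode("utf-8") == "\ufeff" + text)
```

decoding with `utf-8-sig` is the forgiving choice when you don't know which flavour you were handed, since it accepts input with or without the prefix.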
similarly, M$ is pushing a definition of UTF-32
which is UCS-4 but used to encode only the 2^21
Unicode range (and, I assume, starts with a BOMb).
those are not the ISO definitions. (however, I
vaguely recall they have crept into the Unicode
Consortium's terminology?)
having said that, except for the BOMb issue, it
doesn't really matter: ISO has agreed to not
define codepoints larger than Unicode's 2^21 cutoff
(actually, the last codepoint is U+10FFFF, which
makes 17 × 2^16 = 1,114,112 codepoints, a little over
2^20, but that's neither here nor there).
upshot is UCS-4 and UTF-32 are indeed “the same”.
both do indeed “cover everything”. so does UTF-8,
and (in practice) UTF-16, but not UCS-2.
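
a quick way to convince yourself of the roundtrip claim is the Python sketch below. the helper names `roundtrips` and `fits_ucs2` are invented for this example, and `fits_ucs2` is only the crude "does it fit in 16 bits" test, which is an assumption about what matters for UCS-2 here.

```python
def roundtrips(text, encoding):
    """True if encoding and then decoding gives the original text back."""
    return text.encode(encoding).decode(encoding) == text


def fits_ucs2(ch):
    """Crude check: UCS-2 has exactly one 16-bit unit per codepoint."""
    return ord(ch) <= 0xFFFF


# one sample from ASCII, one from the rest of the 16-bit range,
# one beyond it, and the very last Unicode codepoint
samples = ["A", "\u20ac", "\U0001F427", chr(0x10FFFF)]

for ch in samples:
    ok = all(roundtrips(ch, enc) for enc in ("utf-8", "utf-16-le", "utf-32-le"))
    print(f"U+{ord(ch):04X}", "roundtrips:", ok, "fits UCS-2:", fits_ucs2(ch))

# size of the Unicode range: 17 planes of 2^16 codepoints each
print(0x10FFFF + 1 == 17 * 2**16)
```

the last two samples are the ones that separate UTF-16 from UCS-2: they roundtrip through all three UTF encodings but need more than one 16-bit unit.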
cheers!
-blf-
† you could, I suppose, do an extension trick
similar to the one that builds UTF-16 from UCS-2
(surrogate pairs), but no such extension has been
defined. ergo, it's (currently) technically
impossible to encode codepoints larger than
U+10FFFF (the "2^21") in UTF-16.
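
the arithmetic behind that limit can be sketched in a few lines of Python. the function name `to_surrogates` is hypothetical; the constants are the standard surrogate ranges. a pair carries 10 + 10 = 20 bits on top of the base 0x10000, so the largest value reachable is 0x10000 + 2^20 - 1 = 0x10FFFF.

```python
def to_surrogates(cp):
    """Split a codepoint above U+FFFF into a UTF-16 surrogate pair.

    Hypothetical helper for illustration; returns (high, low) as ints.
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need (or fit) a pair")
    offset = cp - 0x10000               # at most 20 bits
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


high, low = to_surrogates(0x10FFFF)
print(hex(high), hex(low))              # both surrogates at their maximum

# cross-check one value against Python's own UTF-16 codec
pair = to_surrogates(0x1F427)
as_bytes = b"".join(unit.to_bytes(2, "big") for unit in pair)
print(as_bytes == chr(0x1F427).encode("utf-16-be"))
```

with both halves already at their maximum for U+10FFFF, there is simply no bit pattern left for anything larger unless someone defines a new extension.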
| And they both cover "everything".
|
| Cheers,
|
| f
--
Experienced (>25 yrs) kernel/software Eng: | Brian Foster Montpellier,
• Unix, embedded, &tc; • Linux; • doc; | blf at utvinternet.ie FRANCE
• IDL, automated testing, process, &tc. | Stop E$$o (ExxonMobile)!
Résumé (CV) http://www.blf.utvinternet.ie | http://www.stopesso.com