On 19/02/07, Brian Foster <blf at blf.utvinternet.ie> wrote:
> | Date: Sun, 18 Feb 2007 14:29:58 +0000
> | From: "Francis Daly" <francisdaly at gmail.com>
> |
> good essay!
> just one pedantic semi-correction ...
Thanks. And pedantry is good when you're trying to be complete, and
especially so when you're trying to be correct. So also thanks for
pointing out this bit.
> | On the conversions, utf-8 and ucs-2 are reversible in both directions
> | since they are just encodings of unicode [ ... ]
> UCS-2 can only roundtrip if
> all the characters are in the first 2^16 UCS
> codepoints (U+0000..U+FFFF). (and that is also
> why UCS-2 is obsolete, replaced by UTF-16.)
Yes, you're right.
ascii defines 128 characters, which fit in an octet. utf-8 is identical
to ascii for those 128 characters, and uses only the remaining 128
octet values, in multi-octet sequences, to encode "all" other
codepoints -- where "all" is "enough to cover all of Unicode (which is
21 bits)".
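A quick way to see this for yourself is the sketch below. It assumes
Python 3 and its built-in codecs; the sample strings and variable names
are just mine for illustration.

```python
# Text that stays inside the 128 ascii characters gives the same
# octets whether you call the encoding "ascii" or "utf-8".
text = "plain english"
ascii_octets = text.encode("ascii")
utf8_octets = text.encode("utf-8")
same_octets = ascii_octets == utf8_octets

# A codepoint outside ascii (here U+20AC, the euro sign) becomes a
# multi-octet sequence, and every octet in it is 0x80 or above, so it
# can never be mistaken for an ascii character.
euro_octets = "\u20ac".encode("utf-8")
all_high = all(octet >= 0x80 for octet in euro_octets)
```

So a reader that only understands ascii sees untouched text where the
input was ascii, and unfamiliar high octets everywhere else.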
ucs-2 defines 62k characters, which fit in two octets. utf-16 is
identical to ucs-2 for those 62k characters, and uses only the
remaining 2k two-octet values (the "surrogates"), taken in pairs, to
encode "all" other codepoints -- where "all" also covers the 21 bits
needed for all of Unicode.
So ascii does not cover everything, and utf-8 is not a synonym for
ascii; but if you stay in the (limited) ascii range of "unaccented
english", their encodings are identical.
And ucs-2 does not cover everything, and utf-16 is not a synonym for
ucs-2; but if you stay in the (limited, but less limited than ascii)
ucs-2 range of the "Basic Multilingual Plane", their encodings are
identical.
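The 62k/2k split and the pairing can be sketched like this. It assumes
Python 3, which has no "ucs-2" codec of its own, so "utf-16-be" stands
in for the two-octet form; surrogate_pair is a helper name I made up,
and U+1D11E (a musical clef) is just a convenient codepoint outside
the Basic Multilingual Plane.

```python
def surrogate_pair(codepoint):
    """Split a codepoint above U+FFFF into its two utf-16 code units."""
    offset = codepoint - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

# The reserved surrogate range is U+D800..U+DFFF.
surrogates = 0xE000 - 0xD800           # the "2k"
bmp_characters = 0x10000 - surrogates  # the "62k"

# Inside the BMP: one two-octet unit, same as ucs-2 would give.
euro_units = "\u20ac".encode("utf-16-be")

# Outside the BMP: two two-octet units, both from the surrogate range.
clef_units = "\U0001D11E".encode("utf-16-be")
clef_pair = surrogate_pair(0x1D11E)
```

Here clef_pair matches the two units that the codec produces, which is
all that "uses the remaining 2k" amounts to.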
> in practice, most characters/codepoints are in
> that range, but IIRC, Klingon (as an example)
> is not. if yer text did contain Klingon,
> converting to UCS-2 would be a disaster.
Also true.
To make a safe roundtrip for a particular codepoint, the thing you're
tripping to must be able to encode it; for a random codepoint, that
means "an encoding that covers everything"; but if you already know
the limits of possible initial codepoints, you may get away with an
incomplete encoding.
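As a rough sketch of that rule, assuming Python 3 (roundtrips is a
hypothetical helper of mine, not a library function):

```python
def roundtrips(text, encoding):
    """True if text survives encode-then-decode in this encoding."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeEncodeError:
        # The encoding has no way to represent some codepoint.
        return False

ascii_ok = roundtrips("hello", "ascii")             # within ascii
ascii_fail = roundtrips("caf\u00e9", "ascii")       # e-acute is not
latin1_ok = roundtrips("caf\u00e9", "latin-1")      # incomplete, but enough
latin1_fail = roundtrips("\U0001D11E", "latin-1")   # outside latin-1
utf8_ok = roundtrips("\U0001D11E", "utf-8")         # covers everything

# Forcing the conversion anyway silently damages the text.
lossy = "caf\u00e9".encode("ascii", errors="replace").decode("ascii")
```

The last line is the "disaster" case: the conversion appears to work,
but what comes back is "caf?" and the original character is gone.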
> for practical purposes, UTF-16 and UCS-4 (also
> called UTF-32) also both roundtrip.
By the same analogy as above, there probably is a difference between
UCS-4 and UTF-32; but it will only kick in many bits above the 21 that
Unicode uses. So for this discussion, and for anything we're ever
likely to care about, they're the same.
And they both cover "everything".
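For what it's worth, here is a small check of the "21 bits" and
"everything" claims. It assumes Python 3, whose chr() stops at
U+10FFFF, the last Unicode codepoint; the variable names are mine.

```python
highest = 0x10FFFF                 # last Unicode codepoint
bits_needed = highest.bit_length()

top = chr(highest)
utf32_octets = top.encode("utf-32-be")  # always four octets
utf16_octets = top.encode("utf-16-be")  # a surrogate pair
utf8_octets = top.encode("utf-8")       # the longest utf-8 form

# All three get the codepoint back unchanged.
all_roundtrip = all(
    top.encode(name).decode(name) == top
    for name in ("utf-8", "utf-16-be", "utf-32-be")
)
```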
Cheers,
f