LINUX.IE, website of the Irish Linux Users' Group
CORRECTION - Re: [ILUG] Editing unicode text files.

Francis Daly francisdaly at gmail.com
Mon Feb 19 22:15:36 GMT 2007


On 19/02/07, Brian Foster <blf at blf.utvinternet.ie> wrote:
>   | Date: Sun, 18 Feb 2007 14:29:58 +0000
>   | From: "Francis Daly" <francisdaly at gmail.com>
>   |

>  good essay!
>  just one pedantic semi-correction ...

Thanks. And pedantry is good when you're trying to be complete, and
especially so when you're trying to be correct. So also thanks for
pointing out this bit.

>   | On the conversions, utf-8 and ucs-2 are reversible in both directions
>   | since they are just encodings of unicode [ ... ]

>  UCS-2 can only roundtrip if
>  all the characters are in the first 2^16 UCS
>  codepoints (U+0000..U+FFFF).  (and that is also
>  why UCS-2 is obsolete, replaced by UTF-16.)

Yes, you're right.

ascii defines 128 characters that fit in one octet. utf-8 is identical to
ascii for those 128 characters, and uses only the remaining 128 octet
values (the ones with the high bit set) to encode "all" other codepoints --
where "all" is "enough to cover all of Unicode (which is 21 bits)".
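A quick way to see this, as a minimal sketch in Python (the sample strings and the helper name utf8_octets are just picked for illustration):

```python
# For text in the ascii range, the ascii and utf-8 encodings give the same octets.
plain = "unaccented english"
same_octets = plain.encode("ascii") == plain.encode("utf-8")

def utf8_octets(ch):
    """Return the utf-8 encoding of one character as a list of octet values."""
    return list(ch.encode("utf-8"))

# Outside the ascii range, every octet utf-8 emits has the high bit set.
e_acute = utf8_octets("\u00e9")     # U+00E9, two octets
euro = utf8_octets("\u20ac")        # U+20AC, three octets
clef = utf8_octets("\U0001d11e")    # U+1D11E, four octets (above U+FFFF)

high_bit_only = all(b >= 0x80 for b in e_acute + euro + clef)
```

Because the non-ascii octets never fall in the 0..127 range, an ascii-only tool can pass utf-8 through without mistaking part of a multi-octet sequence for an ascii character.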

ucs-2 defines 62k characters that fit in two octets. utf-16 is identical
to ucs-2 for those 62k characters, and uses only the remaining 2k
two-octet values (the surrogates, taken in pairs) to encode "all" other
codepoints -- where "all" also covers the 21 bits needed for all of Unicode.
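The same comparison can be sketched for ucs-2 and utf-16. Python ships no ucs-2 codec, so ucs2_encode below is a hypothetical stand-in written for this example (big-endian, one 16-bit unit per codepoint), not a library function:

```python
import struct

def ucs2_encode(text):
    """Hypothetical big-endian ucs-2 encoder: one 16-bit unit per codepoint."""
    units = []
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # ucs-2 has no way to represent anything above U+FFFF.
            raise ValueError(f"U+{cp:04X} does not fit in ucs-2")
        units.append(cp)
    return struct.pack(f">{len(units)}H", *units)

# Inside the Basic Multilingual Plane the two encodings agree octet for octet.
bmp_text = "caf\u00e9 \u20ac"
same_in_bmp = ucs2_encode(bmp_text) == bmp_text.encode("utf-16-be")

# Above U+FFFF, utf-16 spends two 16-bit units (a surrogate pair) on one codepoint.
clef = "\U0001d11e"
pair = struct.unpack(">2H", clef.encode("utf-16-be"))
pair_is_surrogates = 0xD800 <= pair[0] <= 0xDBFF and 0xDC00 <= pair[1] <= 0xDFFF
```

Calling ucs2_encode(clef) raises ValueError, which is the "disaster" case: the codepoint simply has no ucs-2 form.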

So ascii does not cover everything, and utf-8 is not a synonym for
ascii; but if you stay in the (limited) ascii range of "unaccented
english", their encodings are identical.

And ucs-2 does not cover everything, and utf-16 is not a synonym for
ucs-2; but if you stay in the (limited, but less limited than ascii)
ucs-2 range of the "Basic Multilingual Plane", their encodings are
identical.

>  in practice, most characters/codepoints are in
>  that range, but IIRC, Klingon (as an example)
>  is not.  if yer text did contain Klingon,
>  converting to UCS-2 would be a disaster.

Also true.

To make a safe roundtrip for a particular codepoint, the thing you're
tripping to must be able to encode it; for a random codepoint, that
means "an encoding that covers everything"; but if you already know
the limits of possible initial codepoints, you may get away with an
incomplete encoding.
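As a rough illustration of that rule, here is a small Python check; the function name roundtrips and the sample strings are made up for this sketch:

```python
def roundtrips(text, encoding):
    """True if text survives encode-then-decode in the given encoding."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeError:
        # The encoding cannot represent at least one codepoint in text.
        return False

results = {
    ("hello", "ascii"): roundtrips("hello", "ascii"),
    ("caf\u00e9", "ascii"): roundtrips("caf\u00e9", "ascii"),
    ("caf\u00e9", "latin-1"): roundtrips("caf\u00e9", "latin-1"),
    ("\u20ac", "latin-1"): roundtrips("\u20ac", "latin-1"),
    ("\u20ac", "utf-8"): roundtrips("\u20ac", "utf-8"),
}

# Forcing the conversion anyway loses information: the e-acute becomes "?".
lossy = "caf\u00e9".encode("ascii", errors="replace")
```

An incomplete encoding such as latin-1 is fine for text you already know stays within its range, and fails only when a codepoint from outside turns up.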

>  for practical purposes, UTF-16 and UCS-4 (also
>  called UTF-32) also both roundtrip.

By the same analogy as above, there probably is a difference between
UCS-4 and UTF-32; but it will only kick in many bits above the 21 that
Unicode uses. So for this discussion, and for anything we're ever
likely to care about, they're the same.

And they both cover "everything".
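A short Python sketch of "covers everything", assuming the big-endian codec variants so that no byte-order mark is added; the sample string is arbitrary apart from ending on the highest Unicode codepoint:

```python
# The highest Unicode codepoint, U+10FFFF, needs exactly 21 bits.
top = 0x10FFFF
bits_needed = top.bit_length()

# One ascii character, two BMP characters, one above U+FFFF, and the top codepoint.
sample = "A\u00e9\u20ac\U0001d11e" + chr(top)

# Each of the complete encodings gets the whole sample back unchanged.
all_roundtrip = all(
    sample.encode(enc).decode(enc) == sample
    for enc in ("utf-8", "utf-16-be", "utf-32-be")
)

# utf-32 is fixed width: four octets for every codepoint, whatever its value.
widths = [len(ch.encode("utf-32-be")) for ch in sample]
```

The trade-off is size: utf-32 spends four octets even on "A", where utf-8 needs one and utf-16 needs two.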

Cheers,

	f


