LINUX.IE, website of the Irish Linux Users' Group

CORRECTION - Re: [ILUG] Editing unicode text files.

Brian Foster blf at blf.utvinternet.ie
Tue Feb 20 08:27:23 GMT 2007


  | Date: Mon, 19 Feb 2007 22:15:36 +0000
  | From: "Francis Daly" <francisdaly at gmail.com>
  |[ ... ]
  | >  for practical purposes, UTF-16 and UCS-4 (also
  | >  called UTF-32) also both roundtrip.
  | 
  | By the same analogy [ difference between UCS-2 and UTF-16 ...],
  | there probably is a difference between UCS-4 and UTF-32; but it
  | will only kick in many bits above the 21 that Unicode uses.
  | So for this discussion, and for anything we're ever likely to
  | care about, they're the same.

 correct, but this is now diving into politics,
 and in particular, Redmond vs. RoW:  UTF-16 can
 encode only the UCS codepoints that fit in 21 bits
 (loosely “2^21”, what I'll call the “Unicode
 range”) and nothing larger; that is technically
 impossible.†  in contrast, UCS-4 and UTF-32 (and
 UTF-8) can encode everything (the complete range
 of 2^31 UCS/ISO-10646 codepoints).
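
 to see where the UTF-16 ceiling comes from, here
 is a minimal sketch of the surrogate-pair
 arithmetic.  the names to_surrogates and MAX_UTF16
 are mine, purely for illustration; they are not
 from any library.

```python
# Surrogate-pair arithmetic as used by UTF-16.
# The names below are illustrative, not a library API.
HIGH_BASE = 0xD800   # first high (lead) surrogate
LOW_BASE = 0xDC00    # first low (trail) surrogate


def to_surrogates(cp):
    """Split a codepoint above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not encodable as a surrogate pair: %#x" % cp)
    offset = cp - 0x10000                 # a 20-bit value
    high = HIGH_BASE + (offset >> 10)     # top 10 bits
    low = LOW_BASE + (offset & 0x3FF)     # bottom 10 bits
    return high, low


# 1024 high surrogates x 1024 low surrogates, stacked above the BMP:
MAX_UTF16 = 0x10000 + 0x400 * 0x400 - 1
```

 each surrogate carries 10 bits, so a pair adds
 exactly 2^20 codepoints above U+FFFF, and there is
 nowhere to put anything beyond that.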

 UCS-4 and UTF-32 are bit-for-bit identical.
 so why the two names?

 in a word, M$.  M$ (at least) is (or at least was,
 I'm not sure what the current status is) pushing a
 definition of UTF-8 which  (1) is used to encode
 only the 2^21 Unicode range;  and  (2) must start
 with a BOMb.  (in M$'s world, BOMb-less UTF-8 is
 called UTF-8N, but IIRC, is still used only for
 the 2^21 Unicode range despite being capable of
 encoding the full 2^31.)
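
 for anyone who wants to see the BOMb difference on
 the wire, a small sketch using Python's standard
 codecs ("utf-8" writes no BOM, "utf-8-sig" writes
 one).  the sample string is arbitrary, and this
 shows only the byte-level difference, not anyone's
 official definition.

```python
import codecs

text = "caf\u00e9"                   # arbitrary sample text

plain = text.encode("utf-8")         # no BOM
signed = text.encode("utf-8-sig")    # same bytes, prefixed with EF BB BF

# The only difference between the two is the three-byte signature.
assert signed == codecs.BOM_UTF8 + plain

# "utf-8-sig" strips a leading BOM when decoding;
# plain "utf-8" keeps it as the character U+FEFF.
stripped = signed.decode("utf-8-sig")
kept = signed.decode("utf-8")
```

 a reader that expects BOMb-less UTF-8 therefore
 sees a stray U+FEFF at the start of the text.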

 similarly, M$ is pushing a definition of UTF-32
 which is UCS-4 but used to encode only the 2^21
 Unicode range (and, I assume, starts with a BOMb).

 those are not the ISO definitions.  (however, I
 vaguely recall they have crept into the Unicode
 Consortium's terminology?)

 having said that, except for the BOMb issue, it
 doesn't really matter:  ISO has agreed to not
 define codepoints larger than Unicode's 2^21 cutoff
 (actually, the cutoff is U+10FFFF, i.e. 0x110000
 codepoints, a little over 2^20, which need 21 bits
 to represent; but that's neither here nor there).
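
 the size of the U+0000..U+10FFFF range is easy to
 check directly.  a quick sketch (variable names
 are mine):

```python
import math

LAST = 0x10FFFF                  # highest codepoint in the Unicode range

count = LAST + 1                 # number of codepoints, 0x110000
planes = count // 0x10000        # planes of 65536 codepoints each
bits = LAST.bit_length()         # bits needed to represent LAST
log2_count = math.log2(count)    # size of the range as a power of two
```

 that gives 17 planes, 21 bits, and a range of
 roughly 2^20.09 codepoints, so “2^21” describes
 the bit width rather than the count.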

 upshot is UCS-4 and UTF-32 are indeed “the same”.
 both do indeed “cover everything”.  so does UTF-8,
 and (in practice) UTF-16, but not UCS-2.
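
 a minimal roundtrip check, as a sketch:  Python's
 codecs cover UTF-8, UTF-16 and UTF-32, but there
 is no stock UCS-2 codec, so fits_ucs2 below is a
 hypothetical stand-in that just tests whether every
 codepoint fits in 16 bits.  the sample strings are
 arbitrary.

```python
samples = ["A", "\u20ac", "\U0001F600", "\U0010FFFF"]


def roundtrips(s, codec):
    """True if encoding then decoding with codec gives back s."""
    return s.encode(codec).decode(codec) == s


results = {
    codec: all(roundtrips(s, codec) for s in samples)
    for codec in ("utf-8", "utf-16", "utf-32")
}


def fits_ucs2(s):
    """Hypothetical check: UCS-2 has one 16-bit unit per codepoint."""
    return all(ord(ch) <= 0xFFFF for ch in s)
```

 all three UTFs roundtrip every sample, including
 U+10FFFF; the two samples above U+FFFF are exactly
 the ones that fail fits_ucs2.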

cheers!
	-blf-

  †  you could, I suppose, do a similar extension
     trick that builds UTF-16 from UCS-2, but no
     such extension has been defined.  ergo, it's
     (currently) technically impossible to encode
     codepoints larger than U+10FFFF (the "2^21")
     in UTF-16.

  | And they both cover "everything".
  | 
  | Cheers,
  | 
  | 	f
-- 
Experienced (>25 yrs) kernel/software Eng: | Brian Foster   Montpellier,
 • Unix, embedded, &tc;  • Linux;  • doc;  | blf at utvinternet.ie   FRANCE
 • IDL, automated testing, process, &tc.   |  Stop E$$o (ExxonMobile)!
Résumé (CV) http://www.blf.utvinternet.ie  |     http://www.stopesso.com


