Re: [ILUG] XML & UTF Encoding

From: Matthew French (mfrench42 at domain yahoo.co.uk)
Date: Thu 25 Apr 2002 - 15:01:16 IST


David Neary had a migraine:
> > You are correct - 0xC3A9 is the UTF-8 encoded version of the Unicode
> > character 0xE9.
>
> My head hurts.
>
> OK - so there exists a bijective mapping from Unicode to UTF-8,
> but they're not the same thing. I can live with that. I wasn't
> aware there was a difference.

Think of Unicode as the standard and UTF-8 as one implementation of it. In
other words, Unicode assigns every character a unique number (its code
point), whereas UTF-8 is just one way of encoding that number as bytes.
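
As a small illustration of that distinction (a Python sketch, with variable
names of my own choosing), the character from David's example has the number
0xE9, and its UTF-8 form is the two bytes 0xC3 0xA9:

```python
# Code point vs. encoded bytes for the character discussed above.
ch = "\u00e9"                      # LATIN SMALL LETTER E WITH ACUTE

code_point = ord(ch)               # the number Unicode assigns
utf8_bytes = ch.encode("utf-8")    # one way of storing that number

print(hex(code_point))             # 0xe9
print(utf8_bytes.hex())            # c3a9

# Decoding reverses the mapping, so nothing is lost either way.
assert utf8_bytes.decode("utf-8") == ch
```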

Unicode numbers run up to 0x10FFFF, which needs 21 bits, so the obvious
fixed-width encoding would spend four bytes on every character. For most text
that would be a complete waste of space. So UTF-8 ensures that the most
commonly used characters (7-bit ASCII) occupy just 1 byte, less common
characters occupy 2 bytes, and so on up to a maximum of 4 bytes.
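
To make the "1 byte, 2 bytes, and so on" rule concrete, here is a rough
Python sketch of the bit packing. The helper name utf8_encode is made up for
illustration; in practice you would just call str.encode("utf-8"). It assumes
UTF-8 as standardised today, which stops at four bytes (numbers up to
0x10FFFF).

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative helper only)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp < 0x80:
        # Up to 7 bits: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # Up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # Up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # Up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# One sample from each length class, checked against Python's own encoder.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    encoded = utf8_encode(cp)
    assert encoded == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {encoded.hex()} ({len(encoded)} bytes)")
```

The leading bits of the first byte say how many bytes follow, and every
continuation byte starts with 10, which is why 0xE9 comes out as 0xC3 0xA9.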

There is also UTF-16, which stores most characters in common use as a single
16-bit unit (and the rest as a pair of units), if simpler processing is more
important than size. There are also many other alternatives if one gets bored.
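
For a feel of the size trade-off, this Python sketch counts the bytes each
encoding needs for a few sample strings. The helper encoded_sizes and the
samples are my own; the "-le" codec variants are used only so that no byte
order mark is added to the counts.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte length of text in three Unicode encodings."""
    return {enc: len(text.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}


samples = {
    "ASCII": "hello",
    "accented": "caf\u00e9",
    "CJK": "\u65e5\u672c\u8a9e",
    "emoji": "\U0001F600",
}

for label, text in samples.items():
    print(label, encoded_sizes(text))
```

ASCII-heavy text is smallest in UTF-8, while the CJK sample is smaller in
UTF-16, so which one "wastes" space depends on what you are storing.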

See the following link for some more information:
http://czyborra.com/utf/

- Matthew



