| From: kevin <kevin at cybercolloids.net>
| Date: Thu, 19 Aug 2004 09:07:17 +0100
|
| Yes, I specify
| <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
| <meta http-equiv="Content-Language" content="kw" />
| Much to my surprise w3c has a content language for Cornish kw=kernewek
| It seems to work OK in Mozilla and Konqueror. To continue the pedantic
| note .... what is the correct code to use for "small t with cedilla"?
| if not &#355;

now that I've had a chance to think about it and do a
bit of checking, &#355; is correct no matter what the
page's encoding (charset) is set to. (so my previous
posting was slightly wrong, apologies!)

the page's charset specifies the encoding of the page's
own _source_. the &#<dec>; notation (strictly a numeric
character reference, not an entity) is simply a source
notation for specifying an arbitrary codepoint (UCS
character) to be displayed by the remote client browser.
that notation is written in (in this case) UTF-8, but
the meaning of the &#<dec>; does not change, not even
if the page's source charset happens to be 16-bit
Unicode (whose proper name is UTF-16) --- I've tested
the &#355; notation in UTF-16 in Firefox and Opera.
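
a rough way to check the point above without a browser: the
sketch below uses Python's html.unescape() as a stand-in for
the browser's handling of &#<dec>; (that substitution, and the
names `source' and `render', are assumptions for illustration
only). the same source text is pushed through several charsets
and resolved afterwards.

```python
import html

# the page source: every character here is plain ASCII
source = "conve&#355;haz"

def render(source_text, charset):
    """Encode the source in a charset, decode it again, then resolve references."""
    raw = source_text.encode(charset)   # bytes as stored or sent
    decoded = raw.decode(charset)       # the client undoes the charset first
    return html.unescape(decoded)       # only then is &#355; interpreted

# hypothetical selection of charsets; any charset that can hold ASCII works
results = {cs: render(source, cs) for cs in ("utf-8", "utf-16", "iso-8859-1", "ascii")}

for charset, text in results.items():
    print(charset, [f"U+{ord(c):04X}" for c in text])
```

every charset gives the same codepoints, including U+0163, even
though the stored bytes differ (and even though ISO-8859-1 and
US-ASCII cannot store U+0163 directly).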

ignoring arcane technical/political debates, the UCS
(Universal Character Set, ISO standard 10646) is simply
a list of agreed character names and values. (the names
and values are both unique.) t-cedilla has been assigned
the value 163 (hex) and name “LATIN SMALL LETTER T WITH
CEDILLA”. UCS codepoints are typically denoted in the
U+<hex> notation, so t-cedilla is U+0163.
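
the name/value pairing can be looked up from Python's
unicodedata module (which ships the Unicode character
database; the variable names here are just for illustration):

```python
import unicodedata

ch = chr(0x0163)                      # the value, 163 hex (355 decimal)
name = unicodedata.name(ch)           # value -> name
value = f"U+{ord(ch):04X}"            # the usual U+<hex> notation
back = unicodedata.lookup(name)       # name -> value, the reverse direction

print(value, name)
```

since names and values are both unique, the lookup works in
either direction.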

encodings specify how (a subset of) UCS values are
stored. UTF-8 says U+0163 is the two-byte sequence
C5 A3 (hex). see the utf-8(7) man page for details
on how U+0163 is encoded as C5 A3. ISO-8859-2 says
that _same_ character, U+0163, is the one byte FE.
code page 852 (CP852) says U+0163 is the different
one byte EE. US-ASCII (ISO standard 646), ISO-8859-1,
and ISO-8859-15 all do _not_ have any storage for
U+0163 at all (it is out of domain). UTF-16 says
U+0163 is the one 16-bit word 0x0163, and in UTF-32
it is the one 32-bit word 0x00000163.
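
those byte values can be reproduced with Python's built-in
codecs. this is only a sketch: the helper name `stored_as' is
made up, and the big-endian variants of UTF-16 and UTF-32 are
an assumption so the bytes print in the same order as the word
values above (and without a byte-order mark).

```python
ch = "\u0163"  # LATIN SMALL LETTER T WITH CEDILLA

def stored_as(char, encoding):
    """Return the stored bytes as hex, or None if the encoding has no slot for char."""
    try:
        return " ".join(f"{b:02X}" for b in char.encode(encoding))
    except UnicodeEncodeError:
        return None  # out of domain for this encoding

encodings = ["utf-8", "iso-8859-2", "cp852", "ascii",
             "iso-8859-1", "iso-8859-15", "utf-16-be", "utf-32-be"]

table = {enc: stored_as(ch, enc) for enc in encodings}

for enc, stored in table.items():
    print(f"{enc:12} {stored if stored is not None else '(no storage)'}")
```

same character every time; only the stored bytes change, or
there is no storage at all.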

so &#355; is just a notation like U+0163. and the parts
of that notation --- the ampersand, the hash, the digit 3,
the two digits 5, and the semicolon --- are themselves
characters in the page's charset. what the client's
browser reads is those six characters, which it then
understands to mean U+0163. it then looks up how to
display U+0163 in the current language (kw (Cornish)),
and presents (draws) the glyph ţ.
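
the "six characters in, one codepoint out" step can be
sketched by hand. this is a toy that handles only the decimal
&#<dec>; form, not what any real browser does internally; the
function name `resolve_decimal_reference' is hypothetical.

```python
import re

reference = "&#355;"   # ampersand, hash, 3, 5, 5, semicolon

def resolve_decimal_reference(text):
    """Return the character a decimal &#<dec>; reference names, or None if text is not one."""
    match = re.fullmatch(r"&#([0-9]+);", text)
    if match is None:
        return None
    return chr(int(match.group(1)))   # decimal 355 is hex 163

ch = resolve_decimal_reference(reference)
print(len(reference), f"U+{ord(ch):04X}")
```

choosing a font and drawing the glyph is a separate step that
this sketch does not attempt.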

easy. ;-)

cheers!
-blf-

p.s. a great on-line resource for UCS names and values,
and some notations (like &#355;) and other details,
is the Letter Database at the Eesti Keele Instituut
(Institute of the Estonian Language):
http://www.eki.ee/letter
the `gucharmap' tool (note that is spelled _with_
a `u' (U+0075, “LATIN SMALL LETTER U”)) can also
be helpful.

| On Wednesday 18 August 2004 22:38, Brian Foster wrote:
| | From: kevin <kevin at cybercolloids.net>
| | Date: Wed, 18 Aug 2004 11:21:50 +0100
| |[ ... ]
| | Cornish uses some accents including t-cedilla in words such as
| |
| | conveţhaz - Verb, to understand
| |
| | I can write this using codes in UTF-8 like conve&#355;haz [ ... ]
|
| uh, not exactly. “&#355;” does not (cannot)
| represent literal UTF-8 per se. (it _is_ the
| UCS codepoint value for U+0163, which is
| “LATIN SMALL LETTER T WITH CEDILLA”, which
| apparently is the character you want.)
|
|[... remainder of my reply was slightly confused ;-( -blf ...]
--
«How many surrealists does it take to | Brian Foster Montpellier,
change a lightbulb? Three. One calms | blf at utvinternet.ie FRANCE
the warthog, and two fill the bathtub | Stop E$$o (ExxonMobile)!
with brightly-colored machine tools.» | http://www.stopesso.com