LINUX.IE, website of the Irish Linux Users' Group

[ILUG] PHP plus Celtic languages

Brian Foster blf at blf.utvinternet.co.uk
Fri Aug 20 00:07:54 IST 2004


  | From: kevin <kevin at cybercolloids.net>
  | Date: Thu, 19 Aug 2004 09:07:17 +0100
  | 
  | Yes, I specify
  |  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
  |  <meta http-equiv="Content-Language" content="kw" />
  | Much to my surprise w3c has a content language for Cornish kw=kernewek
  | It seems to work OK in Mozilla and Konqueror. To continue the pedantic
  | note .... what is the correct code to use for "small t with cedilla"?
  | if not &#355;

 now that I've had a chance to think about it and do a
 bit of checking, &#355; is correct no matter what the
 page's encoding (charset) is set as.  (so my previous
 posting was slightly wrong, apologies!)

 the page's charset specifies the encoding of the page's
 own _source_.  the &#<dec>; notation (strictly, a numeric
 character reference rather than an entity) is simply a
 source notation for specifying an arbitrary codepoint
 (UCS character) to be displayed by the remote client
 browser.  that notation is written in (in this case)
 UTF-8, but the meaning of the &#<dec>; does not change,
 not even if the page's source charset happens to be
 16-bit Unicode (whose proper name is UTF-16) --- I've
 tested the &#355; notation in UTF-16 in Firefox and Opera.
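
 a minimal sketch of that point in Python, assuming the
 standard `html' module stands in for the browser's handling
 of the notation.  the helper name render() and the sample
 word are only for illustration; the idea is that the same
 source text, stored under three different charsets, resolves
 to the same U+0163.

```python
import html

def render(source: str, charset: str) -> str:
    """Store the source under `charset`, read it back, resolve references."""
    raw = source.encode(charset)    # the bytes as they would sit in the file
    text = raw.decode(charset)      # what a client sees after decoding them
    return html.unescape(text)      # &#355; becomes the character U+0163

# hypothetical page fragment; only plain ASCII characters appear in the source
results = {cs: render("conve&#355;haz", cs)
           for cs in ("utf-8", "iso-8859-1", "utf-16")}

for charset, shown in results.items():
    print(charset, [f"U+{ord(c):04X}" for c in shown])
```

 note that iso-8859-1 cannot store U+0163 directly, yet the
 notation still works there because only the ampersand, hash,
 digits and semicolon need to be stored.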

 ignoring arcane technical/political debates, the UCS
 (Universal Character Set, ISO standard 10646) is simply
 a list of agreed character names and values.  (the names
 and values are both unique.)  t-cedilla has been assigned
 the value 163 (hex) and name “LATIN SMALL LETTER T WITH
 CEDILLA”.  UCS codepoints are typically denoted in the
 U+<hex> notation, so t-cedilla is U+0163.
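
 the name/value pairing can be checked with Python's standard
 `unicodedata' module (a small sketch; the variable names are
 mine, and the module reflects whatever Unicode version the
 interpreter ships with):

```python
import unicodedata

t_cedilla = chr(0x163)                    # the codepoint value, in hex
name = unicodedata.name(t_cedilla)        # the agreed character name
label = f"U+{ord(t_cedilla):04X}"         # the usual U+<hex> notation

print(label, name, "decimal:", ord(t_cedilla))

# names are unique, so the lookup also works in the other direction
assert unicodedata.lookup(name) == t_cedilla
```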

 encodings specify how a (subset of) UCS values are
 stored.  UTF-8 says U+0163 is the two byte sequence
 C5 A3 (hex).  see the utf-8(7) man page for details
 on how U+0163 is encoded as C5 A3.   ISO-8859-2 says
 that _same_ character, U+0163, is the one byte FE.
 code page 852 (CP852) says U+0163 is the different
 one byte EE.   US-ASCII (ISO standard 646), ISO-8859-1,
 and ISO-8859-15 all do _not_ have any storage for
 U+0163 at all (it is out of domain).   UTF-16 says
 U+0163 is the one 16-bit word 0x0163, and in UTF-32
 it is the one 32-bit word 0x00000163.
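
 those byte values can be reproduced with Python's built-in
 codecs.  this is only a sketch: the codec names are Python's
 spellings, and the big-endian UTF-16/UTF-32 variants are
 chosen (my assumption) so that the bytes read in the same
 order as the word values above, with no byte-order mark.

```python
t_cedilla = "\u0163"

def hex_bytes(codec: str) -> str:
    """Return the stored bytes for U+0163 under `codec`, as spaced hex."""
    return " ".join(f"{b:02X}" for b in t_cedilla.encode(codec))

stored = {codec: hex_bytes(codec)
          for codec in ("utf-8", "iso8859_2", "cp852", "utf-16-be", "utf-32-be")}

# encodings with no slot for U+0163 refuse to encode it
missing = []
for codec in ("ascii", "iso8859_1", "iso8859_15"):
    try:
        t_cedilla.encode(codec)
    except UnicodeEncodeError:
        missing.append(codec)

for codec, value in stored.items():
    print(f"{codec:10} {value}")
print("out of domain:", ", ".join(missing))
```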

 so &#355; is just a notation like U+0163.  and the
 pieces of that notation --- the ampersand, the hash,
 the digit 3, the two digits 5, and the semicolon ---
 are themselves characters in the page's charset.  what
 the client's browser reads is those six characters,
 which it then understands to mean U+0163.  it then
 looks up how to display U+0163 in the current language
 (kw (Cornish)), and presents (draws) the glyph ţ.
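
 roughly what the browser does with those six characters,
 as a deliberately simplified sketch: the pattern NCR and
 the function resolve() are hypothetical names, and only the
 decimal &#<dec>; form is handled (no hex form, no named
 entities, no error handling).

```python
import re

# ampersand, hash, one or more decimal digits, semicolon
NCR = re.compile(r"&#([0-9]+);")

def resolve(text: str) -> str:
    """Replace each decimal reference with the codepoint it names."""
    return NCR.sub(lambda m: chr(int(m.group(1))), text)

notation = "&#355;"
print(len(notation), list(notation))         # the characters actually stored
print(f"U+{ord(resolve(notation)):04X}")     # the codepoint they denote
```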

 easy.  ;-)

cheers!
	-blf-

p.s.  a great on-line resource for UCS names and values,
      and some notations (like &#355;) and other details,
      is the Letter Database at the Eesti Keele Instituut
      (Institute of the Estonian Language):

           http://www.eki.ee/letter

      the `gucharmap' tool (note that is spelled _with_
      a `u' (U+0075, “LATIN SMALL LETTER U”)) can also
      be helpful.


  | On Wednesday 18 August 2004 22:38, Brian Foster wrote:
  |   | From: kevin <kevin at cybercolloids.net>
  |   | Date: Wed, 18 Aug 2004 11:21:50 +0100
  |   |[ ... ]
  |   | Cornish uses some accents including t-cedilla in words such as
  |   |
  |   |  conveţhaz - Verb, to understand
  |   |
  |   | I can write this using codes in UTF-8 like conve&#355;haz  [ ... ]
  | 
  |  uh, not exactly.  “&#355;” does not (cannot)
  |  represent literal UTF-8 per se.  (it _is_ the
  |  UCS codepoint value for U+0163, which is
  |  “LATIN SMALL LETTER T WITH CEDILLA”, which
  |  apparently is the character you want.)
  | 
  |[... remainder of my reply was slightly confused  ;-(  -blf ...]
-- 
«How many surrealists does it take to    |  Brian Foster      Montpellier,
 change a lightbulb?  Three.  One calms  |  blf at utvinternet.ie      FRANCE
 the warthog, and two fill the bathtub   |    Stop E$$o (ExxonMobile)!
 with brightly-colored machine tools.»   |        http://www.stopesso.com


