LINUX.IE, website of the Irish Linux Users' Group

[ILUG] Re: meta http-equiv useless??


greg wm ilug at nvpf.org
Sun Aug 21 07:13:36 IST 2005


greg wm wrote:
> i used wget to copy the entire http://nonviolentpeaceforce.org site to
> http://nvpf.org/np.  the former is in m$ asp, the latter captured as html.
> 
> for example, http://nonviolentpeaceforce.org/spanish/welcome.asp was
> captured to http://nvpf.org/np/spanish/welcome.asp.html
> 
> as you can see, the capture is mostly fine, including spanish characters
> in the text (eg año), however the spanish characters in the menus didn't
> do quite so well (eg Misi?n)

fixed!  see below.

> in the file año appears as a&ntilde;o which is apparently "good", but
> Misi?n appears as Misión, which is apparently "bad".
> 
> first question:  why is that bad?
> 
> if i tell galeon, instead of automatic encoding, use western iso-8859-1,
> then, presto, the page appears nicely.  but i don't have to do that to 
> see the original, nor do i have to do that for anybody else's pages, and 
> of course i can't expect our audience to go and fiddle with that in 
> their browsers.
> 
> second question:  why doesn't the meta http-equiv header do anything?
> 
> right after the title the file says <meta http-equiv="Content-Type" 
> content="text/html; charset=iso-8859-1">.  why isn't that good enough? 
> why does it make no difference at all what i change it to?  i tried 
> utf-8, Utf-8, UTF-8, Windows-1252, none have any effect tho i can see 
> them if i tell my browser to view source.

overridden by apache's http headers, apparently.  see below.
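
on the "why is that bad?" question, here is a minimal sketch of what seems to be going on. it assumes the menu items hold a raw iso-8859-1 byte for the accented letter while the body text uses html entities; that is a guess consistent with the symptoms, not something checked against the files.

```python
# The menu text as a Latin-1 page would carry it: "ó" is the single byte 0xF3.
raw = "Misión".encode("iso-8859-1")

# Read as ISO-8859-1, every byte maps to one character, so the text is intact.
as_latin1 = raw.decode("iso-8859-1")

# Read as UTF-8, 0xF3 announces a multi-byte sequence that never arrives,
# so a lenient decoder substitutes U+FFFD (often drawn as "?").
as_utf8 = raw.decode("utf-8", errors="replace")

# The entity form is pure ASCII, so it reads the same under either charset.
entity = "a&ntilde;o".encode("ascii")

print(as_latin1)                      # Misión
print(as_utf8 == "Misi\ufffdn")       # True
print(entity.decode("utf-8") == entity.decode("iso-8859-1"))  # True
```

so a page that is fine when read as latin-1 shows a broken character as soon as something tells the browser it is utf-8.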

> fourth question:  can wget be tweaked to do better?
>
> i think those menus were rendered out of some .asp database or
> whatever, differently than the rest of the text of the page.  but so 
> what?  why didn't wget capture something identical to what my
> browser shows?
> 
> the command i ran was
> wget -ENKkrl19 -nH -w2 -owget.log http://nonviolentpeaceforce.org

my locale is en_IE.UTF-8, so why did wget save in latin-1 format?
the wget manual page mentions nothing at all about character sets.

> well whatever, thunk i, no problem, i'll just find and replace.  well 
> ha.  i haven't yet managed to craft sed to capture the buggers!  it's 
> all making me feel dang defeated..
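
one way to locate the troublesome bytes without crafting a sed pattern is to look for lines that are not valid utf-8. this is only a sketch: the function name `non_utf8_lines` and the sample bytes are made up for illustration.

```python
def non_utf8_lines(data: bytes) -> list[int]:
    """Return the 1-based numbers of lines that do not decode as UTF-8."""
    bad = []
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(number)
    return bad


# Hypothetical sample: line 2 holds a raw Latin-1 0xF3, the others are ASCII.
sample = b"<p>a&ntilde;o</p>\n<a>Misi\xf3n</a>\n<p>ok</p>\n"
print(non_utf8_lines(sample))  # [2]
```

reading a real file with `open(path, "rb").read()` and passing the result in would list the lines worth a closer look.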

Brian Foster wrote:
>  there are other alternatives.  e.g., convert the page
>  (file) to UTF-8 (e.g., using iconv(1)), being sure to
>  change the meta charset setting to utf-8.
>
>  finally, vim(1):  vim confuses things here (I am _not_
>  trying to start an editor war!).  vim guesses what the
>  file's charset is, and adjusts accordingly so that you
>  can view/edit it in a locale using a different charset.
>  hence, a lot of things that cat displays as rubbish
>  display Ok in vim.  if, in vim, you do a “:set” command,
>  you'll probably see an entry like “fileencoding=utf-8”.
>  that means vim thinks the file is UTF-8.  (probably,
>  for the ISO-8859-1 files, it says “fileencoding=latin1”;
>  Latin1 is a common informal(?) name for ISO-8859-1.)
>  “:help fileencoding” for more information.

thank you brian!  iconv might well have done the trick, but i used vim.
in vim, :se fileencoding revealed that wget had saved the files in
latin-1, and :se fileencoding=utf-8 (then writing the file) cleaned up
the mess for each one.  it wasn't even a big job once i'd used :map so
that each file was fixed with a single keystroke.
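
for comparison, the same conversion can be sketched outside an editor. this is roughly what brian's iconv suggestion amounts to (re-encode the bytes and update the meta charset), not a record of what was actually run; `latin1_to_utf8` and the sample page are illustrative names only.

```python
import re


def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode an ISO-8859-1 page as UTF-8 and update its meta charset."""
    text = data.decode("iso-8859-1")  # never fails: every byte is a character
    text = re.sub(r"charset=iso-8859-1", "charset=utf-8", text,
                  flags=re.IGNORECASE)
    return text.encode("utf-8")


# Hypothetical page fragment with one raw Latin-1 byte (0xF3) in the menu.
page = (b'<meta http-equiv="Content-Type" '
        b'content="text/html; charset=iso-8859-1">\n<a>Misi\xf3n</a>\n')
fixed = latin1_to_utf8(page)
print(fixed.decode("utf-8"))
```

note that running this twice on the same file would double-encode the accented characters, so it should only be applied to files known to be latin-1.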

William A. Rowe, Jr. wrote:
> What happens if you remove the defaultcharset entirely; have Apache
> provide no hinting at the encoding; does the browser respect the meta
> tag?
>
> The http headers are authoritative, and override any metadata.  If you
> rather control your encoding with meta tags, turn off charsets entirely.

that is probably the winning answer.  i already applied the above
solution so i dunno for sure, but look..

wget --save-headers from the original m$ .asp server:
   HTTP/1.1 200 OK
   Server: Microsoft-IIS/5.0
   Date: Sat, 20 Aug 2005 21:18:55 GMT
   Connection: keep-alive
   Connection: Keep-Alive
   Content-Length: 11003
   Content-Type: text/html
   Set-Cookie: ASPSESSIONIDAQQBRDRA=KNIPOCMDJPKMMANLNHFMKMGH; path=/
   Cache-control: private

wget --save-headers from my apache server:
   HTTP/1.1 200 OK
   Date: Sun, 21 Aug 2005 04:10:34 GMT
   Server: Apache/2.0.52 (CentOS)
   Last-Modified: Sun, 21 Aug 2005 01:34:43 GMT
   ETag: "260261-2b33-9134b2c0"
   Accept-Ranges: bytes
   Content-Length: 11059
   Connection: close
   Content-Type: text/html; charset=UTF-8

now i wouldn't have thought that the following httpd.conf directive
would result in overriding the meta http-equiv headers, but, there does
seem to be a strong odor..

# Specify a default charset for all pages sent out. This is
# always a good idea and opens the door for future internationalisation
# of your web site, should you ever want it. Specifying it as
# a default does little harm; as the standard dictates that a page
# is in iso-8859-1 (latin1) unless specified otherwise i.e. you
# are merely stating the obvious. There are also some security
# reasons in browsers, related to javascript and URL parsing
# which encourage you to always set a default char set.
#
AddDefaultCharset UTF-8
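
the precedence william describes can be sketched as below: a charset in the http Content-Type header wins, and the meta http-equiv value is consulted only when the header carries none. the function names are invented for illustration, and real browsers have further fallbacks (byte-order marks, sniffing) that this ignores. "turn off charsets entirely" presumably corresponds to `AddDefaultCharset Off`, which stops apache appending the charset parameter.

```python
import re
from typing import Optional


def charset_param(content_type: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type value, if present."""
    match = re.search(r"charset\s*=\s*[\"']?([\w.:-]+)", content_type,
                      re.IGNORECASE)
    return match.group(1).lower() if match else None


def effective_charset(http_content_type: str, meta_content: str) -> Optional[str]:
    """HTTP header charset takes priority; the meta tag is only a fallback."""
    return charset_param(http_content_type) or charset_param(meta_content)


meta = "text/html; charset=iso-8859-1"

# Header as sent by the apache server above: the meta tag is ignored.
print(effective_charset("text/html; charset=UTF-8", meta))  # utf-8

# Header as sent by the IIS server above: no charset, so the meta tag applies.
print(effective_charset("text/html", meta))                 # iso-8859-1
```

that matches the two header dumps: the original server sent a bare text/html, so the meta tag was honoured, while the copy was served with charset=UTF-8, so changing the meta tag made no difference.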

greg

> Greg Whitley Mott
> IT Coordinator
> NonviolentPeaceforce.org


