LINUX.IE, website of the Irish Linux Users' Group

[ILUG] Problem with UTF-8 encoded data

Niall O Broin niall at linux.ie
Thu May 14 12:07:55 IST 2009


On 14 May 2009, at 11:27, Maciej Bliziński wrote:

> 2009/5/14 Niall O Broin <niall at linux.ie>:
>> If I send the output of your iconv line through hexdump -C again, this is
>> what I get:
>>
>> 00000000  61 62 63 20 c3 84 20 c3  96 20 c3 9c 20 c3 a4 20  |abc ?. ?. ?. ä |
>> 00000010  c3 b6 20 c3 bc 20 31 32  33 0a                    |ö ü 123.|
>>
>> which looks remarkably like - UTF-8 !
>
> Yes, because that's what your terminal uses. When you say "I see
> exactly what I should", it essentially means that you have the text
> correctly encoded in utf-8, otherwise it wouldn't display correctly.
> The command I've shown in my first e-mail converts your garbled text
> to the encoding that your system/terminal displays. If you wanted the
> conversion to be complete and explicit, you could write:
>
> iconv -f utf-8 -t cp1252 | iconv -f utf-8 -t <your-terminal's-encoding>

And indeed, that's what I finally had to do, with <your-terminal's-encoding>
replaced by latin1, since the final destination of the text required that
encoding.

> If <your-terminal's encoding> is utf-8, it'll be an identity, which
> you can safely skip. The crucial point is where you convert "to
> cp1252" and then interpret it as utf-8.

Yes, and this is the bit which flabbergasts me - conversion from utf-8  
to cp1252 produces valid utf-8. It's like the original file was in  
(utf-8)^2  :-)
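
For anyone who wants to see the "(utf-8)^2" effect in isolation, here is a minimal Python sketch. The sample bytes are the ones from the hexdump quoted above; the helper names `double_encode` and `repair` are made up for illustration, and the sketch assumes the garbling came from UTF-8 bytes being misread as cp1252 and encoded to UTF-8 a second time, which is what the iconv fix in this thread implies.

```python
# Clean UTF-8 bytes, copied from the hexdump quoted earlier in this message.
clean_utf8 = bytes.fromhex(
    "61 62 63 20 c3 84 20 c3 96 20 c3 9c 20 c3 a4 20"
    "c3 b6 20 c3 bc 20 31 32 33 0a"
)


def double_encode(data: bytes) -> bytes:
    """Simulate the damage: misread UTF-8 bytes as cp1252, re-encode as UTF-8."""
    return data.decode("cp1252").encode("utf-8")


def repair(data: bytes) -> bytes:
    """Undo it: the Python equivalent of `iconv -f utf-8 -t cp1252`."""
    return data.decode("utf-8").encode("cp1252")


garbled = double_encode(clean_utf8)
repaired = repair(garbled)

# The repaired bytes are themselves valid UTF-8, so a second, explicit
# conversion is needed to reach latin1 (the final destination in this case).
latin1 = repaired.decode("utf-8").encode("latin-1")

print(garbled.decode("utf-8"))   # the mojibake form of the text
print(repaired.decode("utf-8"))  # the intended text
print(latin1.hex(" "))           # one byte per character in latin1
```

Converting "to cp1252" maps each mojibake character back to the single byte it was misread from, and those bytes are exactly the original UTF-8 sequence. That is why the output of the first iconv still looks like UTF-8.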

As the American politician reputedly said - if English was good enough  
for Jesus Christ, it's good enough for me.

> As a side note, try not to use the -c option with iconv -- it will hide
> lossy conversion. Having iconv fail with "illegal input sequence"
> is a good indicator of data loss during conversion.

Yes - and if I DON'T use -c, I do get 'illegal input sequence'. But  
for what I need to do, some data loss during conversion is preferable  
to no conversion at all - or rather having the conversion halt at the  
first thing it can't handle, which is what happens.
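
The same trade-off can be sketched in Python, where the `errors` argument of `str.encode` plays a role roughly comparable to iconv's -c option. This is only an analogy, not a description of how iconv works internally, and the sample string is invented for illustration.

```python
# Hypothetical sample: the euro sign has no latin1 (ISO-8859-1) equivalent.
text = "naïve €5"

# Strict conversion (the default) stops at the first unconvertible
# character, much like iconv without -c.
try:
    text.encode("latin-1")
    strict_failed = False
except UnicodeEncodeError as exc:
    strict_failed = True
    print(f"strict conversion failed at position {exc.start}")

# Lossy conversion silently drops what it cannot represent, roughly what
# iconv -c does.
dropped = text.encode("latin-1", errors="ignore")

# A middle ground: keep going, but leave a visible "?" marker where data
# was lost.
marked = text.encode("latin-1", errors="replace")

print(dropped)
print(marked)
```

If some loss is acceptable, the "replace" variant at least leaves a trace of where it happened, which can be searched for afterwards.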

Thanks once again for your assistance, and to Pádraig too (I gather  
you two were chewing it over in the pub :-)  )


Niall




