LINUX.IE, website of the Irish Linux Users' Group
[ILUG] Problem with UTF-8 encoded data

Pádraig Brady P at draigBrady.com
Thu May 14 10:20:34 IST 2009


Niall O Broin wrote:
> On 13 May 2009, at 23:38, Maciej Bliziński wrote:
> 
>> maciej@clover ~ $ python garbled.py | iconv -c -f utf-8 -t cp1252
>> abc Ä Ö Ü ä ö ü 123
>>
>> It means, your application has taken utf-8 for cp1252, and then
>> recoded this "cp1252" to utf-8. The shell line above reverses the
>> process. To fix that in MySQL, you need to convert your columns (do a
>> backup first! :-) ) from utf-8, with conversion, to cp1252, then
>> without conversion to binary, and then, without conversion, to utf-8.
> 
> You've certainly managed to make more sense of this than anybody else so
> far, but you didn't quite get all the way. Running
> iconv -c -f utf-8 -t cp1252 on the test file, I see exactly what I
> should, as you show above, but there's still something odd going on.
> cp1252 is a single-byte encoding, yet each of the characters with
> umlauts ends up as TWO bytes (much better than the handful of bytes it
> was, but still more than I expected).
> 
> If I send the output of your iconv line through hexdump -C again, this
> is what I get:
> 
> 00000000  61 62 63 20 c3 84 20 c3  96 20 c3 9c 20 c3 a4 20  |abc ?. ?. ?. ä |
> 00000010  c3 b6 20 c3 bc 20 31 32  33 0a                    |ö ü 123.|
> 
> which looks remarkably like - UTF-8 !

Yes, that's what Maciej said. Something interpreted the UTF-8 bytes as cp1252
and converted _that_ to UTF-8. So to get back to valid UTF-8 you need to run
`iconv -f utf-8 -t cp1252`, which is a little confusing: the output is
nominally cp1252, but its bytes are really the original UTF-8.
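
For my own reference, here is a minimal Python sketch of that round trip. It
assumes the damage happened exactly as described (UTF-8 bytes misread as
cp1252, then re-encoded as UTF-8); the variable names are made up for
illustration and are not taken from garbled.py.

```python
# Sketch of the double encoding discussed in this thread (illustrative only).
original = "abc Ä Ö Ü ä ö ü 123"

# Step 1: the correct UTF-8 bytes, as they should have been stored.
utf8_bytes = original.encode("utf-8")

# Step 2: something misreads those bytes as cp1252 ...
misread = utf8_bytes.decode("cp1252")

# Step 3: ... and re-encodes the misread text as UTF-8 (the garbled data).
double_encoded = misread.encode("utf-8")

# The fix, equivalent to `iconv -f utf-8 -t cp1252`: undo step 3, then step 2.
recovered_bytes = double_encoded.decode("utf-8").encode("cp1252")

# The "cp1252" output is byte-for-byte the original UTF-8.
print(recovered_bytes.hex(" "))
print(recovered_bytes.decode("utf-8") == original)
```

The hex it prints should match the hexdump quoted above, minus the trailing
newline byte. Note that Python's cp1252 codec rejects the five bytes cp1252
leaves undefined (0x81, 0x8d, 0x8f, 0x90, 0x9d), so step 2 can fail for other
input even though it works for this test string.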

> So, we throw it through iconv AGAIN, this time like this
> 
> iconv -f utf-8 -t latin1
> 
> and bingo - latin1

Note that this can be lossy, since latin1 cannot represent every Unicode
character. As discussed last night and for my own reference, the least lossy
conversion to iso-8859-15 for your UTF-8 data I could find was:

sed 's/\xe2\x80\xa8/<br\/>/g' < text.utf8 | #U+2028 LINE SEPARATOR -> <br/>
uconv -f utf8 -t utf8 -x nfc | #normalise to combined chars
recode utf8..iso_8859-15 > Artists.really_latin1 #note recode maps em-dashes -> - ...
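
The same three steps can be sketched in Python, in case the tools above are
not installed. This is only an approximation under stated assumptions:
to_latin9 is a hypothetical helper name, the em-dash replacement imitates just
the one recode mapping noted above, and anything else that iso-8859-15 cannot
hold becomes "?" rather than whatever recode would pick.

```python
import unicodedata


def to_latin9(text: str) -> bytes:
    """Approximate the sed | uconv | recode pipeline (hypothetical helper)."""
    # U+2028 LINE SEPARATOR -> <br/>, as in the sed step.
    text = text.replace("\u2028", "<br/>")
    # Normalise to composed characters (NFC), as in the uconv step.
    text = unicodedata.normalize("NFC", text)
    # Assumption: map em-dashes to "-" by hand; recode has more mappings.
    text = text.replace("\u2014", "-")
    # Anything still unrepresentable is replaced with "?" (this is lossy).
    return text.encode("iso-8859-15", errors="replace")


# "A" + combining diaeresis, euro sign, U+2028, em-dash.
sample = "A\u0308 \u20ac\u2028x\u2014y"
print(to_latin9(sample))

# Why the target charset and the NFC step matter:
print("\u20ac".encode("latin-1", errors="replace"))       # no euro in latin1
print("A\u0308".encode("iso-8859-15", errors="replace"))  # without NFC
```

The last two lines show where data gets lost: latin1 has no euro sign where
iso-8859-15 does, and a decomposed "A" plus combining diaeresis only fits in a
single byte once it has been composed.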

cheers,
Pádraig.



