LINUX.IE, website of the Irish Linux Users' Group

[ILUG] Problem with UTF-8 encoded data


Niall O Broin niall at linux.ie
Thu May 14 00:54:31 IST 2009


On 13 May 2009, at 23:38, Maciej Bliziński wrote:

> maciej at clover ~ $ python garbled.py | iconv -c -f utf-8 -t cp1252
> abc Ä Ö Ü ä ö ü 123
>
> It means, your application has taken utf-8 for cp1252, and then
> recoded this "cp1252" to utf-8. The shell line above reverses the
> process. To fix that in MySQL, you need to convert your columns (do a
> backup first! :-) ) from utf-8, with conversion, to cp1252, then
> without conversion to binary, and then, without conversion, to utf-8.
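The mix-up described in the quoted text can be reproduced in a few lines of Python. This is only a sketch: the sample string is taken from the output shown above, the variable names are made up for illustration, and it assumes the application behaved exactly as Maciej suggests (UTF-8 bytes misread as cp1252, then encoded as UTF-8 again).

```python
# Sample text from the thread; everything else here is illustrative.
original = "abc Ä Ö Ü ä ö ü 123\n"

# Correct UTF-8 bytes for the text.
utf8_bytes = original.encode("utf-8")

# The suspected bug: the bytes are taken for cp1252 ...
misread = utf8_bytes.decode("cp1252")

# ... and the misread text is encoded as UTF-8 a second time.
garbled = misread.encode("utf-8")

# Reversing it is what `iconv -f utf-8 -t cp1252` does: decode the
# garbled bytes as UTF-8, then write the characters out as cp1252.
repaired = garbled.decode("utf-8").encode("cp1252")

print(len(utf8_bytes), len(garbled), len(repaired))
print(repaired == utf8_bytes)
```

Under that assumption the repaired bytes are identical to the original UTF-8 bytes, which is why the result of the iconv line displays correctly on a UTF-8 terminal.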

You've certainly managed to make more sense of this than anybody else
so far, but you didn't quite get all the way. Running
iconv -c -f utf-8 -t cp1252 on the test file,
I see exactly what I should, as you show above, but there's still
something odd going on.
cp1252 is a single-byte encoding, yet each of the characters with
umlauts ends up as TWO bytes (much better than the handful of bytes it
was, but still more than I expected).

If I send the output of your iconv line through hexdump -C again, this  
is what I get:

00000000  61 62 63 20 c3 84 20 c3  96 20 c3 9c 20 c3 a4 20  |abc ?. ?. ?. ä |
00000010  c3 b6 20 c3 bc 20 31 32  33 0a                    |ö ü 123.|

which looks remarkably like - UTF-8 !
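A quick way to check that impression is to feed the bytes from the hexdump to a UTF-8 decoder. The sketch below copies the hex values from the dump above; the variable names are only for illustration.

```python
# Bytes copied from the hexdump output above.
dump = bytes.fromhex(
    "61 62 63 20 c3 84 20 c3 96 20 c3 9c 20 c3 a4 20"
    "c3 b6 20 c3 bc 20 31 32 33 0a"
)

# Strict decoding raises UnicodeDecodeError if the bytes are not valid UTF-8.
text = dump.decode("utf-8")

# Each umlaut takes two bytes in UTF-8, so there are six more bytes
# than characters.
print(repr(text))
print(len(dump), len(text))
```

The decode succeeds and gives back the expected test string, so the first iconv pass appears to have undone the double encoding and left ordinary UTF-8 behind, not single-byte cp1252 text.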

So, we throw it through iconv AGAIN, this time like this:

iconv -f utf-8 -t latin1

and bingo - latin1. So, the solution now seems to be

iconv -c -f utf-8 -t cp1252  FILE | iconv -f utf-8 -t latin1

which, on the face of it, is bizarre - convert FROM utf-8 TO cp1252
(which is pretty close to latin1) and then convert THAT from utf-8 to
latin1.
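One possible reading is that the data was encoded as UTF-8 twice, so the first iconv pass only peels off the extra layer and the second does the real conversion. Here is a rough Python equivalent of the two-step pipeline, assuming that reading is right; the function name garbled_to_latin1 is hypothetical, and errors="ignore" is used as an approximation of iconv's -c flag (drop characters that cannot be converted).

```python
def garbled_to_latin1(data: bytes) -> bytes:
    """Mimic: iconv -c -f utf-8 -t cp1252 | iconv -f utf-8 -t latin1"""
    # Pass 1: undo the extra layer of UTF-8. Characters with no cp1252
    # equivalent are dropped, roughly like iconv -c.
    plain_utf8 = data.decode("utf-8").encode("cp1252", errors="ignore")
    # Pass 2: the bytes are now ordinary UTF-8; convert them to latin1.
    return plain_utf8.decode("utf-8").encode("latin-1")


# Build doubly encoded test data the same way the bug is assumed to.
sample = "abc Ä Ö Ü ä ö ü 123\n"
doubly_encoded = sample.encode("utf-8").decode("cp1252").encode("utf-8")

latin1_bytes = garbled_to_latin1(doubly_encoded)
print(len(doubly_encoded), len(latin1_bytes))
```

Seen this way, the second pass reads "from utf-8" because the output of the first pass, although nominally cp1252, is byte for byte the original UTF-8 text.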

I don't actually want to 'fix' the MySQL. It may well be broken by any  
reasonable definition of broken, but it's producing correct output on  
its own site. The conversion is needed when exporting the data to a  
site which wants latin1, and I'm hoping that the above will do the  
trick - I should know in the morning.


Niall



