On 13 May 2009, at 23:38, Maciej Bliziński wrote:
> maciej at clover ~ $ python garbled.py | iconv -c -f utf-8 -t cp1252
> abc Ä Ö Ü ä ö ü 123
>
> It means, your application has taken utf-8 for cp1252, and then
> recoded this "cp1252" to utf-8. The shell line above reverses the
> process. To fix that in MySQL, you need to convert your columns (do a
> backup first! :-) ) from utf-8, with conversion, to cp1252, then
> without conversion to binary, and then, without conversion, to utf-8.
You've certainly managed to make more sense of this than anybody else
so far, but you didn't quite get all the way. Running iconv -c -f
utf-8 -t cp1252 on the test file, I see exactly what I should, as you
show above, but there's still something odd going on.
cp1252 is a single-byte encoding, yet each of the characters with
umlauts ends up as TWO bytes (much better than the handful of bytes it
was, but still more than I expected).
If I send the output of your iconv line through hexdump -C again, this
is what I get:
00000000  61 62 63 20 c3 84 20 c3  96 20 c3 9c 20 c3 a4 20  |abc ?. ?. ?. ä |
00000010  c3 b6 20 c3 bc 20 31 32  33 0a                    |ö ü 123.|
which looks remarkably like - UTF-8 !
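
One way to check that hunch is to decode the dumped bytes as UTF-8. This is only a minimal Python sketch, not something from the original session; the names dump and text are made up for illustration, and the bytes are copied from the hexdump above.

```python
# Bytes copied from the hexdump output above (26 bytes in total).
dump = bytes.fromhex(
    "61 62 63 20 c3 84 20 c3 96 20 c3 9c 20 c3 a4 20"
    "c3 b6 20 c3 bc 20 31 32 33 0a"
)

# A strict decode raises UnicodeDecodeError if the bytes are not valid UTF-8.
text = dump.decode("utf-8")

# Six two-byte sequences collapse to six characters: 26 bytes, 20 characters.
print(len(dump), len(text))
print(text, end="")
```

If the decode succeeds and the umlauts come out intact, the two bytes per character are simply ordinary UTF-8.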
So, we throw it through iconv AGAIN, this time like this
iconv -f utf-8 -t latin1
and bingo - latin1. So, the solution now seems to be
iconv -c -f utf-8 -t cp1252 FILE | iconv -f utf-8 -t latin1
which, on the face of it, is bizarre - convert FROM utf-8 TO cp1252
(which is pretty close to latin1) and then convert THAT from utf-8 to
latin1.
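
The two-step pipeline looks less odd when the same steps are spelled out in Python. What follows is a sketch under assumptions, not the actual garbled.py: garble() is a hypothetical stand-in for whatever the application did (UTF-8 bytes misread as cp1252 and re-encoded as UTF-8, as described in the quoted text), and the sample string is the test line from above. On that reading, the first conversion undoes the mis-encoding and leaves genuine UTF-8, and the second is an ordinary UTF-8 to latin1 conversion.

```python
def garble(text: str) -> bytes:
    """Hypothetical model of the bug: UTF-8 bytes taken for cp1252,
    then recoded to UTF-8."""
    return text.encode("utf-8").decode("cp1252").encode("utf-8")


def first_pass(data: bytes) -> bytes:
    """Equivalent of: iconv -f utf-8 -t cp1252 (without -c, so errors raise)."""
    return data.decode("utf-8").encode("cp1252")


def second_pass(data: bytes) -> bytes:
    """Equivalent of: iconv -f utf-8 -t latin1."""
    return data.decode("utf-8").encode("latin-1")


sample = "abc Ä Ö Ü ä ö ü 123\n"
garbled = garble(sample)

# After the first pass the bytes are real UTF-8, two bytes per umlaut.
step1 = first_pass(garbled)

# After the second pass each umlaut is a single latin1 byte.
step2 = second_pass(step1)

print(len(garbled), len(step1), len(step2))
```

Unlike iconv -c, this sketch does not silently drop characters that cannot be converted; it raises an exception instead, which may be preferable when checking an export for data loss.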
I don't actually want to 'fix' the MySQL. It may well be broken by any
reasonable definition of broken, but it's producing correct output on
its own site. The conversion is needed when exporting the data to a
site which wants latin1, and I'm hoping that the above will do the
trick - I should know in the morning.
Niall