Niall O Broin wrote:
> On 13 May 2009, at 23:38, Maciej Bliziński wrote:
>> maciej@clover ~ $ python garbled.py | iconv -c -f utf-8 -t cp1252
>> abc Ä Ö Ü ä ö ü 123
>>
>> It means, your application has taken utf-8 for cp1252, and then
>> recoded this "cp1252" to utf-8. The shell line above reverses the
>> process. To fix that in MySQL, you need to convert your columns (do a
>> backup first! :-) ) from utf-8, with conversion, to cp1252, then
>> without conversion to binary, and then, without conversion, to utf-8.
>
> You've certainly managed to make more sense of this than anybody else so
> far, but you didn't quite get all the way. Running iconv -c -f utf-8
> -t cp1252 on the test file
> I see exactly what I should, as you show above, but there's still
> something odd going on.
> cp1252 is a single byte encoding, yet each of the characters with
> umlauts ends up as TWO bytes (much better than the handful of bytes it
> was, but still more than I expected).
>
> If I send the output of your iconv line through hexdump -C again, this
> is what I get:
>
> 00000000  61 62 63 20 c3 84 20 c3  96 20 c3 9c 20 c3 a4 20  |abc ?. ?. ?. ä |
> 00000010  c3 b6 20 c3 bc 20 31 32  33 0a                    |ö ü 123.|
>
> which looks remarkably like - UTF-8 !
Yes, that's what Maciej said. Something interpreted UTF-8 as cp1252
and converted _that_ to UTF-8. So to get back to valid UTF-8 you
need to use `iconv -f utf-8 -t cp1252`, which is a little confusing
because the bytes that come out labelled as cp1252 are really UTF-8.
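
To make the round trip concrete, here is a small Python 3 sketch of the double encoding and of the repair step. The function names `garble` and `repair` are made up for illustration, and the test string is simply the one from the hexdump above; this is not the contents of garbled.py.

```python
# Sketch only: reproduce "UTF-8 taken for cp1252, then recoded to UTF-8"
# and undo it the same way `iconv -f utf-8 -t cp1252` does.

original = "abc Ä Ö Ü ä ö ü 123\n"


def garble(text: str) -> bytes:
    """Encode as UTF-8, misread those bytes as cp1252, encode as UTF-8 again."""
    return text.encode("utf-8").decode("cp1252").encode("utf-8")


def repair(data: bytes) -> bytes:
    """Decode as UTF-8 and re-encode as cp1252; the result is the original UTF-8."""
    return data.decode("utf-8").encode("cp1252")


garbled = garble(original)
repaired = repair(garbled)

print(garbled.decode("utf-8"))   # the mojibake form of the text
print(repaired.decode("utf-8"))  # the original text again
```

The bytes in `repaired` are nominally "cp1252" output, but they are the same bytes as the UTF-8 hexdump quoted above, which is why each umlaut still takes two bytes.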
> So, we throw it through iconv AGAIN, this time like this
>
> iconv -f utf-8 -t latin1
>
> and bingo - latin1
Note that this can be lossy, since latin1 cannot represent everything
that UTF-8 can. As discussed last night, and for my own reference, the
least lossy conversion to iso-8859-15 for your UTF-8 data I could find was:

  sed 's/\xe2\x80\xa8/<br\/>/g' < text.utf8 |  # U+2028 LINE SEPARATOR -> <br/>
  uconv -f utf8 -t utf8 -x nfc |               # normalise to combined chars
  recode utf8..iso_8859-15 > Artists.really_latin1  # note recode maps em-dashes -> - ...
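
For anyone without uconv or recode to hand, roughly the same three steps can be sketched in Python 3. This is an approximation, not a drop-in replacement: the function name `to_latin9` is made up, only the em-dash mapping mentioned above is imitated (recode may substitute other characters too), and anything else that iso-8859-15 cannot represent is turned into `?` rather than silently dropped.

```python
import unicodedata


def to_latin9(text: str) -> bytes:
    """Approximate the sed | uconv | recode pipeline for a decoded UTF-8 string."""
    # Step 1 (sed): U+2028 LINE SEPARATOR -> <br/>
    text = text.replace("\u2028", "<br/>")
    # Step 2 (uconv -x nfc): compose base letter + combining mark into one char
    text = unicodedata.normalize("NFC", text)
    # Step 3 (recode): assumption, only the em-dash -> "-" mapping is imitated here
    text = text.replace("\u2014", "-")
    # Whatever is still unrepresentable in iso-8859-15 becomes "?"
    return text.encode("iso-8859-15", errors="replace")


if __name__ == "__main__":
    sample = "A\u0308\u2028x\u2014y"  # A + combining diaeresis, U+2028, x, em-dash, y
    print(to_latin9(sample))
```

The NFC step matters because "A" followed by a combining diaeresis has no single-byte form until it is composed into "Ä".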
cheers,
Pádraig.