Niall O Broin wrote:
> On 14 May 2009, at 10:20, Pádraig Brady wrote:
>
>> Yes, that's what Maciej said. Something interpreted UTF8 as cp1252
>> and converted _that_ to UTF8. So to convert back to valid UTF-8
>> you need to use `iconv -f utf-8 -t cp1252` which is a little confusing.
>
> Only a little ?
>
>>> So, we throw it through iconv AGAIN, this time like this
>>>
>>> iconv -f utf-8 -t latin1
>>>
>>> and bingo - latin1
>>
>> Note that can be lossy. As discussed last night and for my own reference,
>> the most lossless conversion to iso-8859-15 for your UTF-8 data I
>> could find was:
>>
>> sed 's/\xe2\x80\xa8/<br\/>/g' < text.utf8 | #U+2028 LINE SEPARATOR ->
>> <br/>
>> uconv -f utf8 -t utf8 -x nfc | #normalise to combined chars
>> recode utf8..iso_8859-15 > Artists.really_latin1 #note recode maps
>> em-dashes -> - ...
>
> That doesn't produce iso_8859-15 - there are still some characters with
> multibyte encodings. Attached is the real source file I have to work
> with (as against what I sent you last night) if you feel like playing
> with it.
>
> Note to the not Pádraig audience - you won't see the attachment as the
> list strips them.

OK, I had a look at this data, and confirmed the file
you sent me was originally UTF8 that was converted to UTF8
by a process that thought it was reading cp1252 bytes.
As we've said above, to reverse this process to
get the original UTF8, you can use:
iconv -f utf8 -t cp1252
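
To see the mechanism on a small string, here is a minimal Python 3 sketch. The sample text and the function names are made up for illustration, and it assumes the garbling process decoded cp1252 the same way Python's codec does.

```python
def garble(original: bytes) -> bytes:
    """Misread UTF-8 bytes as cp1252, then re-encode the result as UTF-8."""
    return original.decode("cp1252").encode("utf-8")


def ungarble(garbled: bytes) -> bytes:
    """Reverse garble(); the equivalent of `iconv -f utf8 -t cp1252`."""
    return garbled.decode("utf-8").encode("cp1252")


original = "Pádraig".encode("utf-8")
garbled = garble(original)
print(garbled.decode("utf-8"))            # the mojibake form
print(ungarble(garbled).decode("utf-8"))  # back to the original text
```

This only round-trips cleanly when every byte of the original is a defined cp1252 character.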
Unfortunately this is not a fully reversible operation,
as there are certain bytes that are not valid cp1252 characters,
namely 81 8d 8f 90 9d. I.e. if any of your original UTF8 text
contains those byte values then there will be problems converting
(note iso-8859-15, for example, defines chars for all bytes,
so one would not have had this issue with it).
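
One way to check which byte values a codec leaves undefined is to try decoding each of them. The sketch below does that with Python 3's own codec tables; the helper name is hypothetical, and other implementations of cp1252 may be more lenient than Python's.

```python
def undefined_bytes(codec):
    """Return the byte values that `codec` refuses to decode."""
    bad = []
    for value in range(256):
        try:
            bytes([value]).decode(codec)
        except UnicodeDecodeError:
            bad.append(value)
    return bad


print([hex(b) for b in undefined_bytes("cp1252")])
print([hex(b) for b in undefined_bytes("iso-8859-15")])
```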
Consider for example the right curly quote in the orig UTF8 file.
This has the byte sequence: e2 80 9d
i.e. it contains 9d, one of the invalid cp1252 bytes.
Whatever did the conversion converted these 3 bytes to:
c3 a2 e2 82 ac c2 9d
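
The byte sequence above can be reproduced with a small Python 3 sketch. It is only a guess at what the unknown converter did: the assumption (labelled in the code) is that bytes undefined in cp1252 were passed through as the C1 control code point of the same value, which is consistent with the bytes quoted. The function name is hypothetical.

```python
def garble_lenient(original: bytes) -> bytes:
    """Misread bytes as cp1252, tolerating undefined bytes, then emit UTF-8."""
    chars = []
    for value in original:
        byte = bytes([value])
        try:
            chars.append(byte.decode("cp1252"))
        except UnicodeDecodeError:
            # Assumption: an undefined byte such as 9d becomes U+009D.
            chars.append(byte.decode("latin-1"))
    return "".join(chars).encode("utf-8")


quote = "\u201d".encode("utf-8")      # right curly (double) quote
print(quote.hex())
print(garble_lenient(quote).hex())
```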
So rather than just ignoring the invalid characters, what
we can do is convert to cp1252 but fall back to iso-8859-15
conversion, which will essentially just remove the c2 byte
as required. I don't know of existing tools that allow
you to do that, but a quick python proggy fits the bill
(python has very good support for this via its built-in codecs).
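#!/usr/bin/python3
import sys

# Decode the garbled input as UTF-8, then map each character back to one byte.
data = sys.stdin.buffer.read().decode("utf-8")
for c in data:
    try:
        sys.stdout.buffer.write(c.encode("cp1252"))
    except UnicodeEncodeError:
        # Not a cp1252 character (e.g. U+009D): fall back to iso-8859-15
        sys.stdout.buffer.write(c.encode("iso-8859-15"))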
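
As a quick check of the fallback idea on the curly quote example, here is the same logic as a Python 3 function working on bytes rather than stdin. The function name is hypothetical, and this is a sketch of the approach rather than a replacement for the script.

```python
def ungarble_with_fallback(garbled: bytes) -> bytes:
    """Encode each character as cp1252, falling back to iso-8859-15."""
    out = bytearray()
    for ch in garbled.decode("utf-8"):
        try:
            out += ch.encode("cp1252")
        except UnicodeEncodeError:
            # e.g. U+009D has no cp1252 byte; iso-8859-15 maps it to 9d.
            out += ch.encode("iso-8859-15")
    return bytes(out)


garbled_quote = bytes.fromhex("c3a2e282acc29d")
restored = ungarble_with_fallback(garbled_quote)
print(restored.hex(), restored.decode("utf-8"))
```

Note that a character present in neither charset would still raise an error here, which is probably what you want for data that was not garbled this way.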
Now we would like recode to map these valid UTF8 characters
to the nearest corresponding ones in the iso-8859-15 charset.
However recode doesn't know how to handle combining chars,
nor can it map "U+2028 line separator". We handle these as follows:
uconv -f utf8 -t utf8 -x nfc |
sed 's/\xe2\x80\xa8/<br\/>/g'
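
For reference, the same two preparation steps can be sketched in Python 3 with the standard unicodedata module. The function name and sample string are made up for illustration, and this does not attempt recode's approximations (such as mapping em-dashes to "-"); it only covers the NFC and line separator steps.

```python
import unicodedata


def prepare_for_latin9(text: str) -> str:
    """Compose combining sequences (NFC), then replace U+2028 with <br/>."""
    composed = unicodedata.normalize("NFC", text)
    return composed.replace("\u2028", "<br/>")


# 'a' followed by U+0301 COMBINING ACUTE ACCENT, plus a LINE SEPARATOR
sample = "Pa\u0301draig\u2028Brady"
prepared = prepare_for_latin9(sample)
print(prepared.encode("iso-8859-15"))
```

Without the NFC step the combining accent has no iso-8859-15 equivalent, so the final encode would fail on it.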
So to recap on the whole conversion command line:
./ungarble.py < text.garbled |
uconv -f utf8 -t utf8 -x nfc |
sed 's/\xe2\x80\xa8/<br\/>/g' |
recode utf8..iso-8859-15 > text.latin9
cheers,
Pádraig.