LINUX.IE, website of the Irish Linux Users' Group

[ILUG] Problem with UTF-8 encoded data


Pádraig Brady P at draigBrady.com
Fri May 15 01:37:07 IST 2009


Niall O Broin wrote:
> On 14 May 2009, at 10:20, Pádraig Brady wrote:
> 
>> Yes, that's what Maciej said. Something interpreted UTF8 as cp1252
>> and converted _that_ to UTF8. So to convert back to valid UTF-8
>> you need to use `iconv -f utf-8 -t cp1252` which is a little confusing.
> 
> Only a little ?
> 
>>> So, we throw it through iconv AGAIN, this time like this
>>>
>>> iconv -f utf-8 -t latin1
>>>
>>> and bingo - latin1
>>
>> Note that can be lossy. As discussed last night and for my own reference,
>> the most lossless conversion to iso-8859-15 for your UTF-8 data I
>> could find was:
>>
>> sed 's/\xe2\x80\xa8/<br\/>/g' < text.utf8 |  # U+2028 LINE SEPARATOR -> <br/>
>> uconv -f utf8 -t utf8 -x nfc |  # normalise to combined chars
>> recode utf8..iso_8859-15 > Artists.really_latin1  # note recode maps em-dashes -> - ...
> 
> That doesn't produce iso_8859-15  - there are still some characters with
> multibyte encodings. Attached is the real source file I have to work
> with (as against what I sent you last night) if you feel like playing
> with it.
> 
> Note to the not Pádraig audience - you won't see the attachment as the
> list strips them.

OK, I had a look at this data, and confirmed the file
you sent me was originally UTF8 that was re-encoded as UTF8
by a process that thought it was reading cp1252 bytes.

As we've said above, to reverse this process to
get the original UTF8, you can use:

iconv -f utf8 -t cp1252

Unfortunately this is not a fully reversible operation
as there are certain bytes that are not valid cp1252 characters,
namely 81 8d 8f 90 9d. I.e. if the original UTF8 text contains
any of those byte values then there will be problems converting
(note iso-8859-15, for example, maps every byte value to a char,
so one would not have had this issue).
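To make the round trip concrete, here is a small Python 3 sketch (not
part of the original exchange). It garbles a character whose UTF8 bytes
are all valid cp1252, reverses it the same way iconv does, and then
lists the byte values that Python's cp1252 codec refuses to decode.
The variable names are just for illustration.

```python
# Garble a character whose UTF-8 bytes (c3 a9) are both valid cp1252.
original = "\u00e9"                                   # e with acute accent
garbled = original.encode("utf-8").decode("cp1252")   # bytes misread as cp1252
restored = garbled.encode("cp1252").decode("utf-8")   # like iconv -f utf8 -t cp1252

# Collect the byte values Python's cp1252 codec leaves undefined.
undefined = []
for b in range(256):
    try:
        bytes([b]).decode("cp1252")
    except UnicodeDecodeError:
        undefined.append(b)

print(ascii(garbled), ascii(restored))
print(" ".join(f"{b:02x}" for b in undefined))
```

Any original text whose UTF8 encoding contains one of the listed bytes
is where the plain iconv reversal runs into trouble.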

Consider for example the right curly double quote (U+201D) in the
original UTF8 file. This has the byte sequence: e2 80 9d
I.e. it contains 9d, one of the bytes that is undefined in cp1252.
Whatever did the conversion turned these 3 bytes into:

c3 a2  e2 82 ac  c2 9d
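The byte sequence above can be reproduced with a short Python 3 sketch.
This is a guess at what the converting process did, not its actual code:
the garble() helper is hypothetical, and it assumes that bytes undefined
in cp1252 were passed through as the C1 control character with the same
number.

```python
def garble(data: bytes) -> bytes:
    """Misread UTF-8 bytes as cp1252, then re-encode the result as UTF-8."""
    chars = []
    for b in data:
        try:
            chars.append(bytes([b]).decode("cp1252"))
        except UnicodeDecodeError:
            # Assumption: undefined cp1252 bytes become C1 controls (U+0081 etc.)
            chars.append(chr(b))
    return "".join(chars).encode("utf-8")


quote = "\u201d"                         # RIGHT DOUBLE QUOTATION MARK
utf8_hex = quote.encode("utf-8").hex()
garbled_hex = garble(quote.encode("utf-8")).hex()
print(utf8_hex)      # e2809d
print(garbled_hex)   # c3a2e282acc29d
```

So e2 became c3 a2, 80 became e2 82 ac (the euro sign), and 9d became
c2 9d, which is the part the plain cp1252 conversion cannot undo.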

So rather than just ignoring the invalid characters, what
we can do is convert to cp1252 but fall back to iso-8859-15
conversion, which will essentially just remove the c2 byte
as required. I don't know of existing tools that allow
you to do that, but a quick python proggy fits the bill
(python has very good built-in codec support for this).

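#!/usr/bin/python
# ungarble.py (python 2): garbled UTF8 on stdin, original bytes on stdout

import sys

for c in unicode(sys.stdin.read(), "utf-8"):
    try:
        # normal case: the char came from a byte that is valid cp1252
        sys.stdout.write(c.encode("cp1252"))
    except UnicodeEncodeError:
        # one of the undefined cp1252 bytes: recover the raw byte value
        sys.stdout.write(c.encode("iso-8859-15"))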
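The script above is Python 2 (it relies on unicode() and byte strings
on stdout). For reference, a rough Python 3 equivalent might look like
the sketch below. The ungarble() function name is mine, and the demo
feeds it the garbled curly quote from earlier rather than reading stdin.

```python
def ungarble(text: str) -> bytes:
    """Map each character back to the single byte it was misread from."""
    out = bytearray()
    for c in text:
        try:
            out += c.encode("cp1252")
        except UnicodeEncodeError:
            # Not a cp1252 character (e.g. U+009D): use the byte value as-is.
            out += c.encode("iso-8859-15")
    return bytes(out)


garbled = bytes.fromhex("c3a2e282acc29d").decode("utf-8")
recovered = ungarble(garbled)
print(recovered.hex())   # e2809d
```

To use it as a filter, one would read with
sys.stdin.buffer.read().decode("utf-8") and write the result with
sys.stdout.buffer.write(), since Python 3 keeps text and bytes apart.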

Now we would like recode to map these valid UTF8 characters
to the nearest corresponding ones in the iso-8859-15 charset.
However recode doesn't know how to handle combining chars,
nor can it map "U+2028 line separator". We handle these as follows:

uconv -f utf8 -t utf8 -x nfc |
sed 's/\xe2\x80\xa8/<br\/>/g'
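If uconv is not installed, the same two steps can be approximated with
Python's standard unicodedata module. This is only a sketch: the
normalise() name and the sample string are made up for illustration.

```python
import unicodedata


def normalise(text: str) -> str:
    """Compose combining sequences (NFC), then turn U+2028 into <br/>."""
    composed = unicodedata.normalize("NFC", text)
    return composed.replace("\u2028", "<br/>")


sample = "Pa\u0301draig\u2028Brady"   # 'a' followed by COMBINING ACUTE ACCENT
result = normalise(sample)
print(ascii(result))   # 'P\xe1draig<br/>Brady'
```

After NFC the 'a' plus combining accent is a single U+00E1, which has a
direct iso-8859-15 equivalent, so the later conversion can map it.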

So to recap on the whole conversion command line:

./ungarble.py < text.garbled |
uconv -f utf8 -t utf8 -x nfc |
sed 's/\xe2\x80\xa8/<br\/>/g' |
recode utf8..iso-8859-15 > text.latin9
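For the final step, here is a rough idea of what "map to the nearest
iso-8859-15 char" means, sketched in Python 3. The FALLBACKS table is
hypothetical and tiny; recode's own table is much larger and may choose
different replacements. Anything still unmappable becomes "?".

```python
# Hypothetical nearest-equivalent table; NOT recode's actual mapping.
FALLBACKS = str.maketrans({
    "\u2014": "-",    # EM DASH
    "\u2013": "-",    # EN DASH
    "\u201c": '"',    # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',    # RIGHT DOUBLE QUOTATION MARK
    "\u2018": "'",    # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",    # RIGHT SINGLE QUOTATION MARK
})


def to_latin9(text: str) -> bytes:
    """Apply the fallback table, then encode as iso-8859-15."""
    return text.translate(FALLBACKS).encode("iso-8859-15", errors="replace")


latin9 = to_latin9("caf\u00e9 \u2014 \u20ac5")
print(latin9.hex())   # 636166e9202d20a435
```

Note the euro sign ends up as the single byte a4, which is one of the
places iso-8859-15 differs from latin1.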

cheers,
Pádraig.


