LINUX.IE, website of the Irish Linux Users' Group
[ILUG] [OT puzzle] Grep regex please?

Declan Moriarty junk_mail at iol.ie
Tue Jun 27 11:58:49 IST 2006


I need a way to extract URLs from a binary data file that has URLs
scattered through it. Preferably one that works with a Tom's floppy
Linux, as these are Windows boxen I am looking at.

/explanation
I have a delicate little task to perform from time to time. It involves
trawling through the hidden file
c:\windows\History\History.IE5\index.dat to extract URLs from what is
otherwise a binary file. Yes, nobody should be using IE, even in windoze;
we know that. But my father, like Newton, 'discovered' Natural Laws of
"Invincible Ignorance" & "Invincible Idiocy", and wrote and spoke on the
subject. The method used so far has been DOS edit, as reading a binary
file with less screws up consoles. That's not handy :-(.
/end explanation

The URLs are in browser form, with things like %20 in them and the full
http://, amid other bytes which it is usually unnecessary to decode.
You get lines of hex like this

0D F0 AD 0B  0D F0 AD 0B   0D F0 AD 0B  0D F0 AD 0B

with odd text characters, then

Visited:administrator at http://rad.msn.com/ADSAdClient31.dll?GetAd?PG=IMSIRD?SC=HF

URLs seem to be followed by 0x00; after that anything could happen.
What follows is sometimes a (spaced out) line from the page itself. The
above URL would, I gather, be passed by the MSN program to IE, and this
is the fully resolved URL (after cookies etc.).

So I'm looking for what should go in place of the question marks in

grep -o --binary-files=text -E '?????'  index.dat |less

and it is possible that the user (administrator here) could vary.
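
For what it's worth, one possible approach is sketched below. It rests
on the observation above that each URL is followed by 0x00: if every
byte that is not printable ASCII is turned into a newline, each URL ends
up at the end of its own line and the user name in front of it stops
mattering. This is only a sketch: extract_urls is a made-up helper name,
and it assumes a tr that accepts octal ranges and a grep that supports
-o and -E, which may well not be true of the grep on a floppy
distribution.

```shell
# Sketch only: extract_urls is a hypothetical helper, not an existing tool.
# Assumes tr with octal ranges and a grep with -o and -E (GNU grep has both).
#
# Step 1: tr replaces every byte outside printable, non-space ASCII
#         (0x21-0x7E), including the 0x00 after each URL, with a newline.
# Step 2: grep -o prints only the part of each line from http:// or
#         https:// onwards, dropping any "Visited:user" prefix.
# Step 3: sort -u removes duplicate URLs.
extract_urls() {
    LC_ALL=C tr -c '\041-\176' '\n' < "$1" |
        LC_ALL=C grep -o -E 'https?://.*' |
        sort -u
}

# Usage (commented out so the block has no side effects):
# extract_urls index.dat | less
```

Since the binary junk is stripped before anything reaches the terminal,
piping the result into less should not upset the console.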
-- 
        With Best Regards,

        Declan Moriarty.



