I need a way to extract URLs from a binary data file that has URLs
scattered through it. Preferably one that works with Tom's floppy Linux,
as these are Windows boxen I am looking at.
I have a delicate little task to be performed from time to time. This
involves trawling through the hidden file
c:\windows\History\History.IE5\index.dat to extract URLs from what is
otherwise a binary file. Yes, nobody should be using IE, even in windoze
we know that. But my father, like Newton, 'discovered' the Natural Laws
of "Invincible Ignorance" & "Invincible Idiocy", and wrote and spoke on
the subject. The method used so far has been DOS edit, as reading a
binary file with less screws up the console. That's not handy :-(.
The URLs are in the browser form, with things like %20 in them and the
full http:// prefix, amid other bytes which it is usually unnecessary to
decode. You get lines of hex like this:
0D F0 AD 0B 0D F0 AD 0B 0D F0 AD 0B 0D F0 AD 0B with odd text
Visited:administrator at http://rad.msn.com/ADSAdClient31.dll?GetAd?
URLs seem to be followed by 0x00; after that, anything could happen.
What follows is sometimes a (spaced out) line from the page itself. The
above URL would be, I gather, passed by the MSN program to IE, and this
is the fully resolved URL (after cookies etc).
So I'm looking for what should go in place of the question marks in
grep -o --binary-files=text -E '?????' index.dat |less
bearing in mind that the user name (administrator here) could vary.
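
For comparison, here is a minimal sketch of the same idea in Python rather than grep. It assumes Python 3 is available on whatever machine reads the file (which may well not hold for a floppy distribution), and the function name extract_urls is made up for illustration. It relies only on what is described above: each URL starts with http:// (https:// is also accepted, as an assumption) and runs until the first non-printable byte, such as the trailing 0x00. Because it matches from http:// onward, the user name in front does not matter.

```python
import re
import sys

# A URL is taken to be "http://" or "https://" followed by printable,
# non-space ASCII bytes (0x21-0x7e). The match ends at the first byte
# outside that range, e.g. the 0x00 that seems to follow each URL.
URL_RE = re.compile(rb"https?://[\x21-\x7e]+")


def extract_urls(data):
    """Return the URLs found in a bytes object, in order, without repeats."""
    seen = set()
    urls = []
    for match in URL_RE.finditer(data):
        url = match.group().decode("ascii")
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


if __name__ == "__main__" and len(sys.argv) > 1:
    # Read the whole file as raw bytes so nothing binary reaches the console.
    with open(sys.argv[1], "rb") as handle:
        for url in extract_urls(handle.read()):
            print(url)
```

Saved under a name of your choosing and run with index.dat as its only argument, it prints one URL per line, so the output can be piped to less without upsetting the console. Escapes such as %20 are left as they are.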
With Best Regards,