On Mon, May 20, 2002 at 09:00:24PM +0100 or so it is rumoured hereabouts,
Gavin McCullagh thought:
> On Mon, 20 May 2002, Garreth McCullagh wrote:
>
> > The problem is that I do a lot of web searches to find company web site
> > addresses, and this can be, to say the least, boring, repetitive and tricky
> > to get right, especially when there are hundreds to be done.
> > Currently it involves taking each name separately, copying and pasting it
> > into a search engine, and hoping that it returns the web site address. Then
> > I copy the address back into the database.
>
> Well, the problem that you'll always have is: how do you know the search
> engine is correct? Which hit should your system take? A very simple
> solution, which can be pretty accurate depending on the site, would be to
> use the Google "I'm Feeling Lucky" feature. Something like the following
>
> gavin at fiachra gavin> curl \
> 'http://www.google.com/search?hl=en&q=microsoft&btnI=Im%20Feeling%20Lucky' \
> -A "Mozilla/4.0"
> <HTML><HEAD><TITLE>302 Moved</TITLE></HEAD><BODY>
> <H1>302 Moved</H1>
> The document has moved
> <A HREF="http://www.microsoft.com/">here</A>.
> </BODY></HTML>
>
> in a script. A couple of things to explain: curl is a command line tool
> for fetching URLs. Google doesn't like to be scripted, so you have to
> pretend you're using Netscape by adding -A "Mozilla/4.0".
>
> At this point you need a script to wrap around it and do, say, several hundred
> at a time. The script below reads in a set of company names from
> "somefile". In this example it contains
>
> microsoft
> ireland
> hewlett%20packard
>
> NB. spaces need to be turned into %20
>
> then do....
>
> gavin at infinitum gavin> for i in `cat somefile`; do curl \
> 'http://www.google.com/search?hl=en&q='$i'&btnI=Im%20Feeling%20Lucky' \
> -A "Mozilla/4.0" 2>/dev/null | grep HREF |sed 's/<A HREF=\"//' \
> |sed 's/\">here<\/A>\.//' ; done
>
> http://www.microsoft.com/
> http://www.ireland.com/
> http://www.hp.com/
>
> If you want to be putting things in a SQL db that extra bit is left as an
> exercise ;) Hope this helps,
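
For what it's worth, the one-liner above could be written out as a small script, roughly like this. It is only a sketch: it assumes Google still answers with the "302 Moved" page shown above, and the function names (urlencode_spaces, extract_href, lucky_url, lookup_file) are made up for illustration. Reading "somefile" line by line means a name like "hewlett packard" can be typed with a plain space and converted to %20 by the script.

```shell
#!/bin/sh
# Sketch only: all function names here are hypothetical helpers.

# Turn spaces into %20 so a name can be dropped into the query string.
urlencode_spaces() {
    printf '%s\n' "$1" | sed 's/ /%20/g'
}

# Pull the target out of the '<A HREF="...">here</A>.' line of the 302 page.
# Assumes the page looks like the example response quoted above.
extract_href() {
    sed -n 's/.*<A HREF="\([^"]*\)">here<\/A>.*/\1/p'
}

# Look up one company name; prints the URL Google redirects to, if any.
lucky_url() {
    q=$(urlencode_spaces "$1")
    curl -s -A "Mozilla/4.0" \
        "http://www.google.com/search?hl=en&q=${q}&btnI=Im%20Feeling%20Lucky" \
        | extract_href
}

# Read names one per line and print "name<TAB>url" for each.
lookup_file() {
    while IFS= read -r name; do
        [ -n "$name" ] || continue          # skip blank lines
        printf '%s\t%s\n' "$name" "$(lucky_url "$name")"
    done < "$1"
}
```

Something like "lookup_file somefile" would then give one tab-separated line per company, which should be easier to load into a database than bare URLs.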

and, of course, a method to choose which of the returned URLs to accept
would be nice. For example, a search for "ireland" might return
www.ireland.com and www.discoverireland.com, so you'd like something like
Searched for "ireland" and found:
1. http://www.ireland.com
2. http://www.discoverireland.com
n. none of the above
enter choice [n]:
that's easy enough (hint: exec < $THISTTY; read CHOICE )
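
To spell the hint out a little, here is a rough sketch of such a menu. The point of the "exec < $THISTTY" trick is that stdin is usually busy feeding the list of names, so the answer has to be read from the terminal instead. This sketch reads from /dev/tty rather than redirecting stdin with exec, which is a swapped-in variation on the hint, not the same thing. The function name choose_url and the CHOICE_INPUT variable are made up for illustration.

```shell
# choose_url QUERY URL...
# Hypothetical helper: shows a numbered menu and prints the chosen URL.
# The menu goes to stderr so that stdout carries only the answer.
# The reply is read from $CHOICE_INPUT (assumed to default to /dev/tty),
# so prompting still works while stdin is feeding the list of names.
choose_url() {
    query=$1
    shift
    printf 'Searched for "%s" and found:\n' "$query" >&2
    n=0
    for url in "$@"; do
        n=$((n + 1))
        printf '%d. %s\n' "$n" "$url" >&2
    done
    printf 'n. none of the above\nenter choice [n]: ' >&2
    IFS= read -r choice < "${CHOICE_INPUT:-/dev/tty}" || choice=n
    case $choice in
        ''|*[!0-9]*) return 1 ;;    # empty, "n" or anything non-numeric
    esac
    i=0
    for url in "$@"; do
        i=$((i + 1))
        if [ "$i" -eq "$choice" ]; then
            printf '%s\n' "$url"
            return 0
        fi
    done
    return 1                        # number out of range
}
```

Usage would be along the lines of

    URL=$(choose_url ireland http://www.ireland.com http://www.discoverireland.com)

with an empty result (and a non-zero exit status) meaning "none of the above".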
Conor
--
Conor Daly <conor.daly at oceanfree.net>
Domestic Sysadmin :-)
---------------------
Faenor.cod.ie
9:11pm up 4 days, 10:37, 0 users, load average: 0.00, 0.00, 0.00
Hobbiton.cod.ie
8:11pm up 4 days, 10:41, 3 users, load average: 0.04, 0.03, 0.00