LINUX.IE, website of the Irish Linux Users' Group
[ILUG] Remove duplicate lines from a file?


Conor Daly conor.daly at oceanfree.net
Fri Jun 30 23:17:01 IST 2000


-----Original Message-----
From: Fergal Daly <fergal at esatclear.ie>
To: Niall O Broin <niall at magicgoeshere.com>; Conor Daly
<conor.daly at oceanfree.net>
Cc: ilug at linux.ie <ilug at linux.ie>
Date: 30 June 2000 22:53
Subject: Re: [ILUG] Remove duplicate lines from a file?


>At 16:39 30/06/00, Niall  O Broin wrote:
>>
>>perl -ne 'print unless ($seen{$_}++)'
>>
>>as a pipe to do the job. There's one slight hitch - this will consume memory
>>like there's no tomorrow. If the file(s) you want to treat are somewhat
>>smaller than your free virtual memory, you'll be OK.
>
>In a similar vein
>
>perl -MMD5 -ne 'print unless $seen{MD5->hash($_)}++'
>
>should consume lots less memory if the lines are long. Of course, if you're
>really unfortunate, 2 of your lines may hash to the same string under MD5,
>but this is highly unlikely, especially if the lines are in some kind of
>regular format. Personally I don't think I'd use this, unless I was just
>trying to get statistics on how many duplicates there are, but I thought it
>was fun.
>
>Fergal
>


I was thinking about checksumming for really big files but I don't think
I'll need it here.  I've got a total of about 17,000 unique lines of under
100 bytes each, for a total of about 1.5MB of unique data.  That should fit
into 32MB RAM with 64MB swap OK...
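
For comparison, the two approaches from the thread (remember every line seen, or remember only a checksum of each line) can be sketched in Python. This is a rough illustration rather than a translation of the Perl one-liners: the function name `unique_lines` and the `use_digest` flag are made up for this example, and the digest variant assumes the lines can be encoded as UTF-8.

```python
import hashlib
from typing import Iterable, Iterator


def unique_lines(lines: Iterable[str], use_digest: bool = False) -> Iterator[str]:
    """Yield each distinct line once, in order of first appearance.

    With use_digest=False the whole line is remembered, so memory grows
    with the total size of the unique lines.  With use_digest=True only
    the 16-byte MD5 digest of each line is remembered, which saves memory
    when lines are long, at the cost of a (very unlikely) chance that two
    different lines share a digest and one of them is dropped.
    """
    seen = set()
    for line in lines:
        if use_digest:
            # Hypothetical choice for this sketch: hash the UTF-8 bytes.
            key = hashlib.md5(line.encode("utf-8")).digest()
        else:
            key = line
        if key not in seen:
            seen.add(key)
            yield line
```

Used as a filter, something like `sys.stdout.writelines(unique_lines(sys.stdin))` would play the same role as the pipe. For lines of under 100 bytes the digest saves little, which fits the decision above not to bother with checksums here.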

I got the impression somewhere that the World would END before any two
unique files / strings would produce the same hash from MD5 :-)
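
A back-of-the-envelope check on that impression, as a Python sketch. It assumes MD5 digests of ordinary, non-adversarial text behave like uniformly random 128-bit values, and it uses the simple union bound (number of pairs divided by the number of possible digests) rather than an exact figure. The function name is invented for this example.

```python
def collision_probability_bound(n_items: int, digest_bits: int = 128) -> float:
    """Upper bound on the chance that any two of n_items distinct inputs
    share a digest, assuming digests are uniformly random (an assumption
    of this sketch, not a property proven for MD5)."""
    pairs = n_items * (n_items - 1) // 2  # number of distinct pairs of inputs
    return pairs / 2 ** digest_bits       # union bound over all pairs


# About 17,000 unique lines, as in the file discussed in this thread.
p = collision_probability_bound(17_000)
print(f"chance of an accidental collision is at most {p:.1e}")
```

For 17,000 lines the bound comes out on the order of 10^-31, so an accidental clash between two different lines is not a practical worry at this size.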

---
Conor Daly

Ph   +353 1 8326146

conor.daly at oceanfree.net
------------------------------------------




