in reply to Re: How to process two files of over a million lines for changes
in thread How to process two files of over a million lines for changes

I second Abigail's suggestion. To reiterate: first sort, then do the comparison.
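
Once both files are sorted, the comparison is a single linear merge walk. Here is a minimal sketch in Perl, assuming one record per line; the file names are hypothetical:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Merge-style comparison of two files that are ALREADY sorted.
    # File names are hypothetical; one record per line is assumed.
    my ($old_file, $new_file) = ('old.sorted', 'new.sorted');

    open my $old, '<', $old_file or die "Can't read $old_file: $!";
    open my $new, '<', $new_file or die "Can't read $new_file: $!";

    my $o = <$old>;
    my $n = <$new>;
    while (defined $o and defined $n) {
        if    ($o lt $n) { print "removed: $o"; $o = <$old>; }
        elsif ($o gt $n) { print "added:   $n"; $n = <$new>; }
        else             { $o = <$old>; $n = <$new>; }   # in both files
    }
    # Drain whatever is left in either file.
    while (defined $o) { print "removed: $o"; $o = <$old>; }
    while (defined $n) { print "added:   $n"; $n = <$new>; }

This touches each line exactly once and never holds more than two lines in memory, which is why sorting first pays off on million-line files.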

A while back I was twiddling around during some "downtime" and cooked up a little Perl exercise: remove duplicate email addresses from a file with one address per line. I applied the usual recipe: build an array, sort it, take out the unique addresses, and write them to a new file. It took just under 2 minutes on my fast Windows box at work, and about 3.5 minutes on my relatively pokey iBook at home. Both machines had half a GB of RAM.
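
In Perl, that recipe comes out to roughly the sketch below. The file names are hypothetical, and the %seen/grep line is the usual Cookbook-style idiom for keeping the first occurrence of each element:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # File names are hypothetical, for illustration only.
    my ($in, $out) = ('addresses.txt', 'addresses.unique.txt');

    open my $ifh, '<', $in or die "Can't read $in: $!";
    chomp(my @addresses = <$ifh>);    # build an array
    close $ifh;

    # sort, then keep only the first occurrence of each address
    my %seen;
    my @unique = grep { !$seen{$_}++ } sort @addresses;

    open my $ofh, '>', $out or die "Can't write $out: $!";
    print {$ofh} "$_\n" for @unique;  # write them to a new file
    close $ofh;

Note that this holds the whole file in memory at once, which is where the half GB of RAM comes in.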

The file was 300+ MB in size and had about 145 million rows in it.

;-)


Re: How to process two files of over a million lines for changes
by Abigail-II (Bishop) on May 23, 2004 at 11:32 UTC
    I applied the usual recipe: build an array, sort it, take out the unique addresses, and write them to a new file.
    For me, "the usual" would be:
    $ sort -u file > file.$$ && mv file.$$ file

    Would even work without much RAM; sort(1) does an external merge sort on temporary files when the input won't fit in memory.

    Abigail

      Ya, but I am still at the "learning from the Cookbook" stage. ;-)

      Now, thanks to you, I went to the terminal and learned all sorts of nice things about 'sort.' No more using Perl for this problem.

      Gives new meaning to the old saying: "Give a newbie some Perl knowledge, and all problems start looking like Perl problems."