in reply to Re: How to process two files of over a million lines for changes
in thread How to process two files of over a million lines for changes

I second Abigail's suggestion. To reiterate: first sort, then do the comparison.
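
Once both files are sorted, the comparison is a single linear merge walk. Here is a minimal sketch in Perl, assuming one record per line; the file names are hypothetical:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Merge-style comparison of two files that are ALREADY sorted.
    # File names are hypothetical; one record per line is assumed.
    my ($old_file, $new_file) = ('old.sorted', 'new.sorted');

    open my $old, '<', $old_file or die "Can't read $old_file: $!";
    open my $new, '<', $new_file or die "Can't read $new_file: $!";

    my $o = <$old>;
    my $n = <$new>;
    while (defined $o and defined $n) {
        if    ($o lt $n) { print "removed: $o"; $o = <$old>; }
        elsif ($o gt $n) { print "added:   $n"; $n = <$new>; }
        else             { $o = <$old>; $n = <$new>; }   # in both files
    }
    # Drain whatever is left in either file.
    while (defined $o) { print "removed: $o"; $o = <$old>; }
    while (defined $n) { print "added:   $n"; $n = <$new>; }

This touches each line exactly once and never holds more than two lines in memory, which is why sorting first pays off on million-line files.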

A while back I was twiddling around during some "downtime" and cooked up a little Perl exercise: remove duplicate email addresses from a file with one address per line. I applied the usual recipe: build an array, sort it, take out the unique addresses, and write them to a new file. It took just under 2 minutes on my fast Windows box at work, and about 3.5 minutes on my relatively pokey iBook at home. Both machines had half a GB of RAM.
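
In Perl, that recipe comes out to roughly the sketch below. The file names are hypothetical, and the %seen/grep line is the usual Cookbook-style idiom for keeping the first occurrence of each element:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # File names are hypothetical, for illustration only.
    my ($in, $out) = ('addresses.txt', 'addresses.unique.txt');

    open my $ifh, '<', $in or die "Can't read $in: $!";
    chomp(my @addresses = <$ifh>);    # build an array
    close $ifh;

    # sort, then keep only the first occurrence of each address
    my %seen;
    my @unique = grep { !$seen{$_}++ } sort @addresses;

    open my $ofh, '>', $out or die "Can't write $out: $!";
    print {$ofh} "$_\n" for @unique;  # write them to a new file
    close $ofh;

Note that this holds the whole file in memory at once, which is where the half GB of RAM comes in.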

The file was 300+ MB in size and had about 145 million rows in it.

;-)


Re: How to process two files of over a million lines for changes
by Abigail-II (Bishop) on May 23, 2004 at 11:32 UTC
    I applied the usual recipe: build an array, sort it, take out the unique addresses, and write them to a new file.
    For me, "the usual" would be:
    $ sort -u file > file.$$ && mv file.$$ file

    Would even work without much RAM; sort(1) does an external merge sort on temporary files when the input won't fit in memory.

    Abigail

      Ya, but I am still at the "learning from the Cookbook" stage. ;-)

      Now, thanks to you, I went to the terminal and learned all sorts of nice things about 'sort.' No more using Perl for this problem.

      Gives new meaning to the old saying: "Give a newbie some Perl knowledge, and all problems start looking like Perl problems."