in reply to Filtering very large files using Tie::File

To filter duplicates, you only need to remember which keys you have already written to the output file. You don't need Tie::File, just a loop that uses a hash to record the keys of the lines already written. If memory is still scarce, you can tie that hash to disk. The basic loop:

open my $in, '<', $infile or die "Couldn't open '$infile': $!";
open my $out, '>', $outfile or die "Couldn't create '$outfile': $!";

my %seen;
while (<$in>) {
    my $key = $_; # change this to whatever key generation you need
    # print the line only the first time its key is seen
    if (! $seen{ $key }++) {
        print $out $_;
    }
}

Re^2: Filtering very large files using Tie::File
by elef (Friar) on Nov 26, 2010 at 17:41 UTC
    Thanks, this is starting to look a lot simpler than I imagined.

    So I could just do
    $_ =~ /^([^\t]*\t[^\t]*)/; $key = $1;
    and that would take care of doing the filtering based on the first two columns instead of the whole line, right?
    But the %seen hash would hold the first two columns of all the unique lines in the file, which could be about 1GB. I'm not sure what you mean by tying the hash to disk, could you elaborate? Although I guess I'll have to test how much memory this takes up in reality and whether or not it causes a problem.

    On a different note, I'm having a bit of trouble comprehending the code. First of all, I don't get why you need a hash, not just an array. What's the use of the key-value pairs here? Or is it just that it's easier to see if a certain element is present in a hash than doing a similar lookup in an array?
    And what exactly does the ++ in if (! $seen{ $key }++) do? Add a new record to the %seen hash?

      See tie and DB_File. A tie'd hash simply moves the storage of the hash onto disk, at the (rather huge) cost of access speed.
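A minimal sketch of what tying the hash to disk could look like, assuming DB_File (and the Berkeley DB library it needs) is installed. The sub name filter_unique and the database file name are made up for illustration; the key here is the whole line.

```perl
use strict;
use warnings;
use Fcntl;     # O_RDWR, O_CREAT
use DB_File;   # exports $DB_HASH

# Copy unique lines from $in to $out, keeping the "seen" hash in $dbfile.
# filter_unique is a hypothetical helper name, not something from the thread.
sub filter_unique {
    my ($in, $out, $dbfile) = @_;

    # From here on, every read or write of %seen goes to the file on disk.
    tie my %seen, 'DB_File', $dbfile, O_RDWR|O_CREAT, 0666, $DB_HASH
        or die "Couldn't tie '$dbfile': $!";

    while (my $line = <$in>) {
        print {$out} $line unless $seen{$line}++;
    }

    untie %seen;
    return;
}
```

The database file survives between runs, so delete it before filtering a new file, otherwise keys from the previous run still count as seen.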

      A hash is a data structure optimized for fast lookup by a key value. An array can only look up data fast by its index, and the array assumes that all index values are sequential. You haven't told us whether that's the case, so I'm using a hash.
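To make the difference concrete, here is a small sketch comparing the two lookups. The sample keys and the helper names in_array and in_hash are invented for illustration.

```perl
use strict;
use warnings;

# Sample keys, made up for this example.
my @written = ("a\tb", "c\td", "e\tf");

# Array: finding a key means comparing it against every element.
sub in_array {
    my ($key, $list) = @_;
    return scalar(grep { $_ eq $key } @$list) ? 1 : 0;
}

# Hash: the key itself locates the entry, no scan needed.
my %written = map { $_ => 1 } @written;

sub in_hash {
    my ($key, $set) = @_;
    return exists $set->{$key} ? 1 : 0;
}
```

With millions of lines, the array version repeats that scan for every input line, which is why the hash is the usual choice for "have I seen this before?" checks.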

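      For the "postfix increment" operator ("++"), see perlop. It is basically $seen{ $key } = $seen{ $key } + 1, except shorter, and the expression returns the value from before the increment. The first time a key appears that value is undef (false), so the ! test succeeds, and the assignment creates the hash entry.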
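A small sketch of what the increment does on a hash element. The keys and the @log array are invented for illustration.

```perl
use strict;
use warnings;

my %seen;
my @log;

for my $key (qw(a b a a)) {
    # Postfix ++ hands back the value from BEFORE the increment.
    # For a key that is not in the hash yet, that value is undef.
    my $old = $seen{$key}++;
    push @log, defined $old ? "$key:$old" : "$key:new";
}

# @log is now ('a:new', 'b:new', 'a:1', 'a:2')
# %seen is now (a => 3, b => 1)
```

So ! $seen{ $key }++ is true exactly once per key: the old value is false only on the first visit, and the increment stores 1 under that key as a side effect.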

        Thanks, that clears up most things.
        I did know how post-increment works on numerical scalars, but this sort of use is new to me, and the perlop page says nothing about adding new records to a hash with ++... But this is what it seems to do.

        Your code works a treat, and the memory use doesn't seem to be too bad. Thanks.

      If the data's tab delimited, you could split your incoming line on tabs

      my @w = split(/\t/,$_);
      and then build a key out of the first two fields and use that as your hash key
      if ( !$seen{"$w[0]\t$w[1]"}++ ) {

      Alex / talexb / Toronto

        As far as I can tell, that does exactly what my regex does. I guess I'll stick with the regex.
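For comparison, a sketch of both ways to build the key, assuming tab-delimited input. The sub names are made up for illustration. The two agree as long as the split version joins the fields with a tab; joining them with nothing in between can make different rows collide.

```perl
use strict;
use warnings;

# Regex version: capture the first two tab-separated columns, tab included.
sub key_by_regex {
    my ($line) = @_;
    return $line =~ /^([^\t]*\t[^\t]*)/ ? $1 : undef;
}

# Split version: the limit of 3 stops splitting once the first two
# fields are isolated; the fields are rejoined with a tab.
sub key_by_split {
    my ($line) = @_;
    my @w = split /\t/, $line, 3;
    return @w >= 2 ? join("\t", @w[0, 1]) : undef;
}

# Joining without a separator, shown only to illustrate the collision risk.
# Assumes the line has at least two columns.
sub key_concat {
    my ($line) = @_;
    my @w = split /\t/, $line, 3;
    return $w[0] . $w[1];
}

# "ab\tc\tx\n" and "a\tbc\tx\n" differ in their first two columns, yet
# key_concat returns "abc" for both; the other two subs keep them apart.
```

Both tab-preserving versions return undef for a line with no tab at all, so the caller can decide how to treat such lines.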