Re: Filtering very large files using Tie::File

To filter duplicates, you only need to remember which elements you have already written to the file. You don't need Tie::File, just a loop that uses a hash to record which keys have already been written. If memory is still scarce, you can tie that hash to disk. The plain in-memory loop looks like this:

open my $in, '<', $infile
    or die "Couldn't open '$infile': $!";
open my $out, '>', $outfile
    or die "Couldn't create '$outfile': $!";

my %seen;
while (<$in>) {
    my $key = $_; # the current line; change this to whatever key generation you need
    if (! $seen{ $key }++) {
        print $out $_;
    };
};

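To make the "tie that hash to disk" remark concrete, here is a minimal sketch using DB_File. It assumes DB_File (and the Berkeley DB library it needs) is installed; the sub name `filter_with_disk_hash` and the scratch file argument `$dbfile` are made up for illustration and are not from the original post.

```perl
use strict;
use warnings;
use Fcntl;      # supplies O_RDWR and O_CREAT
use DB_File;    # assumption: Berkeley DB is available on this system

# Hypothetical helper: copy $infile to $outfile, keeping only the first
# line for each key. %seen is stored in $dbfile on disk, not in memory.
sub filter_with_disk_hash {
    my ($infile, $outfile, $dbfile) = @_;

    unlink $dbfile;    # start from an empty hash on every run
    tie my %seen, 'DB_File', $dbfile, O_RDWR|O_CREAT, 0666, $DB_HASH
        or die "Couldn't tie '$dbfile': $!";

    open my $in, '<', $infile
        or die "Couldn't open '$infile': $!";
    open my $out, '>', $outfile
        or die "Couldn't create '$outfile': $!";

    while (my $line = <$in>) {
        my $key = $line;    # whole line as key; adjust as needed
        print {$out} $line unless $seen{$key}++;
    }

    close $out or die "Couldn't close '$outfile': $!";
    untie %seen;
    unlink $dbfile;    # the on-disk hash was only scratch space
}
```

The loop itself is unchanged; only the storage behind `%seen` moves to disk, so expect it to run noticeably slower than the in-memory version.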

Re^2: Filtering very large files using Tie::File by elef (Friar) on Nov 26, 2010 at 17:41 UTC
Thanks, this is starting to look a lot simpler than I imagined. So I could just do `$_ =~ /^([^\t]*\t[^\t]*)/; $key = $1;` and that would take care of doing the filtering based on the first two columns instead of the whole line, right? But the %seen hash would hold the first two columns of all the unique lines in the file, which could be about 1GB. I'm not sure what you mean by tying the hash to disk, could you elaborate? Although I guess I'll have to test how much memory this takes up in reality and whether or not it causes a problem. On a different note, I'm having a bit of trouble comprehending the code. First of all, I don't get why you need a hash, not just an array. What's the use of the key-value pairs here? Or is it just that it's easier to see if a certain element is present in a hash than doing a similar lookup in an array? And what exactly does the ++ in `if (! $seen{ $key }++)` do? Add a new record to the %seen hash?
Re^3: Filtering very large files using Tie::File by Corion (Patriarch) on Nov 26, 2010 at 17:45 UTC
See tie and DB_File. A tied hash simply moves the storage of the hash onto disk, at the (rather huge) cost of access speed. A hash is a data structure optimized for fast lookup by a key value. An array can only look up data fast by its index, and the array assumes that all index values are sequential. You haven't told us whether that's the case, so I'm using a hash. For the "postfix increment" operator ("++"), see perlop. It is basically `$seen{ $key } = $seen{ $key } + 1`, except shorter, and the expression returns the value from before the increment, so `! $seen{ $key }++` is true only the first time a key is seen.
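A small stand-alone illustration of the `! $seen{ $key }++` idiom may help here. The key list is invented sample data, not anything from the original question.

```perl
use strict;
use warnings;

my %seen;
my @kept;
for my $key (qw(apple pear apple fig pear apple)) {
    # $seen{$key}++ returns the OLD count (undef, which is false, the
    # first time) and then stores the incremented count, creating the
    # hash entry if it did not exist yet.
    push @kept, $key if !$seen{$key}++;
}

print "@kept\n";           # prints: apple pear fig
print "$seen{apple}\n";    # prints: 3
```

So the hash ends up holding a count per key, and the test only passes while that count was still zero (or missing) before the increment.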
Re^4: Filtering very large files using Tie::File by elef (Friar) on Nov 26, 2010 at 18:59 UTC
Thanks, that clears up most things. I did know how post-increment works on numerical scalars, but this sort of use is new to me, and the perlop page says nothing about adding new records to a hash with ++... But this is what it seems to do. Your code works a treat, and the memory use doesn't seem to be too bad. Thanks.
Re^5: Filtering very large files using Tie::File by eyepopslikeamosquito (Archbishop) on Nov 26, 2010 at 20:41 UTC
Re^6: Filtering very large files using Tie::File by Corion (Patriarch) on Nov 26, 2010 at 20:56 UTC
Re^3: Filtering very large files using Tie::File by talexb (Chancellor) on Nov 26, 2010 at 17:57 UTC
If the data is tab-delimited, you could split your incoming line on tabs: `my @w = split(/\t/,$_);` and then build a key out of the first two fields and use that as your hash key: `if ( !$seen{$w[0].$w[1]}++ ) {` Alex / talexb / Toronto "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds
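One caveat with gluing two fields together without a separator: the pairs ("a", "bc") and ("ab", "c") both give the key "abc" and would be treated as duplicates. A hedged variant that keeps the tab in the key is sketched below; the sub name `two_column_key` is hypothetical, and the `//` operator assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first two tab-separated columns of a
# line as a single key, with the tab kept between them.
sub two_column_key {
    my ($line) = @_;
    chomp $line;
    # A limit of 3 stops splitting once the first two fields are known.
    my ($first, $second) = split /\t/, $line, 3;
    # Missing columns are treated as empty strings.
    return join "\t", $first // '', $second // '';
}
```

In the filtering loop this would be used as `my $key = two_column_key($_);` in place of the whole-line key.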
Re^4: Filtering very large files using Tie::File by elef (Friar) on Nov 26, 2010 at 18:48 UTC
As far as I can tell, that does exactly what my regex does. I guess I'll stick with the regex.