
Thanks, this is starting to look a lot simpler than I imagined.

So I could just do
my ($key) = $in =~ /^([^\t]*\t[^\t]*)/;
and that would take care of doing the filtering based on the first two columns instead of the whole line, right?
But the %seen hash would hold the first two columns of all the unique lines in the file, which could be about 1GB. I'm not sure what you mean by tying the hash to disk, could you elaborate? Although I guess I'll have to test how much memory this takes up in reality and whether or not it causes a problem.
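
For reference, a minimal sketch of filtering on the first two columns with that regex. The helper name first_two_columns and the sample lines are made up for illustration, and lines with fewer than two columns are assumed to be skippable:

```perl
use strict;
use warnings;

# Return the first two tab-separated columns of a line, or undef if the
# line has fewer than two columns. (first_two_columns is a made-up name.)
sub first_two_columns {
    my ($line) = @_;
    my ($key) = $line =~ /^([^\t]*\t[^\t]*)/;   # list context: undef on no match
    return $key;
}

my %seen;
my @unique;
for my $in ( "x\ty\t1", "x\ty\t2", "x\tz\t3", "no tabs here" ) {
    my $key = first_two_columns($in);
    next unless defined $key;                 # skip malformed lines
    push @unique, $in unless $seen{$key}++;   # keep only the first line per key
}
print "$_\n" for @unique;
```

Capturing in list context avoids picking up a stale $1 from an earlier successful match when the current line does not match.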

On a different note, I'm having a bit of trouble comprehending the code. First of all, I don't get why you need a hash, not just an array. What's the use of the key-value pairs here? Or is it just that it's easier to see if a certain element is present in a hash than doing a similar lookup in an array?
And what exactly does the ++ in if (! $seen{ $key }++) do? Add a new record to the %seen hash?

Re^3: Filtering very large files using Tie::File
by Corion (Patriarch) on Nov 26, 2010 at 17:45 UTC

    See tie and DB_File. A tie'd hash simply moves the storage of the hash onto disk, at the (rather huge) cost of access speed.
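
A rough sketch of what tying %seen to disk could look like with DB_File, assuming the module and its underlying Berkeley DB library are installed. The file name seen.db, the temporary directory, and the sample lines are placeholders, not anything from the original script:

```perl
use strict;
use warnings;
use Fcntl;                       # O_RDWR, O_CREAT
use DB_File;                     # exports $DB_HASH
use File::Temp qw(tempdir);

# Hypothetical location for the on-disk hash; a temp dir keeps the sketch tidy.
my $dir    = tempdir( CLEANUP => 1 );
my $dbfile = "$dir/seen.db";

# From here on, reads and writes of %seen go to the file, not to RAM.
tie my %seen, 'DB_File', $dbfile, O_RDWR | O_CREAT, 0666, $DB_HASH
    or die "Cannot tie $dbfile: $!";

# Stand-in for the real input lines.
my @lines = ( "a\tb\t1", "a\tb\t2", "c\td\t3" );
my @kept;
for my $in (@lines) {
    my ($key) = $in =~ /^([^\t]*\t[^\t]*)/ or next;   # skip lines without two columns
    push @kept, $in unless $seen{$key}++;
}

untie %seen;
print "$_\n" for @kept;
```

The filtering loop is unchanged; only the storage behind %seen differs, which is why every lookup becomes slower.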

    A hash is a data structure optimized for fast lookup by a key value. An array can only look up data fast by its index, and the array assumes that all index values are sequential. You haven't told us whether that's the case, so I'm using a hash.
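
To illustrate the difference in lookup, here is a small sketch with made-up keys. The array check has to scan the elements, while the hash check is a single keyed lookup:

```perl
use strict;
use warnings;

my @keys = map { "id$_\tname$_" } 1 .. 5;   # made-up two-column keys

# Array: membership test compares against the elements one by one.
my $in_array = grep { $_ eq "id3\tname3" } @keys;   # count of matches

# Hash: membership test is one lookup by key.
my %lookup  = map { $_ => 1 } @keys;
my $in_hash = exists $lookup{"id3\tname3"} ? 1 : 0;

print "array: $in_array, hash: $in_hash\n";
```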

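    For the "postfix increment" operator ("++"), see perlop. It is basically $seen{ $key } = $seen{ $key } + 1, except shorter, and the expression itself evaluates to the value from before the increment. That old value is undefined (false) the first time a key turns up, so ! $seen{ $key }++ is true only for the first occurrence of each key.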
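
A short sketch of the idiom in isolation, using a made-up list of keys, to show that the test sees the count from before the increment:

```perl
use strict;
use warnings;

my %seen;
my @first_time;
for my $key (qw(a b a c b a)) {
    # $seen{$key}++ yields the old count: undef (false) on first sight,
    # 1 or more afterwards, so the negation is true only once per key.
    push @first_time, $key if !$seen{$key}++;
}
print "@first_time\n";   # a b c
print "$seen{a}\n";      # 3
```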

      Thanks, that clears up most things.
      I did know how post-increment works on numerical scalars, but this sort of use is new to me, and the perlop page says nothing about adding new records to a hash with ++... But this is what it seems to do.

      Your code works a treat, and the memory use doesn't seem to be too bad. Thanks.

        the perlop page says nothing about adding new records to a hash with ++
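        This commonly seen Perl idiom works because modifying a hash element that does not exist yet creates it, and ++ treats the undefined value as 0 without a warning. This is often loosely called autovivification, although strictly that term means the automatic creation of a reference when an undefined value is dereferenced. In many other languages you'd need to create the item as a separate operation before incrementing it.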
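
A small sketch of both behaviours, with made-up hash names: incrementing a missing element creates it, and dereferencing a missing element as a hash creates the intermediate reference:

```perl
use strict;
use warnings;

my %seen;
my $before = exists($seen{foo}) ? "present" : "absent";
$seen{foo}++;                # element is created; undef counts as 0
print "$before, then $seen{foo}\n";   # absent, then 1

# Creation of an intermediate reference on demand.
my %tree;
$tree{a}{b} = 1;             # $tree{a} springs into existence as a hash reference
print ref($tree{a}), "\n";   # HASH
```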

Re^3: Filtering very large files using Tie::File
by talexb (Chancellor) on Nov 26, 2010 at 17:57 UTC

    If the data's tab delimited, you could split your incoming line on tabs

    my @w = split(/\t/,$_);
    and then build a key out of the first two fields and use that as your hash key
    if ( !$seen{ join("\t", @w[0,1]) }++ ) {   # tab-joined so "ab","c" and "a","bc" differ
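
A complete sketch of the split-based approach, with made-up sample lines. It joins the two fields with a tab, on the assumption that pairs such as ("ab", "c") and ("a", "bc") must stay distinct, which also makes the key identical to what a regex capture of the first two columns would produce:

```perl
use strict;
use warnings;

my %seen;
my @kept;
for my $line ( "ab\tc\t1", "a\tbc\t2", "ab\tc\t3" ) {
    # Limit 3: split only as far as needed to isolate the first two fields.
    my @w = split /\t/, $line, 3;
    next if @w < 2;                    # fewer than two columns
    my $key = join "\t", @w[0, 1];     # "ab\tc" and "a\tbc" are different keys
    push @kept, $line unless $seen{$key}++;
}
print "$_\n" for @kept;
```

With plain concatenation both of the first two lines would produce the key "abc", and the second line would be dropped as a duplicate.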

      As far as I can tell, that does exactly what my regex does. I guess I'll stick with the regex.