elef has asked for the wisdom of the Perl Monks concerning the following question:

Dear fellow monks,

I need to remove duplicates from very large files. They are tab-delimited text files of up to about 1GB, with the first two columns holding the data I want to base the filtering on - i.e. if the third or fourth column differs between two records but the first two are identical, they're still duplicates for my purposes.
Now, as far as I can tell, the best way to do things like this in large files is Tie::File, which loads files into a pseudo-array and helps do operations like this without loading the entire file to memory.
I also found array-based dupe stripping solutions like this one. So the two could surely be combined, but I have to admit that my understanding of array operations and Tie::File, especially regarding speed and memory optimizations, is very limited.
So, could you give me some guidance, or, even better, code, on how to do this?

Here's the code found online:
## This function takes the array as parameter
## Returns the array that contains the unique elements
sub remove_duplicate_from_array {
    my @lists = @_;
    ## The array holds all the unique elements from list
    my @list_unique = ();
    ## Initial checker to remove duplicate
    my $checker = -12345645312;
    ## For each sorted elements from the array
    foreach my $list ( sort(@lists) ) {
        ## move to next element if same
        if ( $checker == $list ) {
            next;
        }
        ## replace old one with new found value
        else {
            $checker = $list;
            push( @list_unique, $checker );
        }
    }
    ## Finally returns the array that contains unique elements
    return @list_unique;
}
One necessary modification I see is replacing
if($checker == $list){

with
$checker =~ /^([^\t]*\t[^\t]*)/;
my $checker_part = $1;
$list =~ /^([^\t]*\t[^\t]*)/;
my $list_part = $1;
if ($checker_part eq $list_part) {

to make the script ignore differences from the third column on (switching to eq as well, since the extracted parts are strings rather than numbers).
Apart from that, I would need to change the code to change the original array (@lists) instead of producing a new array (@list_unique), as Tie::File automatically writes changes to the original array to disk - I'm not sure how to do this. The How to find and remove duplicate elements from an array? FAQ item has code for stripping dupes out of an array in-place, but I don't think I can modify that to take only the first two columns into account.
To get decent performance, perhaps I should increase the memory limit and hope that Tie::File will automatically defer writes as needed if it has enough memory to work with - how much is a reasonable amount of memory to allocate?
Also, I would like the order of records/lines to remain unchanged if possible. I'm not sure if the foreach my $list( sort( @lists ) ){ line means that I will get alphabetically sorted output at the end, but I suspect it does, which wouldn't be ideal.

Apart from getting the task solved, it would be nice to optimize speed and memory use, and free up the memory afterwards.

Thanks for any comments, advice or code.

Re: Filtering very large files using Tie::File
by Corion (Patriarch) on Nov 26, 2010 at 17:08 UTC

    For filtering duplicates, you only need to remember which elements you have already written out. You don't need Tie::File, just a loop that uses a hash to remember which keys have already been written to the file. If memory is still scarce, you can tie that hash to disk:

    open my $in,  '<', $infile  or die "Couldn't open '$infile': $!";
    open my $out, '>', $outfile or die "Couldn't create '$outfile': $!";
    my %seen;
    while (<$in>) {
        my $key = $_; # change this to whatever key generation you need
        if (! $seen{ $key }++) {
            print $out $_;
        };
    };
      Thanks, this is starting to look a lot simpler than I imagined.

      So I could just do
      $_ =~ /^([^\t]*\t[^\t]*)/; my $key = $1;
      and that would take care of doing the filtering based on the first two columns instead of the whole line, right?
      But the %seen hash would hold the first two columns of all the unique lines in the file, which could be about 1GB. I'm not sure what you mean by tying the hash to disk, could you elaborate? Although I guess I'll have to test how much memory this takes up in reality and whether or not it causes a problem.

      On a different note, I'm having a bit of trouble comprehending the code. First of all, I don't get why you need a hash, not just an array. What's the use of the key-value pairs here? Or is it just that it's easier to see if a certain element is present in a hash than doing a similar lookup in an array?
      And what exactly does the ++ in if (! $seen{ $key }++) do? Add a new record to the %seen hash?

        See tie and DB_File. A tie'd hash simply moves the storage of the hash onto disk, at the (rather huge) cost of access speed.
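
        For illustration, a minimal sketch of what tying %seen to disk with DB_File could look like, reusing the $in/$out handles and the two-column key idea from the code above (the file name seen.db is arbitrary, and this is an untested sketch rather than drop-in code):

        use strict;
        use warnings;
        use Fcntl;
        use DB_File;

        # Tie %seen to an on-disk hash so the keys don't have to fit in RAM.
        my %seen;
        tie %seen, 'DB_File', 'seen.db', O_RDWR | O_CREAT, 0644, $DB_HASH
            or die "Cannot tie %seen to 'seen.db': $!";

        while (<$in>) {
            my ($key) = /^([^\t]*\t[^\t]*)/;   # first two tab-separated columns
            print $out $_ if !$seen{$key}++;
        }

        untie %seen;
        unlink 'seen.db';                      # the on-disk hash was only scratch space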

        A hash is a data structure optimized for fast lookup by a key value. An array can only look up data fast by its index, and the array assumes that all index values are sequential. You haven't told us whether that's the case, so I'm using a hash.

        For the "postfix increment" operator ("++"), see perlop. It is basically $seen{ $key } = $seen{ $key } + 1, except shorter - and, being the postfix form, the expression returns the value from before the increment, which is why ! $seen{ $key }++ is true only the first time a key is encountered.
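
        A tiny self-contained illustration of that "seen before?" idiom (a demonstration, not part of the original reply):

        my %seen;
        for my $key (qw(a b a c b)) {
            # The postfix ++ hands back the old value (undef the first time,
            # 1, 2, ... afterwards), so the negation is true once per distinct key.
            print "$key is new\n" if !$seen{$key}++;
        }
        # prints: "a is new", "b is new", "c is new"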

        If the data's tab delimited, you could split your incoming line on tabs

        my @w = split(/\t/,$_);
        and then build a key out of the first two fields and use that as your hash key
        if ( !$seen{$w[0].$w[1]}++ ) {

        Alex / talexb / Toronto

        "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

Re: Filtering very large files using Tie::File
by ikegami (Patriarch) on Nov 26, 2010 at 22:40 UTC

    Now, as far as I can tell, the best way to do things like this in large files is Tie::File, which loads files into a pseudo-array and helps do operations like this without loading the entire file to memory.

    The basic means of reading from a file (read, readline) do not read the entire file into memory, so that aspect of Tie::File is not special. What you are doing by using Tie::File is wasting time and memory for features you don't even need.

    I'm not sure if the foreach my $list( sort( @lists ) ){ line means that I will get alphabetically sorted output at the end, but I suspect it does, which wouldn't be ideal.

    sort @tied would cause all of the file to be loaded into memory so it can be passed to sort. Thankfully, you're not passing the tied array there, but that probably means you are doing remove_duplicate_from_array(@tied).

    remove_duplicate_from_array(@tied) would cause all of the file to be loaded into memory so it can be placed in @_. Then the first thing you do in remove_duplicate_from_array is to create a copy of @_.

    ouch.

      The basic means of reading from a file (read, readline) do not read the entire file into memory, so that aspect of Tie::File is not special.

      Well, I know that, and while (<FILE>) is what I use most of the time. I just thought that wouldn't work for this purpose without sorting the file alphabetically first. Corion's solution cleared that up, so the Tie::File idea went out the window after the first post in the thread.
      Just for the record, here's the code I ended up with, including some reporting:
      open (ALIGNED, "<:encoding(UTF-8)", "${filename}.txt") or die "Can't open aligned file for reading: $!";
      open (ALIGNED_MOD, ">:encoding(UTF-8)", "${filename}_mod.txt") or die "Can't open file for writing: $!";
      if ($delete_dupes eq "y") {
          my %seen;   # hash of unique records (hash lookups are faster than array lookups)
          my $key;    # key to be put in hash
          while (<ALIGNED>) {
              /^([^\t]*\t[^\t]*)/;                          # only watch first two fields
              chomp ($key = $1);
              print ALIGNED_MOD $_ if (! $seen{ $key }++);  # add to hash, and if new, print to file
          }
          my $unfiltered_number = $.;
          my $filtered_number = keys %seen;
          print "\n\n-------------------------------------------------";
          print "\n\nSegment numbers before and after filtering out dupes: $unfiltered_number -> $filtered_number\n";
          print LOG "\nFiltered out dupes: $unfiltered_number -> $filtered_number";
          undef %seen;    # free up memory
          close ALIGNED;
          close ALIGNED_MOD;
          rename ("${filename}_mod.txt", "${filename}.txt") or die "Can't rename file: $!";
      }

        I just thought that wouldn't work for this purpose without sorting the file alphabetically first.

        That makes no sense. If you can't do it by reading the file a line at a time using <>, what makes you think you can do it by reading the file a line at a time using Tie::File? Conversely, if you can do it by reading a line at a time using Tie::File, you can do it by reading a line at a time using <>.

Re: Filtering very large files using Tie::File
by ikegami (Patriarch) on Nov 27, 2010 at 21:12 UTC

    The following should be fast, allows arbitrarily large files, and preserves order:

    perl -ne'print "$.\t$_"' input.txt \
        | sort -k 2 \
        | uniq -f 1 \
        | sort -n \
        | cut -f 2- \
        > output.txt
    • Prefix each line with its original line number,
    • sort (ignoring the prefixed line number) to group duplicates,
    • remove duplicates (ignoring the prefixed line number),
    • restore the original order, and finally
    • remove the prefixed line number.
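
    A tiny worked example on a hypothetical three-line input where line 3 repeats line 1 (fields are tab-separated; note that, as written, the pipeline keys on the whole record after the line number, so the sort/uniq field options would need adjusting if only the first two data columns should count as the key):

    after numbering:        "1 a b x", "2 c d y", "3 a b x"
    after sort -k 2:        "1 a b x", "3 a b x", "2 c d y"
    after uniq -f 1:        "1 a b x", "2 c d y"
    after sort -n, cut:     "a b x",   "c d y"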
Re: Filtering very large files using Tie::File
by biohisham (Priest) on Nov 26, 2010 at 18:28 UTC
    I also found array-based dupe stripping solutions.....
    C:\Documents and Settings\franciscor>perl -
    use List::MoreUtils qw(uniq);
    my @list = qw(3 1 2 1 3 4 1 5 1 6 8 1 2 1 2 7 1 1 2 3 4 6 1);
    my @unique = uniq(@list);
    print "@unique";
    __END__
    3 1 2 4 5 6 8 7


    Excellence is an Endeavor of Persistence. A Year-Old Monk :D .
      Thanks, that's pretty elegant, but it filters on the whole record, not the first two fields as I need it to, so I'll stick with the solution suggested by Corion.
Re: Filtering very large files using Tie::File
by NiJo (Friar) on Nov 26, 2010 at 19:09 UTC
    It seems you are carrying around a lot of extra data and performing extra I/O. An alternative but destructive approach could reduce disk I/O even further. Maybe you can even tolerate a very remote chance of dropping a unique line, say 1 in 1E30.

    Besides using the techniques already mentioned, I'd reduce the data by extracting e.g. the md4 checksum of the relevant columns. The byte offset and length of each line also go into this intermediate file.

    The intermediate file gets the opposite treatment to list all but one of the duplicates in a group. The extracted information is used to overwrite duplicate lines in the original 1 GB file, e.g. with spaces, using seek() for positioning. This is the computer analogue of crossing lines out on paper with a ruler and pen.

    A trivial pipe filter is then applied to skip the overwritten lines when feeding the result to the next application.
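
    For what it's worth, a minimal sketch of the overwrite-in-place idea, collapsed into a single pass for brevity and using Digest::MD5 instead of md4 (the file name and column handling are assumptions, and this is untested):

    use strict;
    use warnings;
    use Digest::MD5 qw(md5);

    my $file = 'data.txt';                      # hypothetical 1 GB input
    open my $fh, '+<', $file or die "Can't open '$file': $!";

    my %seen;                                   # one 16-byte digest per unique key
    while (1) {
        my $offset = tell $fh;                  # byte offset of this line
        my $line   = <$fh>;
        last unless defined $line;
        my ($c1, $c2) = split /\t/, $line;
        next unless $seen{ md5("$c1\t$c2") }++; # first occurrence: keep it
        # Duplicate: cross the line out by overwriting its body with spaces.
        my $body = length($line) - 1;           # leave the trailing newline alone
        seek $fh, $offset, 0 or die "seek: $!";
        print {$fh} ' ' x $body;
        seek $fh, $offset + $body + 1, 0;       # re-sync before the next read
    }
    close $fh;
    # A trivial filter that skips blank(ed) lines then hides the crossed-out
    # records from whatever consumes the file next.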

    The command line approach with GNU tools is also worth checking out:  sort -u -t "\t" -k1,2 -S 100M -o out.file

      Thanks, that sounds convincing.
      It's way over my head, though, and I don't think the advantages are worth the effort here.
      I need this to work on very large files "just in case", with 1GB being on the extreme upper end of what I expect it'll need to handle. Most of the time, it will be crunching much smaller files with well under 50,000 records.