Dear fellow monks,

I need to remove duplicates from very large files. They are tab-delimited txt files of up to about 1GB, with the first two columns holding the data I want to base the filtering on - i.e. if two records are identical in the first two columns but differ in the third or fourth, they are still duplicates for my purposes.
Now, as far as I can tell, the best way to do things like this with large files is Tie::File, which presents the file as a pseudo-array and lets you operate on it without loading the entire file into memory.
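For reference, here is the basic Tie::File pattern as I understand it from the docs (a minimal, untested sketch; data.txt is just a placeholder name):

use strict;
use warnings;
use Tie::File;

# Tie the file to an array: $lines[$i] is line $i of the file, and
# changes to the array are written back to the file on disk.
tie my @lines, 'Tie::File', 'data.txt'
    or die "Cannot tie data.txt: $!";

print "The file has ", scalar(@lines), " lines\n";

untie @lines;    # flush pending writes and release the file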
I also found array-based dupe stripping solutions like this one. So the two could surely be combined, but I have to admit that my understanding of array operations and Tie::File, especially regarding speed and memory optimizations, is very limited.
So, could you give me some guidance, or, even better, code, on how to do this?

Here's the code found online:
## This function takes the array as parameter
## Returns the array that contains the unique elements
sub remove_duplicate_from_array{
    my @lists = @_;
    ## The array holds all the unique elements from list
    my @list_unique = ();
    ## Initial checker to remove duplicate
    my $checker = -12345645312;
    ## For each sorted elements from the array
    foreach my $list( sort( @lists ) ){
        ## move to next element if same
        if($checker == $list){
            next;
        }
        ## replace old one with new found value
        else{
            $checker = $list;
            push( @list_unique, $checker);
        }
    }
    ## Finally returns the array that contains unique elements
    return @list_unique;
}
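For what it's worth, it would be called like this (note that the function compares elements with ==, so as posted it only works on lists of numbers):

my @numbers = (3, 1, 2, 3, 1);
my @unique  = remove_duplicate_from_array(@numbers);
# @unique is now (1, 2, 3) - duplicates removed, but sorted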
One necessary modification I see is replacing
if($checker == $list){

with
$checker =~ /^([^\t]*\t[^\t]*)/;
my $checker_part = $1;
$list =~ /^([^\t]*\t[^\t]*)/;
my $list_part = $1;
if($checker_part eq $list_part){

to make the script ignore differences from the third column on (the comparison also has to become eq instead of ==, since the keys are strings rather than numbers).
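Or perhaps more simply, the key could be built with split instead of two regex matches (untested, and it assumes every line has at least two tab-separated columns):

my $checker_part = join "\t", (split /\t/, $checker, 3)[0, 1];
my $list_part    = join "\t", (split /\t/, $list, 3)[0, 1];
if($checker_part eq $list_part){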
Apart from that, I would need to change the code to change the original array (@lists) instead of producing a new array (@list_unique), as Tie::File automatically writes changes to the original array to disk - I'm not sure how to do this. The How to find and remove duplicate elements from an array? FAQ item has code for stripping dupes out of an array in-place, but I don't think I can modify that to take only the first two columns into account.
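Here is roughly what I imagine for the in-place version, pieced together from the above (untested; the file name and the assumption that every line has at least two columns are mine). A %seen hash keeps the first occurrence of each key and splices out later duplicates, so the original order is preserved:

use strict;
use warnings;
use Tie::File;

tie my @lines, 'Tie::File', 'data.txt'
    or die "Cannot tie data.txt: $!";

my %seen;
my $i = 0;
while ($i <= $#lines) {
    # Key = first two tab-separated columns of the current line
    my $key = join "\t", (split /\t/, $lines[$i], 3)[0, 1];
    if ($seen{$key}++) {
        splice @lines, $i, 1;    # duplicate: delete this line from the file
    }
    else {
        $i++;                    # first occurrence: keep it
    }
}

untie @lines;

I suspect all that splicing could be expensive on a 1GB file, though, since every deletion shifts the rest of the file.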
To get decent performance, perhaps I should raise Tie::File's memory option (the size of its read cache, reportedly only about 2MB by default) and hope that it defers writes as needed when it has enough memory to work with - how much is a reasonable amount of memory to allocate?
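From the docs, the cache size can apparently be set at tie time (the value below is just a guess on my part):

tie my @lines, 'Tie::File', 'data.txt',
    memory => 200_000_000    # ~200MB of read cache - is this reasonable?
    or die "Cannot tie data.txt: $!";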
Also, I would like the order of records/lines to remain unchanged if possible. I'm not sure if the foreach my $list( sort( @lists ) ){ line means that I will get alphabetically sorted output at the end, but I suspect it does, which wouldn't be ideal.

Apart from getting the task solved, it would be nice to optimize speed and memory use, and free up the memory afterwards.
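One alternative that occurred to me (untested) is to skip Tie::File entirely and stream the input into a second file, keeping a hash of the keys seen so far. Memory use would be proportional to the number of distinct keys rather than the file size, and the original order would be preserved; the file names are placeholders:

use strict;
use warnings;

open my $in,  '<', 'data.txt'    or die "Cannot read data.txt: $!";
open my $out, '>', 'deduped.txt' or die "Cannot write deduped.txt: $!";

my %seen;
while (my $line = <$in>) {
    # Key = first two tab-separated columns; print only first occurrences
    my $key = join "\t", (split /\t/, $line, 3)[0, 1];
    print {$out} $line unless $seen{$key}++;
}

close $out or die "Error closing deduped.txt: $!";
close $in;

But maybe that defeats the purpose of using Tie::File, so I'd welcome opinions on which approach makes more sense.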

Thanks for any comments, advice or code.
