Dear fellow monks,
I need to remove duplicates from very large files. They are tab-delimited text files, up to about 1GB, with the first two columns holding the data I want to base the filtering on - i.e. if two records differ in the third or fourth column but are identical in the first two, they still count as duplicates for my purposes.
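For example (made-up data, columns separated by tabs), these two records would count as duplicates, since the first two columns match:

aardvark	mammal	2009	first sighting
aardvark	mammal	2011	second sighting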
Now, as far as I can tell, the best tool for jobs like this is Tie::File, which presents the file as a pseudo-array and lets you operate on it without loading the entire file into memory.
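For illustration, the basic usage as I understand it from the docs (the file name is made up):

use strict;
use warnings;
use Tie::File;

# Each element of @lines is one record of the file, fetched from
# disk on demand; by default the trailing newline is stripped.
tie my @lines, 'Tie::File', 'data.txt'
    or die "Cannot tie data.txt: $!";

print "The file has ", scalar(@lines), " records\n";

untie @lines;   # flush any pending writes and release the file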
I also found array-based dupe-stripping solutions like this one.
So the two could surely be combined, but I have to admit that my understanding of array operations and Tie::File, especially regarding speed and memory optimizations, is very limited.
So, could you give me some guidance, or, even better, code, on how to do this?
Here's the code I found online:
## This function takes an array as its parameter
## and returns the array that contains only the unique elements
sub remove_duplicate_from_array{
    my @lists = @_;
    ## This array holds all the unique elements from the list
    my @list_unique = ();
    ## Initial checker value, assumed never to occur in the data
    my $checker = -12345645312;
    ## For each element of the sorted array
    foreach my $list( sort( @lists ) ){
        ## move to the next element if it is the same as the last kept one
        if($checker == $list){
            next;
        }
        ## otherwise remember the newly found value and keep it
        else{
            $checker = $list;
            push( @list_unique, $checker );
        }
    }
    ## Finally return the array that contains the unique elements
    return @list_unique;
}
One necessary modification I see is replacing
if($checker == $list){
with
$checker =~ /^([^\t]*\t[^\t]*)/;
my $checker_part = $1;
$list =~ /^([^\t]*\t[^\t]*)/;
my $list_part = $1;
if($checker_part eq $list_part){
to make the script ignore differences from the third column on (using eq rather than ==, since the keys are now strings, not numbers).
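(A possibly tidier way to build the comparison key would be split - a sketch, where $line stands for one record:

# Key = the first two tab-separated columns, joined with a tab.
my $key = join "\t", ( split /\t/, $line, 3 )[0, 1];

though the regex version should work just as well.)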
Apart from that, I would need to change the code to modify the original array (@lists) in place instead of producing a new array (@list_unique), since Tie::File automatically writes changes to the tied array back to disk - I'm not sure how to do this. The How to find and remove duplicate elements from an array? FAQ item has code for stripping dupes out of an array in-place, but I don't think I can modify that to take only the first two columns into account.
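Something like this in-place version is roughly what I have in mind - a sketch, untested, where @lines is the tied array and the %seen hash is my addition; I worry that repeated splice calls on a tied array could get slow, since each one has to rewrite the rest of the file:

my %seen;
my $i = 0;
while ( $i <= $#lines ) {
    # Key on the first two tab-separated columns only.
    my $key = join "\t", ( split /\t/, $lines[$i], 3 )[0, 1];
    if ( $seen{$key}++ ) {
        splice @lines, $i, 1;   # duplicate: remove the record from the file
    }
    else {
        $i++;                   # first occurrence: keep it and move on
    }
}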
To get decent performance, perhaps I should increase the memory limit and hope that Tie::File will automatically defer writes as needed if it has enough memory to work with - how much is a reasonable amount of memory to allocate?
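(From what I can see in the Tie::File docs, the read cache is set with the memory option, in bytes, with a default of about 2MB, and deferred writing is switched on automatically (autodefer) when the array is accessed sequentially. A sketch with an arbitrarily chosen cache size:

use Tie::File;

# Let Tie::File cache up to ~200MB of records; the right number
# depends on how much RAM can be spared.
tie my @lines, 'Tie::File', 'data.txt', memory => 200_000_000
    or die "Cannot tie data.txt: $!";
)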
Also, I would like the order of the records to remain unchanged if possible. I'm not sure whether the foreach my $list( sort( @lists ) ){ line means I will get alphabetically sorted output at the end, but I suspect it does, which wouldn't be ideal.
Apart from getting the task solved, it would be nice to optimize speed and memory use, and free up the memory afterwards.
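I also wondered whether skipping Tie::File entirely and streaming the file once with the classic order-preserving %seen idiom would work - a sketch, with made-up file names, assuming at least three tab-separated columns per line:

use strict;
use warnings;

my %seen;
open my $in,  '<', 'data.txt'     or die "Cannot open data.txt: $!";
open my $out, '>', 'filtered.txt' or die "Cannot open filtered.txt: $!";

while ( my $line = <$in> ) {
    # Key on the first two tab-separated columns only.
    my $key = join "\t", ( split /\t/, $line, 3 )[0, 1];
    # Print only the first occurrence of each key, preserving order.
    print {$out} $line unless $seen{$key}++;
}

close $out or die "Cannot close filtered.txt: $!";
close $in;
%seen = ();   # free the hash once the run is finished

Memory use there would be proportional to the number of unique keys rather than the file size, if I understand correctly.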
Thanks for any comments, advice or code.