The %seen hash is growing to the point that your memory is saturated.
My initial impression is that this is probably a good task for a database -- database engines know how to manage, index, and search large data sets. But I also had another thought. A file-based approach (discussed below) is going to have many performance drawbacks, but one drawback it will not have is running out of RAM.
What if, instead of storing the $key followed by a list of $value's in each hash element, you wrote each record to its own file, named after the key (the $k1 below), and appended each $value to that file? Here's an example:
$/ = "";
while(<>) {
chomp;
my ($k, $v) = split /\t/;
my $k1 = (split /\n/, $k, 3)[1];
my $out = -e $k1 ? "$value\n" : "$key:$value\n";
open my $ofh, '>>', $k1 or die $!;
print $ofh $out;
close $ofh or die $!;
}
foreach my $file (sort glob('./')) {
open my $ifh, '<', $file or die $!;
# ... and so on...
}
This approach uses the storage device as the seen hash, with each file acting as an element: a file holds exactly what you were pushing into the corresponding %seen entry. You will no longer run out of RAM. On the other hand you will become IO bound, and opening and closing a file inside the loop for every record is sadly inefficient. But at least you won't be spending all your time in swap: once the hash stopped fitting in memory you were effectively IO bound anyway, and this way you control what gets committed to storage and when it gets read back.
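If the open/close churn in that loop turns out to matter, one possible mitigation is to cache open append-mode filehandles and only close them when a cap is reached or at the end of the run. This is only a sketch under my own assumptions: the %fh cache, the handle_for() helper, and the $MAX_OPEN value of 500 are inventions for illustration, not anything from your code, and the cap should be tuned to your system's open-file limit.

use strict;
use warnings;

my %fh;                       # cache of open append-mode handles, keyed by filename
my $MAX_OPEN = 500;           # arbitrary cap; tune to your OS's open-file limit

sub handle_for {
    my $name = shift;
    return $fh{$name} if $fh{$name};
    if (keys %fh >= $MAX_OPEN) {              # crude eviction: close everything
        close $_ or die $! for values %fh;
        %fh = ();
    }
    open $fh{$name}, '>>', $name or die $!;
    return $fh{$name};
}

$/ = "";
while (<>) {
    chomp;
    my ($k, $v) = split /\t/;
    my $k1 = (split /\n/, $k, 3)[1];
    my $header_needed = !$fh{$k1} && !-e $k1;   # first time this key has ever appeared
    my $out = $header_needed ? "$k:$v\n" : "$v\n";
    print { handle_for($k1) } $out;
}
close $_ or die $! for values %fh;              # flush and close everything at the end

Closing every handle when the cap is reached is deliberately crude; an LRU eviction scheme would keep the hottest keys open longer, at the cost of a little bookkeeping.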
If this is too complex to work out (it does seem like a lot of effort, after all), you might just decide it's better to commit your data to an SQL-based database where you can set up indices and perform queries, letting the database implementation worry about how to manage your memory and storage resources. If the job only ever runs once, committing everything to a database is probably not efficient either. But if you anticipate running queries against the data set (even an ever-expanding one) more than once, the database approach is likely the better choice.
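For the database route, here is a minimal sketch using DBI with DBD::SQLite. The seen.db filename, the kv table and its columns, and the 'some-key' lookup are placeholders of my own choosing, not anything from your data:

use strict;
use warnings;
use DBI;

# One row per (key, value) pair; the index on k is what makes later lookups cheap.
my $dbh = DBI->connect('dbi:SQLite:dbname=seen.db', '', '',
                       { RaiseError => 1, AutoCommit => 0 });
$dbh->do('CREATE TABLE IF NOT EXISTS kv (k TEXT, v TEXT)');
$dbh->do('CREATE INDEX IF NOT EXISTS kv_k ON kv (k)');

my $ins = $dbh->prepare('INSERT INTO kv (k, v) VALUES (?, ?)');

$/ = "";
while (<>) {
    chomp;
    my ($k, $v) = split /\t/;
    $ins->execute($k, $v);
}
$dbh->commit;   # one transaction for the whole load is far faster than per-row commits

# Later -- in this run or another one entirely -- pull back every value for a key:
my $values = $dbh->selectcol_arrayref(
    'SELECT v FROM kv WHERE k = ?', undef, 'some-key'    # 'some-key' is a placeholder
);

SQLite keeps everything in a single file on disk and manages its own page cache, so memory use stays bounded no matter how large the data set grows.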