The %seen hash is growing to the point that your memory is saturated.
My initial impression is that this is probably a good task for a database -- database engines know how to manage, index, and search large data sets. But I also had another thought. A file-based approach (discussed below) is going to have many performance drawbacks, but one drawback it will not have is running out of RAM.
What if, instead of storing the $key followed by a list of $value's in each hash element, you wrote each record to its own file, named after the key (the $k1 below), and appended each $value to that file? Here's an example:
$/ = "";
while(<>) {
chomp;
my ($k, $v) = split /\t/;
my $k1 = (split /\n/, $k, 3)[1];
my $out = -e $k1 ? "$value\n" : "$key:$value\n";
open my $ofh, '>>', $k1 or die $!;
print $ofh $out;
close $ofh or die $!;
}
foreach my $file (sort glob('./')) {
open my $ifh, '<', $file or die $!;
# ... and so on...
}
This approach uses the storage device as the seen hash, with each file acting as an element: a file holds exactly what you were pushing into the corresponding %seen entry. You will no longer run out of RAM. On the other hand you will become IO bound, and opening and closing a file inside the loop for every record is sadly inefficient. But at least you won't be spending all your time in swap: once the hash stopped fitting in memory you were effectively IO bound anyway, and this way you control what gets committed to storage and when it gets read back.
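If the open/close churn in that loop turns out to matter, one possible mitigation is to cache open append-mode filehandles and only close them when a cap is reached or at the end of the run. This is only a sketch under my own assumptions: the %fh cache, the handle_for() helper, and the $MAX_OPEN value of 500 are inventions for illustration, not anything from your code, and the cap should be tuned to your system's open-file limit.

use strict;
use warnings;

my %fh;                       # cache of open append-mode handles, keyed by filename
my $MAX_OPEN = 500;           # arbitrary cap; tune to your OS's open-file limit

sub handle_for {
    my $name = shift;
    return $fh{$name} if $fh{$name};
    if (keys %fh >= $MAX_OPEN) {              # crude eviction: close everything
        close $_ or die $! for values %fh;
        %fh = ();
    }
    open $fh{$name}, '>>', $name or die $!;
    return $fh{$name};
}

$/ = "";
while (<>) {
    chomp;
    my ($k, $v) = split /\t/;
    my $k1 = (split /\n/, $k, 3)[1];
    my $header_needed = !$fh{$k1} && !-e $k1;   # first time this key has ever appeared
    my $out = $header_needed ? "$k:$v\n" : "$v\n";
    print { handle_for($k1) } $out;
}
close $_ or die $! for values %fh;              # flush and close everything at the end

Closing every handle when the cap is reached is deliberately crude; an LRU eviction scheme would keep the hottest keys open longer, at the cost of a little bookkeeping.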
If this is too complex to work out (it does seem like a lot of effort, after all), you might just decide it's better to commit your data to an SQL-based database where you can set up indices and perform queries, letting the database implementation worry about how to manage your memory and storage resources. If the job only ever runs once, committing everything to a database is probably not efficient either. But if you anticipate running queries against the data set (even an ever-expanding one) more than once, the database approach is likely the better choice.
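For the database route, here is a minimal sketch using DBI with DBD::SQLite. The seen.db filename, the kv table and its columns, and the 'some-key' lookup are placeholders of my own choosing, not anything from your data:

use strict;
use warnings;
use DBI;

# One row per (key, value) pair; the index on k is what makes later lookups cheap.
my $dbh = DBI->connect('dbi:SQLite:dbname=seen.db', '', '',
                       { RaiseError => 1, AutoCommit => 0 });
$dbh->do('CREATE TABLE IF NOT EXISTS kv (k TEXT, v TEXT)');
$dbh->do('CREATE INDEX IF NOT EXISTS kv_k ON kv (k)');

my $ins = $dbh->prepare('INSERT INTO kv (k, v) VALUES (?, ?)');

$/ = "";
while (<>) {
    chomp;
    my ($k, $v) = split /\t/;
    $ins->execute($k, $v);
}
$dbh->commit;   # one transaction for the whole load is far faster than per-row commits

# Later -- in this run or another one entirely -- pull back every value for a key:
my $values = $dbh->selectcol_arrayref(
    'SELECT v FROM kv WHERE k = ?', undef, 'some-key'    # 'some-key' is a placeholder
);

SQLite keeps everything in a single file on disk and manages its own page cache, so memory use stays bounded no matter how large the data set grows.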