Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have multiple files of NGS reads with two columns. The first column contains the read and the second contains the count of that particular read in that file. The columns are tab-delimited.
e.g. file1:

@ns
ATTGCGTTC
+
//#$@TMSQ	2

@ns
GGAGCGTTC
+
//#$@TMSQ	3

file2:

@ns
ATTGCGTTC
+
//#$@#//A	1

output:

@ns
ATTGCGTTC
+
//#$@TMSQ	2	1	count:3

@ns
GGAGCGTTC
+
//#$@TMSQ	3	count:3
The program should give the combined count when the second line of the records matches. I have written the following code for this. The code is working perfectly well, but it consumes a large amount of memory; when I tried it on huge files it hangs with no output. Can anyone modify my code so that it uses minimal memory?
#!/usr/bin/env perl
use strict;
use warnings;
no warnings qw( numeric );

my %seen;
$/ = "";
while (<>) {
    chomp;
    my ($key, $value) = split ('\t', $_);
    my @lines = split /\n/, $key;
    my $key1 = $lines[1];
    $seen{$key1} //= [ $key ];
    push (@{$seen{$key1}}, $value);
}

foreach my $key1 ( sort keys %seen ) {
    my $tot = 0;
    my $file_count = @ARGV;
    for my $val ( @{$seen{$key1}} ) {
        $tot += ( split /:/, $val )[0];
    }
    if ( @{ $seen{$key1} } >= $file_count) {
        print join( "\t", @{$seen{$key1}});
        print "\tcount:" . $tot . "\n\n";
    }
}

Re: merge multiple files giving out of memory error
by Eily (Monsignor) on Feb 24, 2017 at 11:08 UTC

    You only seem to be using the number of values and their sum, so you don't have to keep all the values in your arrays. Thanks to the magic of autovivification, you can just write:

    # at the top of your program
    use constant { Key => 0, Sum => 1, Count => 2 };

    # How you fill the %seen hash
    $seen{$key1}[Key] //= $key;    ### Edit, turned = $key into //= $key
    $seen{$key1}[Sum] += $value;   # Cumulated value of $key1
    $seen{$key1}[Count]++;         # Number of times $key1 has been seen

    If that's still not enough, you can move the memory consumption elsewhere (e.g. to your hard drive) by using a database tied to your hash; DBM::Deep seems like a good candidate for that, although I have never used it. This won't make your program any faster, though.
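    A minimal sketch of what that tie could look like (assuming DBM::Deep is installed; the file name seen.db is just an example):

    use DBM::Deep;

    # Store %seen in a file on disk instead of RAM; reads and writes
    # then go through DBM::Deep transparently.
    tie my %seen, 'DBM::Deep', 'seen.db';

    # DBM::Deep handles nested structures, so the [Key]/[Sum]/[Count]
    # entries above can be filled exactly as before.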

    The code is working perfectly
    Well, now that's a mystery, because while (<>) shifts (removes) the file names from @ARGV as it opens the files, so @ARGV is empty by the time you try to get the file count. my $file_count = @ARGV; should be near the top of the program, before the loop (and needs to be done only once). By the way, your array contains the key and all the values, so even with just one value the array has size 2.
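    In other words, a minimal sketch of the fix:

    # Take the file count before while (<>) starts emptying @ARGV.
    my $file_count = @ARGV;

    while (<>) {
        # ... process the records as before ...
    }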

    About the hanging part, that's to be expected when there's a lot of data to process. You can add a message to tell you how far the processing has gone (and know if it is actually frozen or just not done yet). print "Done processing $ARGV\n" if eof; (at the end of the first loop) will print a message each time the end of a file is reached (see eof).
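    For example (a sketch; the processing body is elided):

    while (<>) {
        chomp;
        # ... build %seen as before ...
        print "Done processing $ARGV\n" if eof;   # printed once per input file
    }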

      Thank you for the help. I modified my code as follows:
      use constant { Key => 0, Sum => 1 };   # constants as suggested above

      my %seen;
      $/ = "";
      while (<>) {
          chomp;
          my ($key, $value) = split ('\t', $_);
          my @lines = split /\n/, $key;
          my $key1 = $lines[1];
          $seen{$key1}[Key] //= $key;
          $seen{$key1}[Sum] += $value;
      }

      my $file_count = @ARGV;
      foreach my $key1 ( keys %seen ) {
          if ( @{ $seen{$key1} } >= $file_count) {
              print join( "\t", @{$seen{$key1}});
              print "\n\n";
          }
      }
      but please also help me get the names of the files in which a particular read exists; that is, along with the total count, it should also tell me which files the read is present in.

        my $file_count = @ARGV;
        foreach my $key1 ( keys %seen ) {
            if ( @{ $seen{$key1} } >= $file_count) {
                print join( "\t", @{$seen{$key1}});
                print "\n\n";
            }
        }
        This still doesn't make sense. If you add print "File count is: $file_count\n"; you'll find that $file_count is always 0, because after reading the files with while (<>), @ARGV is empty. And you check the size of the array in $seen{$key1}, but it is always 2 (there are two elements, Key and Sum).

        When you use while (<>) to read from a list of files, the current file is $ARGV.

        # At the top of the file
        use constant {
            Key   => 0,
            Sum   => 1,
            Count => 2,    # Remove this if you don't use the total count
            Files => 3,    # Should be 2 if Count is not used.
        };
        # In the read loop
        $seen{$key1}[Key] //= $key;
        $seen{$key1}[Sum] += $value;
        $seen{$key1}[Count]++;           # Total count for the number of times this value exists
        $seen{$key1}[Files]{$ARGV}++;    # Count in this file

        You don't seem to want a particular format for your output (since you changed it when adapting my suggestion), so you could just try dumping the whole structure using either Data::Dumper (core module, nothing to install) or YAML (needs to be installed, but can be nicer to read).

        use Data::Dumper;

        while (<>) {
            # Your code here
        }

        print Dumper(\%seen);
        Or
        use YAML;

        while (<>) {
            # Your code here
        }

        print YAML::Dump(\%seen);

Re: merge multiple files giving out of memory error
by Marshall (Canon) on Feb 24, 2017 at 22:42 UTC
    I would add some code to report on the console how many records have been read before the program "hangs". Below I print $rec_count whenever it is evenly divisible by 1,000. Pick an appropriate number for your data...
    my $rec_count = 0;
    while (<>) {
        $rec_count++;
        print STDERR "rec=$rec_count\n" if ($rec_count % 1000 == 0);
        ....
    }
    From this debug output, you can estimate how much of the entire data set got read before the program "hung" while creating the %seen hash. Right now we know nothing about that.

    Your first while loop creates a HoA, a Hash of Arrays. I believe that, in general, this will require a lot more memory than a simple hash_key => "string value". If memory is indeed the limit, then instead of a HoA, do more processing and put up with the associated hassle of modifying the "string value".
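    A minimal sketch of that idea, using the variables from your first loop (append each file's count to one tab-delimited string per read instead of pushing onto an array):

    if ( exists $seen{$key1} ) {
        $seen{$key1} .= "\t$value";        # just append this file's count
    }
    else {
        $seen{$key1} = "$key\t$value";     # first occurrence: store block plus count
    }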

    The first question and objective is to get your required data into a structure that "fits" into memory. If that's not possible, then there are solutions.

    Update: this means getting your first "while" loop not to hang. The second loop does some things, like "sort keys", that can take a lot of memory and that your program doesn't absolutely have to do (there are other ways to achieve the same result).
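    For example, if sorted output isn't strictly needed, iterating with each avoids building the full sorted key list in memory (a sketch; variable names are illustrative):

    # Walk the hash pair by pair instead of sorting all keys first.
    while ( my ($key1, $packed) = each %seen ) {
        # ... print the result for this read ...
    }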

Re: merge multiple files giving out of memory error
by choroba (Cardinal) on Feb 26, 2017 at 07:18 UTC
    I tried to play with your code as well. I only stored the first block encountered for each DNA, but added the counts when finding another occurrence. The actual summing is done when printing the result.

    Memory consumption seems to be about 50% of your solution or less, with the running time being the same or a bit shorter.

    #!/usr/bin/perl
    use warnings;
    use strict;
    use feature qw{ say };

    use List::Util qw{ sum };

    my %h;
    $/ = q();

    while (my $block = <>) {
        my @lines = split /\n/, $block;
        my $key = $lines[1];
        my ($count) = $lines[3] =~ /\s(\d+)/;
        unless (exists $h{$key}) {
            $block =~ s/\n\n?$//;
            $block =~ s/\s*\d+$//;
            $h{$key} = $block;
        }
        $h{$key} .= "\t$count";
    }

    for my $key (sort keys %h) {
        my ($match) = $h{$key} =~ /((?:\d+\t*)+)$/;
        my @counts = $match =~ /\d+/g;
        my $sum = sum(@counts);
        say join "\t", $h{$key}, "count:$sum\n";
    }

    If you're interested, here's how I created the input data:
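    (The original generator script isn't reproduced here; the following is only a rough stand-in that produces similarly shaped input files. Read length, counts and file names are made up, loosely following the DNA length 18 and size 10_000 mentioned further below.)

    #!/usr/bin/perl
    use warnings;
    use strict;

    # Write a few FASTQ-like files: each record is four lines followed by
    # a blank line, with a random count appended to the quality line after a tab.
    my ($len, $records, $files) = (18, 10_000, 3);
    for my $f (1 .. $files) {
        open my $out, '>', "file$f" or die "file$f: $!";
        for (1 .. $records) {
            my $read = join q(), map { (qw(A C G T))[rand 4] } 1 .. $len;
            my $qual = join q(), map { chr( 33 + int rand 40 ) } 1 .. $len;
            print {$out} "\@ns\n$read\n+\n$qual\t", int(rand 5) + 1, "\n\n";
        }
        close $out;
    }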

      Thank you. I tried your first solution, but if there is a digit at the end of the fourth line, the program adds it as well. For example, for:

      @NS
      ATGGCTG
      +
      @#FBH66	2

      the output comes out as

      @NS
      ATGGCTG
      +
      @#FBH66	68

      How do I fix that? And how can I check the memory consumption of the script?
        Just require a space before the number:
        my @counts = $match =~ /\s\d+/g;
Re: merge multiple files giving out of memory error
by choroba (Cardinal) on Feb 26, 2017 at 12:00 UTC
    If you're getting Out of memory errors even for my simple solution, you can try the following one. It processes files in bunches, saves the intermediate results, and in each step, it merges the new bunch into the saved result. It's much slower, but it can handle any number of files, provided you have enough memory to load at least two files at the same time.

    The first parameter is the bunch size. Setting it to 1 consumes the least memory but is the slowest; setting it to the number of files makes it almost equivalent to the previous solution (it skips the merging phase completely).

    For DNA length 18, size 10_000, and 1_000 files, the results were the following (my machine has 8GB of RAM):
                old solution   bunch_size=100   bunch_size=250   bunch_size=500   bunch_size=1000
    Memory      34.2%          9.6%             24.5%            47.5%            86.5%
    Time        2m 6s          6m 20s           4m 12s           3m 8s            2m 45s

    The code:
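    (choroba's full listing isn't reproduced here. The following is only a rough sketch of the bunch-and-merge approach described above: the intermediate result is kept on disk in the same block format and merged with each new bunch. The temporary file name is made up.)

    #!/usr/bin/perl
    use warnings;
    use strict;

    use List::Util qw{ sum };

    my $bunch_size = shift;                  # 1st argument: bunch size
    my @files      = @ARGV;
    my $saved      = 'merged.tmp';           # intermediate result on disk

    $/ = q();                                # paragraph mode everywhere

    open my $init, '>', $saved or die $!;    # start with an empty result
    close $init;

    while (my @bunch = splice @files, 0, $bunch_size) {

        # Load one bunch into memory: key = DNA line, value = the first
        # block seen with all of its counts appended after tabs.
        my %new;
        for my $file (@bunch) {
            open my $in, '<', $file or die "$file: $!";
            while (my $block = <$in>) {
                my @lines   = split /\n/, $block;
                my $key     = $lines[1];
                my ($count) = $lines[3] =~ /\s(\d+)/;
                unless (exists $new{$key}) {
                    $block =~ s/\n\n?$//;
                    $block =~ s/\s*\d+$//;
                    $new{$key} = $block;
                }
                $new{$key} .= "\t$count";
            }
            close $in;
        }

        # Stream through the saved result, merging the new bunch into it.
        open my $old, '<', $saved        or die $!;
        open my $out, '>', "$saved.next" or die $!;
        while (my $block = <$old>) {
            $block =~ s/\n+$//;
            my $key = (split /\n/, $block)[1];
            if (exists $new{$key}) {
                my ($counts) = delete($new{$key}) =~ /((?:\t\d+)+)$/;
                $block .= $counts;
            }
            print {$out} $block, "\n\n";
        }
        print {$out} $_, "\n\n" for values %new;    # reads not seen before
        close $old;
        close $out;
        rename "$saved.next", $saved or die $!;
    }

    # Final pass: sum the collected counts for each read.
    open my $res, '<', $saved or die $!;
    while (my $block = <$res>) {
        $block =~ s/\n+$//;
        my $total = sum($block =~ /\t(\d+)/g);
        print $block, "\tcount:$total\n\n";
    }
    close $res;
    unlink $saved;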
