Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have multiple files of NGS reads with two columns. The first column contains the read and the second contains the count of that particular read in that file. The columns are tab-delimited.
e.g. file1:

@ns
ATTGCGTTC
+
//#$@TMSQ	2

@ns
GGAGCGTTC
+
//#$@TMSQ	3

file2:

@ns
ATTGCGTTC
+
//#$@#//A	1

output:

@ns
ATTGCGTTC
+
//#$@TMSQ	2	1	count:3

@ns
GGAGCGTTC
+
//#$@TMSQ	3	count:3
The program should give the combined count when the second line of the records matches. I have written the following code for this. The code is working perfectly well, but it consumes a large amount of memory; when I tried it on huge files it hangs with no output. Can anyone modify my code so that it uses minimal memory?
#!/usr/bin/env perl
use strict;
use warnings;
no warnings qw( numeric );

my %seen;
$/ = "";
while (<>) {
    chomp;
    my ($key, $value) = split ('\t', $_);
    my @lines = split /\n/, $key;
    my $key1 = $lines[1];
    $seen{$key1} //= [ $key ];
    push (@{$seen{$key1}}, $value);
}

foreach my $key1 ( sort keys %seen ) {
    my $tot = 0;
    my $file_count = @ARGV;
    for my $val ( @{$seen{$key1}} ) {
        $tot += ( split /:/, $val )[0];
    }
    if ( @{ $seen{$key1} } >= $file_count) {
        print join( "\t", @{$seen{$key1}});
        print "\tcount:" . $tot . "\n\n";
    }
}

Re: merge multiple files giving out of memory error
by Eily (Monsignor) on Feb 24, 2017 at 11:08 UTC

    You only seem to be using the number of values and their sum, so you don't have to keep all the values in your arrays. Thanks to the magic of autovivification, you can just write:

    # at the top of your program
    use constant { Key => 0, Sum => 1, Count => 2 };

    # How you fill the %seen hash
    $seen{$key1}[Key] //= $key;    ### Edit, turned = $key into //= $key
    $seen{$key1}[Sum] += $value;   # Cumulated value of $key1
    $seen{$key1}[Count]++;         # Number of times $key1 has been seen

    If that's still not enough, you can move the memory consumption elsewhere (e.g. to your hard drive) by using a database tied to your hash; DBM::Deep seems like a good candidate for that, although I have never used it. This won't make your program any faster, though.
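    A minimal sketch of what that tie could look like (assuming DBM::Deep is installed; the file name seen.db is just an example):

    use DBM::Deep;

    # Store %seen in a file on disk instead of RAM; reads and writes
    # then go through DBM::Deep transparently.
    tie my %seen, 'DBM::Deep', 'seen.db';

    # DBM::Deep handles nested structures, so the [Key]/[Sum]/[Count]
    # entries above can be filled exactly as before.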

    The code is working perfectly
    Well, now that's a mystery, because while (<>) shifts (removes) the file names from @ARGV as it opens the files, so @ARGV is empty by the time you try to get the file count. my $file_count = @ARGV; should be near the top of the program, before the loop (and needs to be done only once). By the way, your array contains the key and all the values, so even with just one value the array has size 2.
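    In other words, a minimal sketch of the fix:

    # Take the file count before while (<>) starts emptying @ARGV.
    my $file_count = @ARGV;

    while (<>) {
        # ... process the records as before ...
    }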

    About the hanging part, that's to be expected when there's a lot of data to process. You can add a message to tell you how far the processing has gone (and know if it is actually frozen or just not done yet). print "Done processing $ARGV\n" if eof; (at the end of the first loop) will print a message each time the end of a file is reached (see eof).
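    For example (a sketch; the processing body is elided):

    while (<>) {
        chomp;
        # ... build %seen as before ...
        print "Done processing $ARGV\n" if eof;   # printed once per input file
    }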

      Thank you for the help. I modified my code as follows:
      use constant { Key => 0, Sum => 1 };   # constants as suggested above

      my %seen;
      $/ = "";
      while (<>) {
          chomp;
          my ($key, $value) = split ('\t', $_);
          my @lines = split /\n/, $key;
          my $key1 = $lines[1];
          $seen{$key1}[Key] //= $key;
          $seen{$key1}[Sum] += $value;
      }

      my $file_count = @ARGV;
      foreach my $key1 ( keys %seen ) {
          if ( @{ $seen{$key1} } >= $file_count) {
              print join( "\t", @{$seen{$key1}});
              print "\n\n";
          }
      }
      but please also help me get the names of the files in which a particular read exists; that is, along with the total count, it should also tell me which files the read is present in.

        my $file_count = @ARGV;
        foreach my $key1 ( keys %seen ) {
            if ( @{ $seen{$key1} } >= $file_count) {
                print join( "\t", @{$seen{$key1}});
                print "\n\n";
            }
        }
        This still doesn't make sense. If you add print "File count is: $file_count\n"; you'll find that $file_count is always 0, because after reading the files with while (<>), @ARGV is empty. And you check the size of the array in $seen{$key1}, but it is always 2 (there are two elements, Key and Sum).

        When you use while (<>) to read from a list of files, the current file is $ARGV.

        # At the top of the file
        use constant {
            Key   => 0,
            Sum   => 1,
            Count => 2,    # Remove this if you don't use the total count
            Files => 3,    # Should be 2 if Count is not used.
        };
        # In the read loop
        $seen{$key1}[Key] //= $key;
        $seen{$key1}[Sum] += $value;
        $seen{$key1}[Count]++;           # Total count for the number of times this value exists
        $seen{$key1}[Files]{$ARGV}++;    # Count in this file

        You don't seem to want a particular format for your output (since you changed it when adapting my suggestion), so you could just try dumping the whole structure using either Data::Dumper (core module, nothing to install) or YAML (needs to be installed, but can be nicer to read).

        use Data::Dumper;

        while (<>) {
            # Your code here
        }

        print Dumper(\%seen);
        Or
        use YAML;

        while (<>) {
            # Your code here
        }

        print YAML::Dump(\%seen);

Re: merge multiple files giving out of memory error
by Marshall (Canon) on Feb 24, 2017 at 22:42 UTC
    I would add some code to report on the console how many records have been read before the program "hangs". Below I print $rec_count whenever it is evenly divisible by 1,000. Pick an appropriate number for your data...
    my $rec_count = 0;
    while (<>) {
        $rec_count++;
        print STDERR "rec=$rec_count\n" if ($rec_count % 1000 == 0);
        ....
    }
    From this debug output, you can estimate how much of the entire data set got read before the program "hung" while creating the %seen hash. Right now we know nothing about that.

    Your first while loop creates a HoA, a Hash of Arrays. I believe that, in general, this will require a lot more memory than a simple hash_key => "string value". If memory is indeed the limit, then instead of a HoA, do more processing and put up with the associated hassle of modifying the "string value".
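    A minimal sketch of that idea, using the variables from your first loop (append each file's count to one tab-delimited string per read instead of pushing onto an array):

    if ( exists $seen{$key1} ) {
        $seen{$key1} .= "\t$value";        # just append this file's count
    }
    else {
        $seen{$key1} = "$key\t$value";     # first occurrence: store block plus count
    }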

    The first question and objective is to get your required data into a structure that "fits" into memory. If that's not possible, then there are solutions.

    Update: this means getting your first "while" loop not to hang. The second loop does some things, like "sort keys", that can take a lot of memory and that your program doesn't absolutely have to do (there are other ways to achieve the same result).
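    For example, if sorted output isn't strictly needed, iterating with each avoids building the full sorted key list in memory (a sketch; variable names are illustrative):

    # Walk the hash pair by pair instead of sorting all keys first.
    while ( my ($key1, $packed) = each %seen ) {
        # ... print the result for this read ...
    }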

Re: merge multiple files giving out of memory error
by choroba (Cardinal) on Feb 26, 2017 at 07:18 UTC
    I tried to play with your code as well. I only stored the first block encountered for each DNA, but added the counts when finding another occurrence. The actual summing is done when printing the result.

    Memory consumption seems to be about 50% of your solution or less, with the running time being the same or a bit shorter.

    #!/usr/bin/perl
    use warnings;
    use strict;
    use feature qw{ say };

    use List::Util qw{ sum };

    my %h;
    $/ = q();

    while (my $block = <>) {
        my @lines = split /\n/, $block;
        my $key = $lines[1];
        my ($count) = $lines[3] =~ /\s(\d+)/;
        unless (exists $h{$key}) {
            $block =~ s/\n\n?$//;
            $block =~ s/\s*\d+$//;
            $h{$key} = $block;
        }
        $h{$key} .= "\t$count";
    }

    for my $key (sort keys %h) {
        my ($match) = $h{$key} =~ /((?:\d+\t*)+)$/;
        my @counts = $match =~ /\d+/g;
        my $sum = sum(@counts);
        say join "\t", $h{$key}, "count:$sum\n";
    }

    If you're interested, here's how I created the input data:
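    (The original generator script isn't reproduced here; the following is only a rough stand-in that produces similarly shaped input files. Read length, counts and file names are made up, loosely following the DNA length 18 and size 10_000 mentioned further below.)

    #!/usr/bin/perl
    use warnings;
    use strict;

    # Write a few FASTQ-like files: each record is four lines followed by
    # a blank line, with a random count appended to the quality line after a tab.
    my ($len, $records, $files) = (18, 10_000, 3);
    for my $f (1 .. $files) {
        open my $out, '>', "file$f" or die "file$f: $!";
        for (1 .. $records) {
            my $read = join q(), map { (qw(A C G T))[rand 4] } 1 .. $len;
            my $qual = join q(), map { chr( 33 + int rand 40 ) } 1 .. $len;
            print {$out} "\@ns\n$read\n+\n$qual\t", int(rand 5) + 1, "\n\n";
        }
        close $out;
    }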

      Thank you. I tried your first solution, but if there is a digit at the end of the fourth line, the program adds it as well. For example, for:

      @NS
      ATGGCTG
      +
      @#FBH66	2

      the output comes out as

      @NS
      ATGGCTG
      +
      @#FBH66	68

      How do I fix that? And how can I check the memory consumption of the script?
        Just require a space before the number:
        my @counts = $match =~ /\s\d+/g;
Re: merge multiple files giving out of memory error
by choroba (Cardinal) on Feb 26, 2017 at 12:00 UTC
    If you're getting Out of memory errors even for my simple solution, you can try the following one. It processes files in bunches, saves the intermediate results, and in each step, it merges the new bunch into the saved result. It's much slower, but it can handle any number of files, provided you have enough memory to load at least two files at the same time.

    The first parameter is the bunch size. Setting it to 1 consumes the least memory but is the slowest; setting it to the number of files makes it almost equivalent to the previous solution (it skips the merging phase completely).

    For DNA length 18, size 10_000, and 1_000 files, the results were the following (my machine has 8GB of RAM):
                old solution   bunch_size=100   bunch_size=250   bunch_size=500   bunch_size=1000
    Memory      34.2%          9.6%             24.5%            47.5%            86.5%
    Time        2m 6s          6m 20s           4m 12s           3m 8s            2m 45s

    The code:
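    (choroba's full listing isn't reproduced here. The following is only a rough sketch of the bunch-and-merge approach described above: the intermediate result is kept on disk in the same block format and merged with each new bunch. The temporary file name is made up.)

    #!/usr/bin/perl
    use warnings;
    use strict;

    use List::Util qw{ sum };

    my $bunch_size = shift;                  # 1st argument: bunch size
    my @files      = @ARGV;
    my $saved      = 'merged.tmp';           # intermediate result on disk

    $/ = q();                                # paragraph mode everywhere

    open my $init, '>', $saved or die $!;    # start with an empty result
    close $init;

    while (my @bunch = splice @files, 0, $bunch_size) {

        # Load one bunch into memory: key = DNA line, value = the first
        # block seen with all of its counts appended after tabs.
        my %new;
        for my $file (@bunch) {
            open my $in, '<', $file or die "$file: $!";
            while (my $block = <$in>) {
                my @lines   = split /\n/, $block;
                my $key     = $lines[1];
                my ($count) = $lines[3] =~ /\s(\d+)/;
                unless (exists $new{$key}) {
                    $block =~ s/\n\n?$//;
                    $block =~ s/\s*\d+$//;
                    $new{$key} = $block;
                }
                $new{$key} .= "\t$count";
            }
            close $in;
        }

        # Stream through the saved result, merging the new bunch into it.
        open my $old, '<', $saved        or die $!;
        open my $out, '>', "$saved.next" or die $!;
        while (my $block = <$old>) {
            $block =~ s/\n+$//;
            my $key = (split /\n/, $block)[1];
            if (exists $new{$key}) {
                my ($counts) = delete($new{$key}) =~ /((?:\t\d+)+)$/;
                $block .= $counts;
            }
            print {$out} $block, "\n\n";
        }
        print {$out} $_, "\n\n" for values %new;    # reads not seen before
        close $old;
        close $out;
        rename "$saved.next", $saved or die $!;
    }

    # Final pass: sum the collected counts for each read.
    open my $res, '<', $saved or die $!;
    while (my $block = <$res>) {
        $block =~ s/\n+$//;
        my $total = sum($block =~ /\t(\d+)/g);
        print $block, "\tcount:$total\n\n";
    }
    close $res;
    unlink $saved;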
