I have the following code to compare sequences across multiple large files. It works fine on smaller inputs, but I want it to run on any number of files, however large they are. When I tried it on more than 10GB of data, the program got killed. Example files are shown below. Counts are summed only when the second line of a record (the sequence) matches between files, and I want to output the sum only for those records whose second line matches in all of the files.


data1.txt
@NS500278
AGATCNGAA
+
=CCGGGCGG   1

@NS500278
TACAGNGAG
+
CCCGGGGGG   2

@NS500278
CATTGNACC
+
CCCGGGGGG   3

data2.txt

@NS500278
AGATCNGAA
+
=CCGGGCGG   1

@NS500278
CATTGNACC
+
CCCG#GGG#   2

@NS500278
TACAGNGAG
+
CC=GGG#GG   2


output:

@NS500278
AGATCNGAA
+
=GGGGGCCG     1:data1.txt.out    1:data2.txt.out    count:2

@NS500278
CATTGNACC
+
CCCGGGGGG 3:data1.txt.out     2:data2.txt.out    count:5

@NS500278
TACAGNGAG
+
CCCGGGGGG 2:data1.txt.out     2:data2.txt.out    count:4

My code is:

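#!/usr/bin/env perl
use strict;
use warnings;

# Count the input files now: <> shifts each name off @ARGV as it opens it,
# so @ARGV is empty once the read loop has finished.
my $file_count = @ARGV;

my %seen;

$/ = "";    # paragraph mode: one blank-line-separated record per read
while (<>) {
    chomp;
    my ( $key, $value ) = split /\t/, $_;
    next unless defined $value;

    # The second line of the record is the sequence used for matching.
    my $key1 = ( split /\n/, $key )[1];
    next unless defined $key1;

    # Element 0 holds the record text; the rest are "count:file" values.
    $seen{$key1} //= [$key];
    push @{ $seen{$key1} }, $value;
}

foreach my $key1 ( sort keys %seen ) {
    my ( $record, @values ) = @{ $seen{$key1} };

    # Skip sequences that were not seen in every file.
    next if @values < $file_count;

    # Sum the leading number of each "count:file" value.
    my $tot = 0;
    $tot += ( split /:/, $_ )[0] for @values;

    print join( "\t", $record, @values );
    print "\tcount:$tot\n\n";
}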
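One direction that might keep memory use flat is to tie the hash to a temporary file on disk, as the thread title suggests. The following is only a rough sketch under assumptions, not a tested solution for 10GB inputs: it assumes the DB_File module (Berkeley DB) is installed, the helper name tally_files and the file name seen.db are made up for illustration, and the input is assumed to be the same tab-separated "record, count:file" format the code above reads. A DBM value has to be a plain string, so the values are appended to one tab-joined string instead of being pushed onto an array.

```perl
use strict;
use warnings;
use Fcntl;                       # O_RDWR, O_CREAT
use DB_File;                     # assumption: DB_File is available
use File::Temp qw(tempdir);

# Hypothetical helper: prints every record whose sequence occurs in all files.
sub tally_files {
    my ( $out_fh, @files ) = @_;
    my $file_count = @files;

    # The hash lives in a temporary B-tree file instead of in RAM.
    # A B-tree keeps its keys in sorted order, so no in-memory sort is needed.
    my $dir = tempdir( CLEANUP => 1 );
    tie my %seen, 'DB_File', "$dir/seen.db", O_RDWR | O_CREAT, 0600, $DB_BTREE
        or die "Cannot tie hash: $!";

    local @ARGV = @files;
    local $/    = "";            # paragraph mode
    while (<>) {
        chomp;
        my ( $record, $value ) = split /\t/, $_;
        next unless defined $value;
        my $seq = ( split /\n/, $record )[1];
        next unless defined $seq;

        # Store "record<TAB>value<TAB>value..." as one flat string.
        my $old = $seen{$seq};
        $seen{$seq} = defined $old ? "$old\t$value" : "$record\t$value";
    }

    # each() walks the file one pair at a time instead of loading all keys.
    while ( my ( $seq, $packed ) = each %seen ) {
        my ( $record, @values ) = split /\t/, $packed;
        next if @values < $file_count;

        my $tot = 0;
        $tot += ( split /:/, $_ )[0] for @values;
        print {$out_fh} join( "\t", $record, @values ), "\tcount:$tot\n\n";
    }
    untie %seen;
    return;
}

tally_files( \*STDOUT, @ARGV ) if @ARGV;
```

It would be run the same way as the original script, for example with data1.txt.out and data2.txt.out as arguments. Disk access is slower than an in-memory hash, so this trades speed for memory.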

In reply to Re^2: storing hash in temporary files to save memory usage by Anonymous Monk
in thread storing hash in temporary files to save memory usage by Anonymous Monk
