Re^3: Memory issue with cancer data (analogy)

See how you get on with this:

#! perl -sw
use strict;

<>; #discard header line
my %table;
my %lengths;
while( <> ) {
    my( $gene, $id, undef, $site, $len ) = split;
    my( $pos ) = $site =~ m[(\d+)];       ## extract the digits from the site
    undef $table{ $gene }{ $pos }{ $id }; ## adds the id as a key with no value (saves space!)
    $lengths{ $gene } = $len;             ## Save the gene lengths for later
}

#print 'output header line here if required';
for my $gene ( sort keys %table ) {
    print "$gene";
    my $p = 1;
    for my $pos ( sort{ $a <=> $b } keys %{ $table{ $gene } } ) {
        print "\t0" x ( $pos - $p ), "\t", scalar keys %{ $table{ $gene }{ $pos } };
        $p = $pos + 1;
    }
    print "\t0" x ( $lengths{ $gene } - $p + 1 ), "\n"; ## pad positions $p .. length (assumes 1-based positions)
}

Invoke it as thisScript.pl < theInputFile > theOutputFile. It shouldn't take more than a minute or two to run.
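To check the output by eye, here is a minimal sketch of the same counting idea run on a few made-up rows. The sample data, the `build_rows` name and the exact column layout (gene, id, an ignored field, site, length) are assumptions for illustration only; it also assumes positions are 1-based and that the length is in the same units as the site positions.

```perl
#! perl -w
use strict;

## Hypothetical sample rows: gene, sample id, (unused), site, gene length.
my @rows = (
    "GENE1 S1 x p.A3T 6",
    "GENE1 S2 x p.A3V 6",
    "GENE1 S1 x p.G5R 6",
    "GENE2 S3 x p.L2P 4",
);

## Returns one tab-separated line per gene: the gene name followed by
## one count of distinct ids for every position 1 .. length.
sub build_rows {
    my( %table, %lengths, @out );
    for ( @_ ) {
        my( $gene, $id, undef, $site, $len ) = split;
        my( $pos ) = $site =~ m[(\d+)];       ## digits from the site
        undef $table{ $gene }{ $pos }{ $id }; ## key only, no value
        $lengths{ $gene } = $len;
    }
    for my $gene ( sort keys %table ) {
        my @counts = ( 0 ) x $lengths{ $gene };
        $counts[ $_ - 1 ] = keys %{ $table{ $gene }{ $_ } }
            for keys %{ $table{ $gene } };
        push @out, join "\t", $gene, @counts;
    }
    return @out;
}

print "$_\n" for build_rows( @rows );
```

For these rows GENE1 gets six count columns (2 at position 3, 1 at position 5) and GENE2 gets four (1 at position 2). Unlike the streaming version, this builds a full array per gene, so it is only meant for small test inputs.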

It'll probably need tweaking, such as adding an appropriate header line if one is required. I couldn't work out what that header should look like: the genes are different lengths, so every output line has a different number of columns.
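If a header is required, one possible approach (an assumption on my part, not something the original data format dictates) is to number the columns up to the longest gene and pad the shorter rows with empty fields. The `header_line` and `pad_row` names and the sample lengths below are hypothetical; in the script the lengths would come from %lengths.

```perl
#! perl -w
use strict;
use List::Util qw[ max ];

## Hypothetical gene lengths, standing in for the script's %lengths.
my %lengths = ( GENE1 => 6, GENE2 => 4 );

## Header: a label column followed by position numbers 1 .. longest gene.
sub header_line {
    my( $lengths ) = @_;
    my $widest = max( values %$lengths );
    return join "\t", 'gene', 1 .. $widest;
}

## Pad a finished row (gene name plus counts) with empty fields
## so that it has $width count columns.
sub pad_row {
    my( $row, $width ) = @_;
    my @fields = split /\t/, $row, -1;            ## includes the gene name
    push @fields, ( '' ) x ( $width + 1 - @fields );
    return join "\t", @fields;
}

print header_line( \%lengths ), "\n";
print pad_row( "GENE2\t0\t1\t0\t0", max( values %lengths ) ), "\n";
```

Padding with empty fields rather than zeros keeps "past the end of this gene" distinguishable from "no mutations at this position". Note that %lengths is only complete after the whole input has been read, so the header has to be printed after the read loop, not before it.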

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.
