in reply to Capturing Non-Zero Elements, Counts and Indexes of Sparse Matrix
This won't be the fastest solution in the world, but it will handle an input file of any size, provided you have room in memory for the result set and room on disk for some temporary files. Beyond that it requires only minimal memory.
It basically makes two passes: the first splits each row of the input across one temporary file per column; the second reads each column file back and picks out the non-zero elements.
If the result set itself poses a memory problem, the results could instead be written out as they are accumulated (a sketch of that variant follows the listing below).
#! perl -slw
use strict;

use constant TEMPNAME => 'temp,out.';

## Pass 1: split the first row across one temporary file per column
my @row = split ' ', scalar <>;
my @fhs;
open $fhs[ $_ ], '+>', TEMPNAME . $_ for 0 .. $#row;
print { $fhs[ $_ ] } $row[ $_ ] for 0 .. $#row;

## ... and do the same for every remaining row
while( <> ) {
    @row = split;
    print { $fhs[ $_ ] } $row[ $_ ] for 0 .. $#row;
}

## Pass 2: read each column file back, recording the row index and value
## of every non-zero element plus a running count per column
my( $i, @cCounts, @iRows, @nonZs ) = ( 0, 0 );

for my $fh ( @fhs ) {
    seek $fh, 0, 0;
    my $count = 0;
    while( <$fh> ) {
        chomp;
        next unless 0+$_;
        ++$count;
        $iRows[ $i ] = $. - 1;
        $nonZs[ $i ] = $_;
        ++$i;
    }
    push @cCounts, $cCounts[ $#cCounts ] + $count;
}

print "@$_" for \( @cCounts, @iRows, @nonZs );

close $_ for @fhs;
unlink TEMPNAME . $_ for 0 .. $#fhs;

__END__

C:\test>791009 sample.dat
0 2 5 9 10 12
0 1 0 2 4 1 2 3 4 2 1 4
2 3 3 -1 4 4 -3 1 2 2 6 1
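For reference, the three lines of the sample run are the cumulative non-zero counts per column (@cCounts), the row indexes of the non-zero elements taken column by column (@iRows), and the non-zero values themselves (@nonZs). If you would rather not hold @iRows and @nonZs in memory at all, here is a minimal sketch of the "write the results as they are accumulated" variant mentioned above: it keeps only the small per-column counts in memory and prints each non-zero element as a "column row value" line the moment it is found. That output layout is my own choice, not part of the original.

#! perl -slw
## Sketch only: same two-pass scheme, but pass 2 streams its results
## instead of accumulating them in arrays.
use strict;

use constant TEMPNAME => 'temp,out.';

## Pass 1: one temporary file per column, exactly as in the listing above
my @row = split ' ', scalar <>;
my @fhs;
open $fhs[ $_ ], '+>', TEMPNAME . $_ for 0 .. $#row;
print { $fhs[ $_ ] } $row[ $_ ] for 0 .. $#row;

while( <> ) {
    @row = split;
    print { $fhs[ $_ ] } $row[ $_ ] for 0 .. $#row;
}

## Pass 2: print each non-zero element as soon as it is seen
my @cCounts = ( 0 );

for my $col ( 0 .. $#fhs ) {
    my $fh = $fhs[ $col ];
    seek $fh, 0, 0;
    my $count = 0;
    while( <$fh> ) {
        chomp;
        next unless 0+$_;
        ++$count;
        print join ' ', $col, $. - 1, $_;   ## column, row, value
    }
    push @cCounts, $cCounts[ -1 ] + $count;
}

print "@cCounts";   ## cumulative non-zero counts per column

close $_ for @fhs;
unlink TEMPNAME . $_ for 0 .. $#fhs;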
The only thing to watch for: if your data has a very large number of columns--more than roughly 4000--some systems may baulk at having that many files open concurrently.
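If that limit does bite, one possible workaround (not something I've needed here, so treat it as a sketch) is to process the columns in chunks, re-reading the input file once per chunk so that only a few hundred temporary files are ever open at the same time. The chunk size, the command-line handling and the streamed column/row/value output layout are all illustrative choices:

#! perl -slw
## Sketch only: process columns CHUNK at a time to stay under per-process
## open-file limits; costs one extra read of the input per chunk.
use strict;

use constant TEMPNAME => 'temp,out.';
use constant CHUNK    => 512;

my $file = shift or die "usage: $0 datafile\n";

## Count the columns from the first row
open my $peek, '<', $file or die "$file: $!";
my $cols = () = split ' ', scalar <$peek>;
close $peek;

my @cCounts = ( 0 );

for( my $lo = 0; $lo < $cols; $lo += CHUNK ) {
    my $hi = $lo + CHUNK - 1;
    $hi = $cols - 1 if $hi > $cols - 1;

    ## Pass 1, restricted to this chunk of columns
    my @fhs;
    open $fhs[ $_ - $lo ], '+>', TEMPNAME . $_ for $lo .. $hi;

    open my $in, '<', $file or die "$file: $!";
    while( <$in> ) {
        my @row = split;
        print { $fhs[ $_ - $lo ] } $row[ $_ ] for $lo .. $hi;
    }
    close $in;

    ## Pass 2, streaming the non-zero elements for this chunk
    for my $col ( $lo .. $hi ) {
        my $fh = $fhs[ $col - $lo ];
        seek $fh, 0, 0;
        my $count = 0;
        while( <$fh> ) {
            chomp;
            next unless 0+$_;
            ++$count;
            print join ' ', $col, $. - 1, $_;   ## column, row, value
        }
        push @cCounts, $cCounts[ -1 ] + $count;
        close $fh;
        unlink TEMPNAME . $col;
    }
}

print "@cCounts";   ## cumulative non-zero counts per column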
For comparison purposes, it took around 4 minutes to process a 1,000-column x 10,000-row dataset. (Although the filesystem was still flushing its caches to disk for several minutes after that completed :)