in reply to Capturing Non-Zero Elements, Counts and Indexes of Sparse Matrix
This won't be the fastest solution in the world, but it will handle an input file of any size, provided you have room in memory for the result set and room on disk for some temporary files. Beyond that it requires only minimal memory.
It basically makes two passes: the first splits each row of the input across one temporary file per column; the second reads each column file back and picks out the non-zero elements.
If the result set itself poses a memory problem, the results could instead be written out as they are accumulated (a sketch of that variant follows the listing below).
#! perl -slw
use strict;

use constant TEMPNAME => 'temp,out.';

## Pass 1: split the first row across one temporary file per column
my @row = split ' ', scalar <>;
my @fhs;
open $fhs[ $_ ], '+>', TEMPNAME . $_ for 0 .. $#row;
print { $fhs[ $_ ] } $row[ $_ ] for 0 .. $#row;

## ... and do the same for every remaining row
while( <> ) {
    @row = split;
    print { $fhs[ $_ ] } $row[ $_ ] for 0 .. $#row;
}

## Pass 2: read each column file back, recording the row index and value
## of every non-zero element plus a running count per column
my( $i, @cCounts, @iRows, @nonZs ) = ( 0, 0 );

for my $fh ( @fhs ) {
    seek $fh, 0, 0;
    my $count = 0;
    while( <$fh> ) {
        chomp;
        next unless 0+$_;
        ++$count;
        $iRows[ $i ] = $. - 1;
        $nonZs[ $i ] = $_;
        ++$i;
    }
    push @cCounts, $cCounts[ $#cCounts ] + $count;
}

print "@$_" for \( @cCounts, @iRows, @nonZs );

close $_ for @fhs;
unlink TEMPNAME . $_ for 0 .. $#fhs;

__END__

C:\test>791009 sample.dat
0 2 5 9 10 12
0 1 0 2 4 1 2 3 4 2 1 4
2 3 3 -1 4 4 -3 1 2 2 6 1
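For reference, the three lines of the sample run are the cumulative non-zero counts per column (@cCounts), the row indexes of the non-zero elements taken column by column (@iRows), and the non-zero values themselves (@nonZs). If you would rather not hold @iRows and @nonZs in memory at all, here is a minimal sketch of the "write the results as they are accumulated" variant mentioned above: it keeps only the small per-column counts in memory and prints each non-zero element as a "column row value" line the moment it is found. That output layout is my own choice, not part of the original.

#! perl -slw
## Sketch only: same two-pass scheme, but pass 2 streams its results
## instead of accumulating them in arrays.
use strict;

use constant TEMPNAME => 'temp,out.';

## Pass 1: one temporary file per column, exactly as in the listing above
my @row = split ' ', scalar <>;
my @fhs;
open $fhs[ $_ ], '+>', TEMPNAME . $_ for 0 .. $#row;
print { $fhs[ $_ ] } $row[ $_ ] for 0 .. $#row;

while( <> ) {
    @row = split;
    print { $fhs[ $_ ] } $row[ $_ ] for 0 .. $#row;
}

## Pass 2: print each non-zero element as soon as it is seen
my @cCounts = ( 0 );

for my $col ( 0 .. $#fhs ) {
    my $fh = $fhs[ $col ];
    seek $fh, 0, 0;
    my $count = 0;
    while( <$fh> ) {
        chomp;
        next unless 0+$_;
        ++$count;
        print join ' ', $col, $. - 1, $_;   ## column, row, value
    }
    push @cCounts, $cCounts[ -1 ] + $count;
}

print "@cCounts";   ## cumulative non-zero counts per column

close $_ for @fhs;
unlink TEMPNAME . $_ for 0 .. $#fhs;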
The only thing to watch for: if your data has a very large number of columns--more than roughly 4000--some systems may baulk at having that many files open concurrently.
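If that limit does bite, one possible workaround (not something I've needed here, so treat it as a sketch) is to process the columns in chunks, re-reading the input file once per chunk so that only a few hundred temporary files are ever open at the same time. The chunk size, the command-line handling and the streamed column/row/value output layout are all illustrative choices:

#! perl -slw
## Sketch only: process columns CHUNK at a time to stay under per-process
## open-file limits; costs one extra read of the input per chunk.
use strict;

use constant TEMPNAME => 'temp,out.';
use constant CHUNK    => 512;

my $file = shift or die "usage: $0 datafile\n";

## Count the columns from the first row
open my $peek, '<', $file or die "$file: $!";
my $cols = () = split ' ', scalar <$peek>;
close $peek;

my @cCounts = ( 0 );

for( my $lo = 0; $lo < $cols; $lo += CHUNK ) {
    my $hi = $lo + CHUNK - 1;
    $hi = $cols - 1 if $hi > $cols - 1;

    ## Pass 1, restricted to this chunk of columns
    my @fhs;
    open $fhs[ $_ - $lo ], '+>', TEMPNAME . $_ for $lo .. $hi;

    open my $in, '<', $file or die "$file: $!";
    while( <$in> ) {
        my @row = split;
        print { $fhs[ $_ - $lo ] } $row[ $_ ] for $lo .. $hi;
    }
    close $in;

    ## Pass 2, streaming the non-zero elements for this chunk
    for my $col ( $lo .. $hi ) {
        my $fh = $fhs[ $col - $lo ];
        seek $fh, 0, 0;
        my $count = 0;
        while( <$fh> ) {
            chomp;
            next unless 0+$_;
            ++$count;
            print join ' ', $col, $. - 1, $_;   ## column, row, value
        }
        push @cCounts, $cCounts[ -1 ] + $count;
        close $fh;
        unlink TEMPNAME . $col;
    }
}

print "@cCounts";   ## cumulative non-zero counts per column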
For comparison purposes, it took around 4 minutes to process a 1,000-column x 10,000-row dataset. (Although the filesystem was still flushing its caches to disk for several minutes after that completed :)