This won't be the fastest solution in the world, but it will handle any size of input file, provided you have room in memory for the results set and room on disk for some temporary files. Beyond that, it requires only minimal memory.

It basically does two passes.

  1. Read the file one line at a time and write each column to a separate file.
  2. Then read those files in order, and accumulate the required data.

    If the results set itself poses a memory problem, then the results could be written as they are accumulated.
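As a rough illustration of writing the results as they are accumulated, here is a minimal sketch of the second pass that appends row indices and values to output handles instead of growing arrays. The helper name `stream_column` and the in-memory handles are assumptions made for the demo; in practice the handles would be the temporary column files and two real output files.

```perl
use strict;
use warnings;

# Hypothetical helper: scan one column file and append its non-zero
# entries to two output handles instead of growing in-memory arrays.
# Returns the number of non-zero values seen in this column.
sub stream_column {
    my( $in, $rows_out, $vals_out ) = @_;
    my( $row, $count ) = ( 0, 0 );
    while( my $val = <$in> ) {
        chomp $val;
        if( 0 + $val ) {
            print { $rows_out } "$row\n";
            print { $vals_out } "$val\n";
            ++$count;
        }
        ++$row;
    }
    return $count;
}

# Demo: in-memory handles stand in for the temporary column files
# and for the two results files (an assumption for illustration).
my @columns = ( "2\n3\n0\n0\n0\n", "3\n0\n-1\n0\n4\n" );
my( $rows_buf, $vals_buf ) = ( '', '' );
open my $rows_out, '>', \$rows_buf or die $!;
open my $vals_out, '>', \$vals_buf or die $!;

# Only the cumulative column counts stay in memory.
my @cCounts = ( 0 );
for my $col ( @columns ) {
    open my $in, '<', \$col or die $!;
    push @cCounts, $cCounts[ -1 ] + stream_column( $in, $rows_out, $vals_out );
    close $in;
}
close $_ for $rows_out, $vals_out;

print "@cCounts\n";
```

With this arrangement only the cumulative counts, one number per column, are held in memory.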

    #! perl -slw
    use strict;

    use constant TEMPNAME => 'temp,out.';

    ## Pass 1: the first line tells us how many columns (and temp files) we need
    my @row = split ' ', scalar <>;

    my @fhs;
    open $fhs[ $_ ], '+>', TEMPNAME . $_ for 0 .. $#row;

    print { $fhs[ $_ ] } $row[ $_ ] for 0 .. $#row;

    ## Write each remaining row's values to the per-column files
    while( <> ) {
        @row = split;
        print { $fhs[ $_ ] } $row[ $_ ] for 0 .. $#row;
    }

    ## Pass 2: reread each column file, recording non-zero values and their rows
    my( $i, @cCounts, @iRows, @nonZs ) = ( 0, 0 );

    for my $fh ( @fhs ) {
        seek $fh, 0, 0;
        my $count = 0;
        while( <$fh> ) {
            chomp;
            next unless 0+$_;
            ++$count;
            $iRows[ $i ] = $. - 1;    ## zero-based row index
            $nonZs[ $i ] = $_;
            ++$i;
        }
        ## Cumulative count of non-zeros up to and including this column
        push @cCounts, $cCounts[ $#cCounts ] + $count;
    }

    print "@$_" for \( @cCounts, @iRows, @nonZs );

    close $_ for @fhs;
    unlink TEMPNAME . $_ for 0 .. $#fhs;

    __END__
    C:\test>791009 sample.dat
    0 2 5 9 10 12
    0 1 0 2 4 1 2 3 4 2 1 4
    2 3 3 -1 4 4 -3 1 2 2 6 1
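To show how the three printed lines relate to the original matrix, here is a small sketch that expands them back into dense rows. The first line holds cumulative non-zero counts per column, the second the row index of each non-zero, and the third the values. The row count is not part of the output, so the sketch infers it from the largest row index, which assumes sample.dat had no trailing all-zero rows.

```perl
use strict;
use warnings;

# The three arrays printed by the script for sample.dat.
my @cCounts = qw( 0 2 5 9 10 12 );
my @iRows   = qw( 0 1 0 2 4 1 2 3 4 2 1 4 );
my @nonZs   = qw( 2 3 3 -1 4 4 -3 1 2 2 6 1 );

# Assumption: infer the row count from the largest row index seen.
my $nRows = 0;
for my $r ( @iRows ) {
    $nRows = $r + 1 if $r >= $nRows;
}
my $nCols = $#cCounts;    # one fewer than the number of cumulative counts

# Start from an all-zero matrix, then drop each non-zero into place.
my @dense = map { [ ( 0 ) x $nCols ] } 1 .. $nRows;
for my $col ( 0 .. $nCols - 1 ) {
    # Entries $cCounts[$col] .. $cCounts[$col+1]-1 belong to this column.
    for my $k ( $cCounts[ $col ] .. $cCounts[ $col + 1 ] - 1 ) {
        $dense[ $iRows[ $k ] ][ $col ] = $nonZs[ $k ];
    }
}

print "@$_\n" for @dense;
```

Under that assumption this prints a 5 x 5 matrix whose first row is `2 3 0 0 0`.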

The only thing to watch for: if your data contains a really huge number of columns (more than ~4000), some systems may baulk at having that many files open concurrently.
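One possible workaround, not part of the script above, is to split the columns in batches so that only a limited number of files are open at once, at the cost of rereading the input once per batch. A minimal sketch, where `$BATCH` is a hypothetical cap and in-memory scalars stand in for both the input file and the temporary column files:

```perl
use strict;
use warnings;

# Hypothetical cap on simultaneously open column files; tune per system.
my $BATCH = 2;

# In-memory stand-in for the input file (3 rows x 5 columns).
my $input = "2 3 0 0 0\n3 0 4 0 6\n0 -1 -3 2 0\n";

# Count the columns from the first line.
open my $probe, '<', \$input or die $!;
my @first = split ' ', scalar <$probe>;
close $probe;
my $nCols = @first;

# In-memory stand-ins for the temporary column files.
my @colData = ( '' ) x $nCols;

for( my $lo = 0; $lo < $nCols; $lo += $BATCH ) {
    my $hi = $lo + $BATCH - 1;
    $hi = $nCols - 1 if $hi > $nCols - 1;

    # Open only this batch of column files.
    my %fh;
    for my $c ( $lo .. $hi ) {
        open $fh{ $c }, '>', \$colData[ $c ] or die $!;
    }

    # One full pass over the input per batch.
    open my $in, '<', \$input or die $!;
    while( my $line = <$in> ) {
        my @row = split ' ', $line;
        print { $fh{ $_ } } "$row[ $_ ]\n" for $lo .. $hi;
    }
    close $in;
    close $_ for values %fh;
}

print "column $_: ", join( ' ', split /\n/, $colData[ $_ ] ), "\n" for 0 .. $nCols - 1;
```

The input is read once per batch, so a larger `$BATCH` means fewer passes but more files open at the same time.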

For comparison purposes, it took around 4 minutes to process a 1,000-column by 10,000-row dataset. (Although the filesystem was still flushing its caches to disc for several minutes after that completed :)



In reply to Re: Capturing Non-Zero Elements, Counts and Indexes of Sparse Matrix by BrowserUk
in thread Capturing Non-Zero Elements, Counts and Indexes of Sparse Matrix by neversaint
