in reply to Re: upper or lower triangular matrix to full
in thread upper or lower triangular matrix to full

The following still takes 90s for size 10_000, and 800s for size 20_000 on my machine (with some random tuning of the LOAD_AT_ONCE constant). 640_000 would still take several years. Nevertheless, I haven't been able to find a faster solution.
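For a rough feel of how those two timings scale, here is a minimal sketch that fits a power law t = c * n**k through them and extrapolates. It assumes the exponent stays constant, which is optimistic once the data no longer fits in the disk cache. The names `%timing` and `estimate_seconds` are made up for this illustration.

```perl
use strict;
use warnings;

# Timings quoted in the post above: matrix size => seconds.
my %timing = (10_000 => 90, 20_000 => 800);

# Exponent of a power law t = c * n**k through the two measurements.
my $k = log($timing{20_000} / $timing{10_000}) / log(20_000 / 10_000);

# Extrapolate from the larger measurement, assuming k stays the same.
sub estimate_seconds {
    my ($n) = @_;
    return $timing{20_000} * ($n / 20_000) ** $k;
}

my $years = estimate_seconds(640_000) / (365.25 * 24 * 3600);
printf "exponent: %.2f, estimate for 640_000: %.1f years\n", $k, $years;
```

The fitted exponent comes out a little above 3, so every doubling of the size costs roughly a factor of nine, and the estimate for 640_000 is on the order of years rather than days.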

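#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
use IO::Handle;    # for $IN->input_line_number on older perls

my $LOAD_AT_ONCE = 500;    # number of input rows buffered before flushing

my $filename = shift;
open my $IN, '<', $filename or die $!;

my @part;

# Append the buffered rows to one temporary file per column ("$$.$i"),
# one value per line, then empty the buffer.
sub out {
    for my $i (0 .. $#{ $part[0] }) {
        open my $OUT, '>>', "$$.$i" or die $!;
        say {$OUT} join "\n", map $_->[$i], @part;
    }
    @part = ();
}

# Phase 1: split the input into per-column files.
while (<$IN>) {
    push @part, [ split ' ' ];
    out() if @part == $LOAD_AT_ONCE;
    print STDERR "Phase 1: ", $IN->input_line_number, "\r";
}
out() if @part;
print STDERR "\n";

# Phase 2: output row $i is column $i down to the diagonal (everything
# before the first NA), followed by the rest of input row $i.
my @files = glob "$$.*";
seek $IN, 0, 0;
for my $i (0 .. $#files) {
    print STDERR "Phase 2: $i\r";
    open my $COL, '<', "$$.$i" or die $!;
    while (<$COL>) {
        chomp;
        last if $_ eq 'NA';
        print "$_ ";
    }
    my @rest = (split ' ', <$IN>)[ 1 + $i .. $#files ];
    say "@rest";
}
unlink glob "$$.*";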

I used the following code to generate the input matrix:

my $SIZE = 1000;

sub create_matrix {
    my ($filename) = @_;
    open my $OUT, '>', $filename or die $!;
    for my $i (1 .. $SIZE) {
        for my $j (1 .. $SIZE) {
            print {$OUT} $i <= $j ? $i * $j : 'NA';
            print {$OUT} ' ' unless $SIZE == $j;
        }
        print {$OUT} "\n";
    }
}

($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

Re^3: upper or lower triangular matrix to full
by holli (Abbot) on Sep 02, 2017 at 07:05 UTC
    You and others in this thread silently assume that "matrix" means a delimited data file. What if the file looks like this:
    0001 1202 3030 ... 8491 9382 9381 ...
    In such a fixed-length case you don't need any memory (well, kinda) and can just do the task by seek()ing to the appropriate positions on disk.
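
    A minimal sketch of that idea, under assumptions the thread does not confirm: every field is exactly `$WIDTH` bytes wide and followed by one separator byte (a space or a newline), the matrix is square, and the upper triangle is the one that is stored. `read_cell` and `full_cell` are hypothetical helper names, and the in-memory file only stands in for the real data file.

```perl
use strict;
use warnings;

my $WIDTH = 4;    # assumed field width in bytes; one separator byte follows each field

# Read the cell at ($row, $col) of an $n x $n fixed-width matrix
# without reading anything else from the file.
sub read_cell {
    my ($fh, $n, $row, $col) = @_;
    my $offset = ($row * $n + $col) * ($WIDTH + 1);
    seek $fh, $offset, 0 or die "seek: $!";
    my $got = read $fh, my $cell, $WIDTH;
    die "short read" unless defined $got && $got == $WIDTH;
    return $cell;
}

# Only the upper triangle is stored; mirror the lower one from it.
sub full_cell {
    my ($fh, $n, $row, $col) = @_;
    ($row, $col) = ($col, $row) if $row > $col;
    return read_cell($fh, $n, $row, $col);
}

# Tiny demonstration on an in-memory "file".
my $n    = 3;
my $data = join '', map { $_ . "\n" }
    '0001 0002 0003',
    '  NA 0004 0006',
    '  NA   NA 0009';
open my $fh, '<', \$data or die $!;
for my $r (0 .. $n - 1) {
    print join(' ', map { full_cell($fh, $n, $r, $_) } 0 .. $n - 1), "\n";
}
```

    Memory use stays constant, but every cell below the diagonal costs one seek, so the access pattern is still column-wise and will be slow once the file outgrows the disk cache.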

    We won't know unless the OP tells us.


    holli

    You can lead your users to water, but alas, you cannot drown them.
      I originally started with
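      sub fill_matrix {
          my ($in) = @_;
          open my $IN, '<', $in or die $!;

          # Remember the byte offset at which each line starts.
          my @index = (0);
          push @index, tell $IN while <$IN>;
          pop @index;    # the last offset is end of file, not a line start

          for my $line_no (0 .. $#index) {
              print STDERR "$line_no\r";

              # Lower triangle: take field $line_no from each earlier line.
              for my $idx (0 .. $line_no - 1) {
                  seek $IN, $index[$idx], 0;
                  my $line = <$IN>;
                  print +(split ' ', $line, $line_no + 2)[$line_no], ' ';
              }

              # Diagonal and upper triangle: the tail of the current line,
              # which still carries its newline.
              seek $IN, $index[$line_no], 0;
              my $line = <$IN>;
              print +(split ' ', $line, $line_no + 1)[-1];
          }
      }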

      but it was much slower: 28s for SIZE 1000, 280s for SIZE 2000.

      ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
        If the fields were fixed-width, you wouldn't need to read an entire line to get a single value. That really starts to bite you when the lines get long. But the seeking is still going to kill performance once you run out of disk cache.
Re^3: upper or lower triangular matrix to full
by Anonymous Monk on Sep 02, 2017 at 00:46 UTC
    It helps if you combine several columns into each of the tempfiles. But yeah, slinging around this much data is... challenging.
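
    One way to read that suggestion, as a sketch rather than a tested replacement for the script above: each tempfile holds a group of adjacent columns, one line per input row, so there are fewer files to open and append to. `$COLS_PER_FILE`, `write_groups` and `read_column` are hypothetical names introduced here, and the temporary directory is only for the demonstration.

```perl
use strict;
use warnings;
use File::Temp qw{ tempdir };

my $COLS_PER_FILE = 2;    # hypothetical tuning knob, not from the original script

# Append a chunk of rows to the group files. File $g receives, for each
# row, the fields of columns $g * $COLS_PER_FILE and the following
# $COLS_PER_FILE - 1 columns (fewer in the last group).
sub write_groups {
    my ($dir, @rows) = @_;
    my $last_col = $#{ $rows[0] };
    for my $g (0 .. int($last_col / $COLS_PER_FILE)) {
        my $from = $g * $COLS_PER_FILE;
        my $to   = $from + $COLS_PER_FILE - 1;
        $to = $last_col if $to > $last_col;
        open my $OUT, '>>', "$dir/$g" or die $!;
        print {$OUT} join(' ', @$_[ $from .. $to ]), "\n" for @rows;
        close $OUT or die $!;
    }
}

# Return column $col by reading only the group file that holds it.
sub read_column {
    my ($dir, $col) = @_;
    my $g = int($col / $COLS_PER_FILE);
    open my $IN, '<', "$dir/$g" or die $!;
    my @column;
    while (<$IN>) {
        push @column, (split ' ')[ $col % $COLS_PER_FILE ];
    }
    return @column;
}

my $dir = tempdir(CLEANUP => 1);
write_groups($dir, [qw{ 1 2 3 }], [qw{ NA 4 6 }]);    # first chunk of rows
write_groups($dir, [qw{ NA NA 9 }]);                  # second chunk
print join(' ', read_column($dir, 2)), "\n";
```

    The trade-off is that reading one column now drags in its neighbours, so a real second phase would probably read each group file once and emit all of its output rows together.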