savita has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I have a file that has 460000 rows and 460000 columns. The upper triangle of this matrix has the actual values I'm interested in and the other cells contain "NA". My aim is to fill in the lower triangle as well. I tried reading in the file into an array of arrays and assigning  mat[j,i]=mat[i,j] where i and j are row and column indices, however I quickly ran out of memory just reading the file into a 'matrix'! I can't think of a way of assigning values to the lower triangle without reading in the entire file. Does someone know a way of achieving this? Any pointers would be appreciated. Many Thanks!

Replies are listed 'Best First'.
Re: upper or lower triangular matrix to full
by haukex (Archbishop) on Sep 01, 2017 at 14:53 UTC

    Please see How do I post a question effectively? and provide a Short, Self-Contained, Correct Example with a small matrix (say, 10x10) that demonstrates what you are trying to do.

    If you're running out of memory just reading the file, then how you might optimize the operations you want to perform depends on what those operations are, which you haven't fully described.

    In general, PDL might be of interest.

      Thanks for your response, and for pointing me to the tips on posting- quite new around here and much appreciate it. Here are some more details, to add to my original question. The matrix looks like this:
      NA 0.0132 0.0116 0.0082 0.0157 0.0246 0.0000 0.0338 0.0000 0.0173 NA NA 0.0084 0.0042 0.0025 0.0126 0.0061 0.0107 0.0214 0.0177 NA NA NA 0.0159 0.0078 0.0216 0.0133 0.0080 0.0029 0.0067 NA NA NA NA 0.0185 0.0362 0.0167 0.0266 0.0000 0.0093 NA NA NA NA NA 0.0268 0.0046 0.0134 0.0000 0.0000 NA NA NA NA NA NA 0.0063 0.0112 0.0138 0.0000 NA NA NA NA NA NA NA 0.0128 0.0239 0.0000 NA NA NA NA NA NA NA NA 0.0102 0.0188 NA NA NA NA NA NA NA NA NA 0.0088
      The elements are tab separated. All I want in the end, is for the NAs to be replaced by the numbers, no other operations need to be done. I need this because a program I want to run requires the input formatted in this way. I will take a look at PDL as sugegsted. Thanks again, Savita
        The first thing to check is that the data consumer (which the OP notes is not them) may be able to actually process the upper-triangular matrix as-is; look at LAPACK, which has routines that operate on triangular matrices directly without needing to copy - in other words, this task may not actually be needed.

        If it is, the approach I would take using PDL is to pre-allocate a correctly-sized disk-based ndarray (so that RAM is not a limitation) using PDL::IO::FastRaw (untested), from perldl:

        use PDL::IO::FastRaw; $pdl = mapfraw('fname', {Creat => 1, Dims => [460_000, 460_000], Datat +ype => double()});
        It looks like currently PDL::IO::Misc (with rcols) and PDL::IO::CSV don't have a facility to read into an existing (possibly disk-based) ndarray. It would still be possible, albeit slow, to read the file line-by-line, probably using Text::CSV_XS, using slice to insert each row into the ndarray.

        Once the ndarray exists, to copy one triangle into the other would be super-easy using PDL::LinearAlgebra::Real#tricpy.

Re: upper or lower triangular matrix to full
by NetWallah (Canon) on Sep 01, 2017 at 18:08 UTC
      Thanks, will try!
Re: upper or lower triangular matrix to full
by roboticus (Chancellor) on Sep 02, 2017 at 18:34 UTC

    savita:

    As others have noted, you're talking about a rather large dataset. I'd suggest not even bothering to reflect the data around the diagonal, and instead manipulate your indices to fetch the data:

    $ cat pm_1198523.pl #!env perl use strict; use warnings; my $rData = [ [ 1, 3, 5, 7, 9 ], [ 0, 2, 4, 6, 8 ], [ 0, 0, 3, 6, 9 ], [ 0, 0, 0, 8, 4 ], [ 0, 0, 0, 0, 3 ] ]; bless $rData, 'foo'; print $rData->fetch(1,4), "\n"; print $rData->fetch(4,1), "\n"; print $rData->fetch(3,1), "\n"; print $rData->fetch(1,3), "\n"; package foo; sub fetch { my ($self, $r, $c) = @_; ($r, $c) = ($c, $r) if $c < $r; return $self->[$r][$c]; } $ perl pm_1198523.pl 8 8 6 6

    Also, if you're building the file containing the data yourself, I'd suggest making all your data elements the same size so you can access the data directly from the file via seek.

    If you'd like to reduce the size of your data file, and you're building it with fixed length elements, then you don't even need to store the lower "NA" triangle: instead you can use the idea above to fetch the data you're wanting by swapping the indices, and compute the seek address directly to compress out the lower triangle, like this:

    $ cat pm_1198523_2.pl #!env perl use strict; use warnings; my ($R, $C, $sz); # Row, Column and element size $sz = 5; for my $r ([0,0], [1,0], [1,1], [2,0], [2,1], [2,2], [12,5]) { ($R,$C) = @$r; my $seek_addr = seek_addr($R, $C, $sz); print "[$R,$C] => $seek_addr\n"; } sub seek_addr { my ($R, $C, $sz) = @_; ($R,$C) = ($C,$R) if $R < $C; my $slot = $R*($R+1)/2 + $C; return $sz * $slot; } $ perl pm_1198523_2.pl [0,0] => 0 [1,0] => 5 [1,1] => 10 [2,0] => 15 [2,1] => 20 [2,2] => 25 [12,5] => 415

    Note: You may want to adjust things depending on how you plan to use the data. Random access of the items is likely to be excruciatingly slow as other monks have mentioned. If you can arrange your algorithm so that you primarily iterate through columns or rows then you can make the disk cache work with you rather than against you. (You may have to swap R/C above depending on your access pattern.)

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      Thanks very much for your detailed response! Yes, I am building the file containing the data myself, so I don't need to store the lower triangle of NAs, as you have pointed out. This will be input to a program (not written by me), so I am not entirely sure how the items will be accessed- I will check with the author, and see if I can order the matrix in a way that the program can read speedily! Thanks again for your helpful advice.
Re: upper or lower triangular matrix to full
by BrowserUk (Patriarch) on Sep 01, 2017 at 15:34 UTC

    With a matrix 460000x460000 and 32bytes per element(scalar+pointer), you would need need over 6.15TB of memory to hold it in Perl.

    Even if you used C, and all your values were less than 256, you'd still need 197GB of memory.

    Do you have such a machine?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
    In the absence of evidence, opinion is indistinguishable from prejudice. Suck that fhit
      Unfortunately not! I have a 256GB machine. Thanks for your response, I will check to see if C is an option!

        If you accumulate and discard as you go, you can substantially reduce the memory requirement.

        Ie. (Using 10x10 as an example) After reading the first row, and duplicating the 9 elements to the lower triangle you need storage for 19 elements:

        +---+---+---+---+---+---+---+---+---+---+ | a | b | c | d | e | f | g | h | i | j | +---+---+---+---+---+---+---+---+---+---+ | b'| +---+ | c'| +---+ | d'| +---+ | e'| +---+ | f'| +---+ | g'| +---+ | h'| +---+ | i'| +---+ | j'| +---+

        But then you can print out the first row to the new file and free up that memory leaving only 9 elements in ram.

        You then read the second row, duplicate the 8 elements, and print out the second row to the new file and discard it. And you now have 8x2=16 in memory.

        The peak memory requirement comes when you have read the 5th row, at which point you will have 5x5 =25 elements in the lower triangle cache, and (breifly) the 10 elements of the 5 row before writing it out and discarding it. So rather than 10x10=100 you have a maximum requirement of (n/2)2 + n = (10/2)2+10 = 35.

        Or for your example (460000/2)2 + 460000 = 52,900,460,000 rather than 4600002 = 211,600,000,000 or ~60% less. Still needs a lot of memory, but it might help if your numbers are small enough to allow a C solution.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
        In the absence of evidence, opinion is indistinguishable from prejudice. Suck that fhit
Re: upper or lower triangular matrix to full
by Laurent_R (Canon) on Sep 02, 2017 at 08:47 UTC
    You don't say how you're going to populate the lower triangle of the matrix, so it is quite difficult to figure out any solution.

    The bottom line is this: can you just read and process one row at a time? If you can, just do it. I'm dealing almost daily with files too large to fit into memory. Most of the time, I get around by reading and processing them one record at a time. But every now and then, it gets more complicated: in such circumstances, there is often no general solution; the possible solution (or solutions) is specific to the problem at hand.

    If you can't process your file line by line, then you'll have to tell us why and be much more specific about what you're trying to do exactly. Some of us might be able to suggest another solution.

Re: upper or lower triangular matrix to full
by thanos1983 (Parson) on Sep 01, 2017 at 15:37 UTC

    Hello savita,

    Although I know very little about matrixes, I was bored and I wanted to wake up my mind so I wrote this. Something like that could do the job (I guess).

    Sample of code:

    Update: Maybe a while loop in writing consumes less memory:

    Update2: I manage to update the code using IO::All maybe this could help:

    Update3: This is the final code that does ( I tried reading in the file into an array of arrays):

    Hope this helps, BR.

    Seeking for Perl wisdom...on the process of learning...not there...yet!
Re: upper or lower triangular matrix to full
by Anonymous Monk on Sep 01, 2017 at 16:19 UTC
    Read as many rows into memory as you can. Append each column to a separate output file (you'll have to close the file afterwards, because you can't hold 460000 file handles open). After processing the whole input file, read the column files in order to get the matrix transpose. Take the lower triangle of that and combine with the upper triangle of the original.
      The following still takes 90s for size 10_000, and 800s for size 20_000 on my machine (with some random tuning of the LOAD_AT_ONCE constant). 640_000 would still take several years. Nevertheless, I haven't been able to find a faster solution.

      #!/usr/bin/perl use warnings; use strict; use feature qw{ say }; my $LOAD_AT_ONCE = 500; my $filename = shift; open my $IN, '<', $filename or die $!; my @part; sub out { for my $i (0 .. $#{ $part[0] }) { open my $OUT, '>>', "$$.$i" or die $!; say {$OUT} join "\n", map $_->[$i], @part; } @part = (); } while (<$IN>) { push @part, [ split ' ' ]; out() if @part == $LOAD_AT_ONCE; print STDERR "Phase 1: ", $IN->input_line_number, "\r"; } out() if @part; print STDERR "\n"; my @files = glob "$$.*"; seek $IN, 0, 0; for my $i (0 .. $#files) { print STDERR "Phase 2: $i\r"; open my $COL, '<', "$$.$i" or die $!; while (<$COL>) { chomp; last if $_ eq 'NA'; print "$_ "; } my @rest = (split ' ', <$IN>)[ 1 + $i .. $#files ]; say "@rest"; } unlink glob "$$.*";

      I used the following code to generate the input matrix:

      my $SIZE = 1000; sub create_matrix { my ($filename) = @_; open my $OUT, '>', $filename or die $!; for my $i (1 .. $SIZE) { for my $j ( 1 .. $SIZE ) { print {$OUT} $i <= $j ? $i * $j : 'NA'; print {$OUT} ' ' unless $SIZE == $j; } print {$OUT} "\n"; } }

      ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
        You and others in this thread silently assume "matrix" meaning a somehow delimited data file. What if the file looks like this:
        0001 1202 3030 ... 8491 9382 9381 ...
        In such a fixed lenght case you don't need any memory (well, kinda) and can just do the task by seek()ing the appropriate positions on disk.

        We won't know unless the OP tells us.


        holli

        You can lead your users to water, but alas, you cannot drown them.
        It helps if you combine several columns into each of the tempfiles. But yeah, slinging around this much data is... challenging.