Benbo has asked for the wisdom of the Perl Monks concerning the following question:

Hi there,

I am currently trying to write a script which alters the contents of a 2D array. I first created a 2D array filled with '*':

********************
********************
********************
********************

I then attempted to change parts of the 2D array based on a y value, an x value and a replacement string. So for instance, a change of (1, 2, "test") in the above array would yield

********************
**test**************
********************
********************

This all worked fine, except that the arrays I was dealing with in the real world were 1.3-4 million columns x 100 rows. As you can imagine I pretty soon ran out of memory. So my next idea was to create an output file containing the '*' array and then use Tie::File and string manipulation to alter the rows, to save holding it all in memory. I then hit two problems with Tie::File (version 0.96). Using the following code from the CPAN documentation:

tie my @array, 'Tie::File', $outputFile or die $!;
my $line = $array[2];
print $line . "\n";
untie @array;

Throws a "Use of uninitialized value in concatenation" error when printing the line, even though the outputFile contained 100 rows. The second problem I had was that using:

tie my @array, 'Tie::File', $outputFile or die $!;
$array[2] = 'This is a test';
untie @array;

inserted 2 blank rows, then 'This is a test', then the original 100 rows. I was expecting it to simply replace the 3rd line. My complete code is here:

#!/usr/bin/perl
use strict;
use warnings;
use Tie::File;

my $genome_size = 1000; #1300000;
my $outputFile = "/Users/Benbo/Desktop/temp/template.txt";
unlink $outputFile if (-e $outputFile);

print "Start...\n";
add_fragment("Test", 2, 10, 200);
create_template();
print "Done\n";

sub add_fragment{
    my ($ref_id, $identity, $coord, $length) = @_;
    tie my @array, 'Tie::File', $outputFile or die $!;
    # $array[10] = 'This is a test';
    my $line = $array[2];
    print $line . "\n";
    untie @array;
}

sub create_template{
    for (1..9){
        write_to_file($_ x 50 . "\n", $outputFile);
    }
}

sub write_to_file{
    my $input = shift;
    my $outputFile = shift;
    open my $output, ">>", "$outputFile" or die "Could not open $outputFile: $!";
    print $output $input;
}

So my first question is: can anyone tell me what I'm doing wrong with Tie::File, as it is not working as I thought it would from the CPAN docs? My second question is: given that I will be inserting around 300k fragments into the array, is there a quicker and more efficient way of doing this?

Many thanks,
Benbo

Replies are listed 'Best First'.
Re: Question regarding Tie::File or a better way to handle huge 2-D arrays
by BrowserUk (Patriarch) on Aug 23, 2008 at 20:42 UTC

    Use strings instead of arrays. The following shows that a 100-element array containing strings of 4 million characters takes less than 400MB, which is well within the capabilities of most modern machines.

    It also shows a subroutine for doing the substitutions that very closely mirrors the syntax you have above. And finally, it shows that making 300,000 random substitutions takes less than 1 second on my machine:

    use Devel::Size qw[ total_size ];;

    $a[$_] = '*'x4e6 for 0 .. 99;;

    print total_size \@a;;
    400003052

    sub change {
        my( $y, $x, $text ) = @_;
        substr $a[$y], $x, length $text, $text;
    };;

    print time(); change( int( rand 100 ), int( rand 4e6 ), 'test' ) for 1 .. 3e5; print time;;
    1219523963 1219523963

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Use strings instead of arrays.

      Tie::File uses tied arrays, and (as far as I know) they are implemented in a way that not all of their elements are held in memory at the same time.
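A minimal sketch of Tie::File replacing a row in place, assuming the file already contains its rows when it is tied. In the script from the question, add_fragment() is called before create_template(), so the file is still empty at tie time; that ordering would explain both the undefined $array[2] and the blank rows in front of the assigned line. The temporary file and the variable names below are illustrative only.

```perl
use strict;
use warnings;
use Tie::File;
use File::Temp qw(tempfile);

# Write the template first, so the file has rows before it is tied.
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} $_ x 50, "\n" for 1 .. 9;
close $tmp or die "close failed: $!";

tie my @rows, 'Tie::File', $path or die "Could not tie $path: $!";
my $before = $rows[2];          # third row as written by the template
$rows[2] = 'This is a test';    # replaces the third row, no rows are added
my $count = @rows;              # number of rows after the assignment
untie @rows;

print "before: $before\n";
print "rows:   $count\n";
```

Tie::File rewrites the rest of the file whenever a row changes length, so with rows of millions of characters this is likely to stay slow even when it works as documented.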

Re: Question regarding Tie::File or a better way to handle huge 2-D arrays
by betterworld (Curate) on Aug 23, 2008 at 20:15 UTC

    I think the most efficient solution would be to open the file in read/write mode, then seek to the position where you want to put "test", then write it.

    This should not be hard because your lines are of constant length, so you can easily compute the byte position of your target string. However you should take care on certain operating systems that treat "\n" as more than one character.

    Tie::File still has to search for the newlines, and (afaik) it operates on whole lines; as I understand it, your lines are much longer than your columns, so this is still very inefficient.
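The seek-and-write idea from this reply could be sketched roughly as follows. It assumes every row has exactly the same number of characters and ends in a single "\n" byte; the file is opened in raw mode so that no newline translation takes place. The helper name overwrite_at, the 4 x 20 demo grid and the temporary file are hypothetical and only serve the illustration.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);
use File::Temp qw(tempfile);

# Hypothetical helper: overwrite part of one row in a file of fixed-width rows.
# $row_len is the number of data characters per row, excluding the newline.
sub overwrite_at {
    my ($fh, $row_len, $y, $x, $text) = @_;
    die "text does not fit in row\n" if $x + length($text) > $row_len;
    my $offset = $y * ($row_len + 1) + $x;    # +1 for the "\n" after each row
    seek($fh, $offset, SEEK_SET) or die "seek failed: $!";
    print {$fh} $text or die "write failed: $!";
    return;
}

# Build a small 4 x 20 template of '*' for demonstration.
my ($rows, $cols) = (4, 20);
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
my $row = ('*' x $cols) . "\n";
print {$tmp} $row for 1 .. $rows;
close $tmp or die "close failed: $!";

# Open read/write without truncating, then patch row 1 at column 2.
open my $fh, '+<:raw', $path or die "Could not open $path: $!";
overwrite_at($fh, $cols, 1, 2, 'test');
close $fh or die "close failed: $!";
```

Only the bytes of the fragment are written, so the cost of each change does not depend on the row length.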

Re: Question regarding Tie::File or a better way to handle huge 2-D arrays
by dragonchild (Archbishop) on Aug 24, 2008 at 04:48 UTC
    You might also try DBM::Deep - it's Perl data structures backed by disk instead of RAM.

    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
Re: Question regarding Tie::File or a better way to handle huge 2-D arrays
by roboticus (Chancellor) on Aug 23, 2008 at 21:29 UTC
    Benbo:

    You might also consider sparse arrays if the 'default' value is the overwhelming majority of the data that will ever be there...

    ...roboticus
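One way to read the sparse-array suggestion is to store only the cells that differ from the default '*' in a hash keyed by position. The sketch below is an assumption about what such a structure might look like, not code from this thread; set_text, get_slice and the "y,x" key format are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical sparse grid: only cells that differ from the default are stored.
my $default = '*';
my %cells;    # key "y,x" => single character

# Store each character of $text starting at row $y, column $x.
sub set_text {
    my ($y, $x, $text) = @_;
    my @chars = split //, $text;
    for my $i (0 .. $#chars) {
        $cells{ $y . ',' . ($x + $i) } = $chars[$i];
    }
    return;
}

# Rebuild $len characters of row $y starting at column $x.
sub get_slice {
    my ($y, $x, $len) = @_;
    my $out = '';
    for my $col ($x .. $x + $len - 1) {
        my $cell = $cells{"$y,$col"};
        $out .= defined $cell ? $cell : $default;
    }
    return $out;
}

set_text(1, 2, 'test');
print get_slice(1, 0, 20), "\n";
```

Each hash entry costs far more memory than one byte of a string, so this only pays off while the changed cells remain a small fraction of the grid.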
      Thanks to all. Moving from arrays to strings has really sped things up. Cheers, Benbo