faozhi has asked for the wisdom of the Perl Monks concerning the following question:

Hi guys

The server I am going to run the script below on has 64GB of memory, and I can only use a maximum of 80% of that.
The Perl script below uses Data::Dump to dump all the data.
#!/usr/bin/perl -w
use strict;
use warnings;
use Data::Dump qw/dump/;

my %data;

foreach my $filename (qw/one.txt two.txt three.txt/) {
    open( my $file, $filename ) or die "Unable to open $filename because $!\n";
    while (<$file>) {
        chomp;
        my ( $chrX, $chrpos, $value1, $value2 ) = split(/\s+/);
        $data{$chrX}->{$chrpos}->{'value1'} += $value1;
        $data{$chrX}->{$chrpos}->{'value2'} += $value2;
    }    ## end while (<$file>)
}    ## end foreach my $filename (qw/one.txt two.txt three.txt/)

print dump( \%data );
The text files are as below:

one.txt:

    chromosome1 50000 12 20
    chromosome2 20000 0 21
    chromosome3 41444 9 2
    chromosome4 21414 4 1
    .
    .
    .

(there would be about 5 million lines of the above)

This applies to two.txt and three.txt as well: same format, same number of lines (approximately 5 million).
I would like to know whether this script is able to run on the server without overloading it.
Or is there a better way of doing this that puts less load on memory?

Cheers guys!

Replies are listed 'Best First'.
Re: Enquiry on memory usage
by BrowserUk (Patriarch) on May 06, 2009 at 14:06 UTC
    I would like to know whether this script is able to run on the server without overloading it.

    Yes, easily. Your data will require ~1.8GB of memory.

    Or is there a better way of doing this that puts less load on memory?

    Possibly. You could sort all 3 files by chromosome and position, read them in parallel, and output the summed information line by line. This would use minimal memory and isn't difficult to program, though there are a lot of edge cases that are easy to get wrong.
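
    For what it's worth, here is a minimal sketch of that sorted-merge approach. It assumes the three inputs have already been sorted by chromosome and then numerically by position, and the *.sorted.txt filenames are purely illustrative:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # The *.sorted.txt names are illustrative; sort the raw files first, e.g.
    #   sort -k1,1 -k2,2n one.txt > one.sorted.txt
    my @filenames = qw/one.sorted.txt two.sorted.txt three.sorted.txt/;

    my @handles;
    for my $filename (@filenames) {
        open my $fh, '<', $filename or die "Unable to open $filename: $!\n";
        push @handles, $fh;
    }

    # The current record ([chr, pos, value1, value2]) from each file, or undef at EOF.
    my @current = map { read_record($_) } @handles;

    while ( grep { defined } @current ) {

        # Find the smallest (chr, pos) key among the files' current records.
        my ($min) = sort { $a->[0] cmp $b->[0] || $a->[1] <=> $b->[1] }
                    grep { defined } @current;
        my ( $chr, $pos ) = @{$min}[ 0, 1 ];

        # Sum the values from every file sitting on that key, advancing as we go.
        my ( $sum1, $sum2 ) = ( 0, 0 );
        for my $i ( 0 .. $#handles ) {
            while ( defined $current[$i]
                and $current[$i][0] eq $chr
                and $current[$i][1] == $pos )
            {
                $sum1 += $current[$i][2];
                $sum2 += $current[$i][3];
                $current[$i] = read_record( $handles[$i] );
            }
        }
        print join( "\t", $chr, $pos, $sum1, $sum2 ), "\n";
    }

    sub read_record {
        my ($fh) = @_;
        my $line = <$fh>;
        return undef unless defined $line;
        chomp $line;
        return [ split /\s+/, $line ];
    }

    Each file advances independently, so a key missing from one file simply contributes nothing to that output line, and keys repeated within a single file are folded together as well.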

    But given that your current solution will use less than 3% of the server's capacity, or about 3.5% of your maximum allocation, there seems to be no reason to move away from the simple, direct approach.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Cheers for that. Can I ask how you calculated it?

        Sure. I ran the following command, which creates a 1-million-key HoHoH and then pauses:

        perl -e"$h{ $_ }{1} = { 1 .. 4 } for 1..1e6; <>"

        I then looked at my process monitor and saw that it required 600MB of memory, so I multiplied by 3 (the number of files) to come up with ~1.8GB. I've made a lot of assumptions, e.g. that the chromosome positions within the 3 files substantially overlap. I could have multiplied by 5 and arrived at 3GB, for example. My guess is that your actual requirement will be somewhere in between.

        But either way, the total memory requirement is very unlikely to come anywhere near your ~51GB (80% of 64GB) ceiling, so there was no need to suggest that you change your methods.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Enquiry on memory usage
by citromatik (Curate) on May 06, 2009 at 12:23 UTC

    The dumping process can eat a lot of memory for big data structures (at least for Data::Dumper; I don't know if this is also true for other dumpers like Data::Dump or Data::Dump::Streamer). Maybe you can consider other alternatives for storing your data, such as a DBMS, Storable, or YAML.
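
    As a minimal sketch of the Storable route (the filename is illustrative), the final print dump( \%data ); line could become something like:

    use Storable qw(store retrieve);

    # Write the structure to disk in Storable's compact binary format...
    store( \%data, 'data.storable' );

    # ...and read it back later, from this script or another one.
    my $data = retrieve('data.storable');

    Because Storable writes the structure straight to the file rather than building one huge Perl-source string first, it avoids the extra in-memory copy that a dumper produces.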

    HTHs,

    citromatik

Re: Enquiry on memory usage
by mikeraz (Friar) on May 06, 2009 at 12:46 UTC

    I don't know if the data structure would overwhelm your system resources. Regardless, you'll probably want to do something besides splat out the resulting data.

    Save the RAM and put your data into something mungable. SQLite is one option for a self-hosted, easy-to-port-around database. Perl has SQLite::DB, DBIx::SQLite::Simple, Class::DBI::SQLite, and many other modules to assist in storing, retrieving, and manipulating your millions of records.
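
    A rough sketch of that idea using plain DBI with DBD::SQLite (the database and table names are made up for the example) might look like:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    # Connect to (and create, if absent) an SQLite database file.
    my $dbh = DBI->connect( 'dbi:SQLite:dbname=chromosomes.db', '', '',
        { RaiseError => 1, AutoCommit => 0 } );

    $dbh->do('CREATE TABLE IF NOT EXISTS counts
              (chr TEXT, pos INTEGER, value1 INTEGER, value2 INTEGER)');

    my $ins = $dbh->prepare(
        'INSERT INTO counts (chr, pos, value1, value2) VALUES (?, ?, ?, ?)');

    # Load the raw lines; only one line at a time is held in memory.
    for my $filename (qw/one.txt two.txt three.txt/) {
        open my $fh, '<', $filename or die "Unable to open $filename: $!\n";
        while (<$fh>) {
            chomp;
            $ins->execute( split /\s+/ );
        }
    }
    $dbh->commit;

    # Let SQL do the summing instead of a big Perl hash.
    my $agg = $dbh->prepare(
        'SELECT chr, pos, SUM(value1), SUM(value2)
           FROM counts GROUP BY chr, pos');
    $agg->execute;
    while ( my @row = $agg->fetchrow_array ) {
        print join( "\t", @row ), "\n";
    }

    The aggregation happens inside SQLite and the results are fetched a row at a time, so memory use stays roughly flat no matter how many lines the input files contain.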

    There are also many modules and libraries for working with genomic data. They're out of my area of competence and I'll leave commentary on them to someone who has experience with them. Consider undertaking your own CPAN search to find them.


    Be Appropriate && Follow Your Curiosity