david_lyon has asked for the wisdom of the Perl Monks concerning the following question:

Update 3:
Thanks a lot for everyone's VERY useful comments, which are valid and constructive. I am going to work on the code for a bit and, once I get it working, will come back for more detailed help if appropriate.

Thanks again for everyone's generous time and help.

Update2:
I think I got something working. Can someone check if it's OK? It's working, but I'm not sure if it's the most efficient approach.

Thanks again for everyone's help, and apologies for all the questions.

Dave
#!/usr/bin/perl

open(FILE, "args.test");
@files = <FILE>;

my @allfiles;
for my $filename (@files) {
    open FILE, $filename || die "Cannot open $filename for reading: $!\n";
    push @allfiles, <FILE>;
    close FILE;
}

@test = grep /chr1:8325525/, @allfiles;
print "@test\n";
Update: Thanks for everyone's comments so far. Can someone give me some pointers on how to build an Array of Arrays or a Hash of Arrays from 100 files, or point me to some code that does this? Thanks again.

Hi everyone. If I had, say, 100 files and I wanted to put each file into its own array in memory, like below, how would I automatically create the name for each array so that I have 100 unique names? I could then get each array to spit out its data, e.g.:

print "$array[1],$array2[1],$array3[1],$array4[1]...etc\n";

while (<FILES>) {
    chomp;
    process_into_array($_);
}

sub process_into_array {
    my $file = shift;
    open(DAT, $file);
    @file = (<DAT>);
}

Thank you for your help!

Replies are listed 'Best First'.
Re: slurping many Arrays into memory...Please Help
by sauoq (Abbot) on May 24, 2012 at 01:38 UTC
    Update2: I think I got something working. Can someone check if it's OK?

    It looks to me like you've still got some issues.

    You should probably chomp your file names, for one.

    And you're not building your array of arrays correctly. Your line:

    push @allfiles, <FILE>;

    will push all of the lines from all of the files onto one array, @allfiles. You want:

    push @allfiles, [ <FILE> ];

    instead.

    BTW, if you want to keep the filenames, you can use a hash of arrays instead... then the line would look like:

    $allfiles{ $filename } = [ <FILE> ];
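
    Putting that together with the rest of your posted code, a minimal hash-of-arrays sketch (untested, reusing the args.test list from your post) might look like:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # read the list of filenames, one per line
    open my $list, '<', 'args.test' or die "Cannot open args.test: $!\n";
    chomp( my @files = <$list> );
    close $list;

    # slurp each file into its own array, keyed by filename
    my %allfiles;
    for my $filename (@files) {
        open my $fh, '<', $filename or die "Cannot open $filename for reading: $!\n";
        $allfiles{$filename} = [ <$fh> ];
        close $fh;
    }

    # e.g. print the second line (index 1) of every file
    print $allfiles{$_}[1] for sort keys %allfiles;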

    -sauoq
    "My two cents aren't worth a dime.";
Re: slurping many Arrays into memory...Please Help
by Anonymous Monk on May 24, 2012 at 00:38 UTC
Re: slurping many Arrays into memory...Please Help
by jwkrahn (Abbot) on May 24, 2012 at 00:39 UTC

    Use an Array of Arrays or a Hash of Arrays.
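
    For example, the array-of-arrays flavor could be built something like this (a rough sketch, assuming the filenames are already in @files):

    my @allfiles;
    for my $filename (@files) {
        open my $fh, '<', $filename or die "Cannot open $filename: $!\n";
        push @allfiles, [ <$fh> ];    # each element is a reference to one file's lines
        close $fh;
    }

    print $allfiles[0][1];    # second line of the first file
    print $allfiles[3][0];    # first line of the fourth file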

Re: slurping many Arrays into memory...Please Help
by Marshall (Canon) on May 24, 2012 at 14:28 UTC
    What kind of memory structure is appropriate depends upon what you are going to do with the data (how you intend to process it). It could be helpful if you could explain a bit about that. The only reason to keep all 100 files in memory is if there is some connection between the data in the files. Otherwise you can just process each file individually, one at a time. Of course if these files are big, storing them all in memory at the same time is going to take a lot of memory!

    When you build the memory structure, some initial processing (like maybe splitting out the important data fields 1, 3, 8, 10) is usually appropriate rather than storing a verbatim copy of the line from the file.

    Small note:

    open FILE, $filename || die "Cannot open $filename for reading: $!\n";

    # due to precedence rules, if you use the ||, parens are needed
    open(FILE, '<', $filename) || die "Cannot open $filename for reading: $!\n";

    # or use the lower precedence "or"
    open FILE, '<', $filename or die "Cannot open $filename for reading: $!\n";
    Update: if you are just grepping for certain lines, consider using the command line grep to get the lines of interest. The file system will do some file caching, so the 2nd and 3rd greps will speed up. What's "best" depends upon how many searches you are going to do. Or, if you are always searching on just one field, a hash data structure keyed on that field may be appropriate.
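
    A rough sketch of that last idea (assuming the filenames are already in @files and that the position, e.g. "chr1:8325525", is the first whitespace-separated field of each line):

    my %by_key;
    for my $filename (@files) {
        open my $fh, '<', $filename or die "Cannot open $filename: $!\n";
        while ( my $line = <$fh> ) {
            my ($key) = split ' ', $line;      # first field as the hash key
            push @{ $by_key{$key} }, $line;    # collect matching lines from all files
        }
        close $fh;
    }

    # later, lookups avoid rescanning every line
    print @{ $by_key{'chr1:8325525'} || [] };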

    I do have one application that uses 700-900 flat files as a database. Each file is hundreds to a few thousand lines. On my Win XP system, the first linear search takes 7 seconds (open each file, read each line, etc.). After that, subsequent searches take <1 second due to file system caching. Results get displayed as they arrive in a Tk UI. Average user session is 1-2 hours. Nobody has ever even noticed that the first search is "slow" or that it speeds up - the results spew out faster than a human can process them. I am converting this to an SQLite DB and it will be even faster, but the point is: try the simple stuff first and see how it goes before trying to optimize. In this case, all the files become memory resident without me having to do anything at all, and I have very simple code that doesn't use much memory in my process space. Just a thought.

Re: slurping many Arrays into memory...Please Help
by CountZero (Bishop) on May 24, 2012 at 16:34 UTC
    If you explain what you are actually trying to do, it would be easier to check your code and tell you whether it is efficient or not. If all you are trying to do is grep the files for the string "chr1:8325525", then there is no need to slurp all the data into memory first.
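
    Something along these lines would do the search line by line without slurping (a sketch only, reusing the args.test list from your post):

    open my $list, '<', 'args.test' or die "Cannot open args.test: $!\n";
    chomp( my @files = <$list> );
    close $list;

    for my $filename (@files) {
        open my $fh, '<', $filename or die "Cannot open $filename: $!\n";
        while (<$fh>) {
            print "$filename: $_" if /chr1:8325525/;    # only one line in memory at a time
        }
        close $fh;
    }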

    If you show a short example of one of the input files and of the result you expect, then we can perhaps suggest more improvements for your code.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
Re: slurping many Arrays into memory...Please Help
by sundialsvc4 (Abbot) on May 24, 2012 at 14:11 UTC

    Indeed, it is usually not a good practice to slurp a lot of data “into memory,” because all you’re actually doing is trading one kind of disk file for another. All “memory” is, of course, virtual, and if you start grabbing lots of data into your virtual-memory space, paging is going to start happening and now you are (very expensively) moving data from one part of the disk drive to another. I suggest that you design the logic to locate all of the files first, but then process them in some semi-sequential fashion so that you are not constantly closing and reopening them. Such programs not only get started faster, but they have more predictable (and favorable) performance characteristics more or less regardless of actual data volume. This is a “rule of thumb” to be sure, and every real-world case is different, but it has served me well.