walkingthecow has asked for the wisdom of the Perl Monks concerning the following question:

I have a decently large, colon-delimited file, and I am wondering what would be the best way to load it into memory: arrays or hashes? I am not modifying the data, only using it to break down one field and search a database by name.


The file looks like this:

    daamaya:Daniel R. Amaya, 00000232423:10:ED98765:PLK

It has about 100 to 100,000 entries like that, depending on which file is being used, and I am basically just trying to load all the daamaya <ids> and Daniel R. Amaya <names> into hashes. I would also like to break the name down into three variables: $fname, $initial, $lname.

I then use this data to search a HUGE CSV file. My search string is first the id from the input file, and then the name (broken down).

Re: Loading file into memory
by zentara (Cardinal) on Aug 05, 2008 at 18:06 UTC
    Hashes take more space and are a bit slower than arrays, but they are easier to work with because you can look things up by key. With arrays you need to set up indices, and things can get confused if something ends up out of place. What you want to do is read the file line by line, regex each line to pull out the three variables, then load them into your hash.
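
    A minimal sketch of that approach, assuming the sample line from the question (the name appears as "First I. Last" before the first comma) and a made-up input filename, might look like this:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %people;    # id => [ first, initial, last ]

    # 'people.txt' is a placeholder for the colon-delimited input file
    open( my $fh, '<', 'people.txt' ) or die "Can't open people.txt: $!";
    while ( my $line = <$fh> ) {
        chomp $line;

        # id up to the first colon, then "First I. Last" up to the first comma
        if ( $line =~ /^([^:]+):(\S+)\s+(?:(\w)\.?\s+)?(\S+),/ ) {
            my $id      = $1;
            my $fname   = $2;
            my $initial = defined $3 ? $3 : '';
            my $lname   = $4;
            $people{$id} = [ $fname, $initial, $lname ];
        }
    }
    close $fh;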

    If you are going to be doing it often, you may want to use dbmopen and make a db file, so you can skip the regexes on subsequent runs. You can also use Storable for this:

    #!/usr/bin/perl
    use Storable;

    %hash = ( 1 => "one", 2 => "two", 3 => "three" );
    foreach $key ( keys %hash ) { print "$key => $hash{$key}\n" }
    print "###############################################\n";

    store( \%hash, 'persistent-data.db' );

    # later on...
    #$href = retrieve('persistent-data.db');            # by ref
    %hashback = %{ retrieve('persistent-data.db') };    # direct to hash
    foreach $key ( keys %hashback ) { print "$key => $hashback{$key}\n" }
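
    For the dbmopen route, a rough sketch could look like the one below; 'people-db' is a placeholder name, and the actual files created on disk depend on which DBM library your perl was built with. Note that a plain DBM hash can only hold flat strings, so nested data has to be packed into a string (or see MLDBM, mentioned elsewhere in this thread).

    #!/usr/bin/perl
    use strict;
    use warnings;

    # build the DBM file once (e.g. after the regex pass over the input file)
    my %people;
    dbmopen( %people, 'people-db', 0644 ) or die "dbmopen failed: $!";
    $people{daamaya} = 'Daniel:R:Amaya';    # values must be flat strings
    dbmclose( %people );

    # on later runs just reopen it -- no re-parsing needed
    dbmopen( %people, 'people-db', 0644 ) or die "dbmopen failed: $!";
    my ( $fname, $initial, $lname ) = split /:/, $people{daamaya};
    dbmclose( %people );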

    I'm not really a human, but I play one on earth.
    Remember How Lucky You Are
Re: Loading file into memory
by Tanktalus (Canon) on Aug 05, 2008 at 20:31 UTC

    I know I'm not really answering your question (but that's because I suspect an XY Problem here), but I'd likely approach the same issue completely backwards. Instead of loading anything into memory, I'd likely start by trying to transform your CSV file into a CSV file that has all the information of interest in it. And then I'd use DBD::CSV (again, since I probably would have used it both to read the original CSV file and write the transformed CSV file) to create SQL JOIN queries to get the information I want out of each.

    The advantage to this is that I could then migrate to another database (which is where data should be), such as MySQL, PostgreSQL, SQLite, or even DB2 or Oracle, when I need more speed. I suspect that this would be faster than whatever you do in Perl... even if, at the beginning of each run, I need to pre-populate the database by loading from the CSV files. Even before migrating to a database, I don't expect this to be hugely slower than your approach, but it does scale better.
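
    A rough sketch of the DBD::CSV route is below. The table and column names are made up for illustration, both CSV files are assumed to have a header row, and DBD::CSV's SQL dialect (via SQL::Statement) is limited, so check the JOIN syntax against your installed version:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:CSV:', undef, undef, {
        f_dir      => '.',      # directory holding the CSV files
        RaiseError => 1,
    } ) or die $DBI::errstr;

    # map table names onto the actual files (filenames are placeholders)
    $dbh->{csv_tables}{people} = { file => 'people.csv' };
    $dbh->{csv_tables}{huge}   = { file => 'huge.csv' };

    my $sth = $dbh->prepare( q{
        SELECT people.id, people.name, huge.plk, huge.num1, huge.num2
        FROM   people
        JOIN   huge ON people.name = huge.name
    } );
    $sth->execute;
    while ( my @row = $sth->fetchrow_array ) {
        print join( ',', @row ), "\n";
    }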

Re: Loading file into memory
by monarch (Priest) on Aug 05, 2008 at 18:32 UTC

    Another method: store a hash with the key being the name and the value being the position in the file of the desired record.

    This will generally be slower when accessing records, because the file itself must be read again for each lookup, but it could end up faster overall since it minimises the risk of the in-memory data being pushed into swap (which is expensive to access in any case).

    Your specification was rather unclear, so for the purposes of example code I will assume that you want a record given an $id (e.g. "daamaya") and a $name (e.g. "Daniel R. Amaya"). So I will construct a hash of ids that point to a hash of names that point to a position in the file containing the desired record.

    E.g.

    my %ids    = ();
    my $curpos = 0;    # position within the file

    while ( defined( my $line = <HANDLE> ) ) {
        my ( $part_left, $part_right ) = split( /\s*,\s*/, $line );
        my ( $id, $name ) = split( /:/, $part_left );
        $ids{$id}->{$name} = $curpos;
        $curpos += length( $line );
    }

    # later, when we want to look up a record
    sub get_record( $ $ ) {
        my ( $id, $name ) = @_;    # arguments
        my $filepos = $ids{$id}->{$name};
        seek( HANDLE, $filepos, 0 );
        my $line = <HANDLE>;    # process $line as needed
        my ( $part_left, $part_right ) = split( /\s*,\s*/, $line );
        return ( $part_right );
    }

    # e.g. get_record( "daamaya", "Daniel R. Amaya" );
Re: Loading file into memory
by TGI (Parson) on Aug 05, 2008 at 18:34 UTC

    If you want to use arrays to save space, it can be helpful to set up constants for your indexes.

    use constant ID         => 0;
    use constant FIRST_NAME => 1;
    use constant INITIAL    => 2;
    use constant LAST_NAME  => 3;

    # Processed data would look like this:
    my @data = (
        #     ID      FNAME   MI  LNAME
        [ qw( dramaya Daniel  R   Amaya ) ],
        [ qw( bjsmith Bob     J   Smith ) ],
        [ qw( abjones Angela  B   Jones ) ],
    );

    # Access your data:
    foreach my $entry (@data) {
        my $id    = $entry->[ID];
        my $fname = $entry->[FIRST_NAME];

        # or you could do:
        my ( $first, $mi, $last, $id )
            = @{$entry}[ FIRST_NAME, INITIAL, LAST_NAME, ID ];
    }

    Using constant array indexes can be a nice help for readability.


    TGI says moo

      I appreciate all the help, guys! However, I think I should clarify a few things...

      First, I made a typo in my first post. Each record in the file actually looks like this:
      ID      NAME      PLK      NUM1      NUM2
      daamaya:Daniel R. Amaya,PLK,0000056789,ED97865:10:25:blah:blah

      Now, I need every field of that somehow loaded into memory. Then I want to use the $id and $name fields to search through my CSV file. If $name is found (e.g. Daniel Amaya), then I want to print the PLK, num1, and num2 from the CSV file to the screen (or a file). I just figure that is a faster way to use one file's fields to search another. I was trying to do it by reading the file line by line and then scanning the CSV for each record, but it was super slow.
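
      For what it's worth, a minimal parsing sketch for a record in that layout might look like the following; the field positions are guessed from the sample line, so the splits may need adjusting to match the real format:

      #!/usr/bin/perl
      use strict;
      use warnings;

      my %records;

      # 'input.txt' is a placeholder for the file with records like:
      # daamaya:Daniel R. Amaya,PLK,0000056789,ED97865:10:25:blah:blah
      open( my $in, '<', 'input.txt' ) or die "Can't open input.txt: $!";
      while ( my $line = <$in> ) {
          chomp $line;

          # assumed layout: id:NAME,PLK,NUM1,NUM2:rest...
          my ( $id, $rest ) = split /:/, $line, 2;
          my ( $name, $plk, $num1, $trailing ) = split /,/, $rest, 4;
          my ($num2) = split /:/, $trailing;

          # break "Daniel R. Amaya" into first / initial / last
          my @parts   = split /\s+/, $name;
          my $fname   = shift @parts;
          my $lname   = pop @parts;
          my $initial = @parts ? $parts[0] : '';
          $initial =~ s/\.$//;

          $records{$id} = {
              name => $name, fname => $fname, initial => $initial,
              lname => $lname, plk => $plk, num1 => $num1, num2 => $num2,
          };
      }
      close $in;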

        You want to process the CSV file and put it into a database of some sort, either a DB file or SQLite. If you use SQLite, make sure you index the fields you'll be searching against. If you use a DB file (such as Berkeley DB), you'll want to think carefully about your data structure; MLDBM is also worth looking at. Your query times will improve dramatically.

        So your flow should look something like this:

        my $dbh       = Read_Huge_File_Into_DB( $huge_file_path );
        my @customers = Process_Customer_Information_File( $dbh, $file_path );
        Print_Report( \@customers );

        sub Process_Customer_Information_File {
            my $dbh  = shift;
            my $file = shift;

            open( my $info, '<', $file ) or die "Uh oh $!";

            my @customers_found;
            while ( my $line = <$info> ) {
                my $customer_data = ParseCustomerData($line);
                my $name          = $customer_data->[NAME];

                if ( Customer_Found( $dbh, $name ) ) {
                    push @customers_found, $customer_data;
                }
            }
            return @customers_found;
        }

        If it were me, I'd use SQLite.
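
        A rough sketch of what Read_Huge_File_Into_DB and Customer_Found could look like with DBD::SQLite follows; the column names, the CSV column order, and the choice of index are all assumptions to adapt to the real data:

        use strict;
        use warnings;
        use DBI;
        use Text::CSV;    # or Text::CSV_XS

        sub Read_Huge_File_Into_DB {
            my $huge_file = shift;

            my $dbh = DBI->connect( 'dbi:SQLite:dbname=huge.db', '', '',
                { RaiseError => 1, AutoCommit => 0 } );

            $dbh->do( 'CREATE TABLE IF NOT EXISTS records
                         ( name TEXT, plk TEXT, num1 TEXT, num2 TEXT )' );

            my $csv = Text::CSV->new( { binary => 1 } );
            open( my $fh, '<', $huge_file ) or die "Can't open $huge_file: $!";

            my $ins = $dbh->prepare(
                'INSERT INTO records ( name, plk, num1, num2 ) VALUES ( ?, ?, ?, ? )' );
            while ( my $row = $csv->getline($fh) ) {
                $ins->execute( @{$row}[ 0 .. 3 ] );    # assumed column order
            }
            close $fh;

            # index the field you search on, then commit the whole load at once
            $dbh->do( 'CREATE INDEX IF NOT EXISTS idx_name ON records (name)' );
            $dbh->commit;

            return $dbh;
        }

        sub Customer_Found {
            my ( $dbh, $name ) = @_;
            my ($count) = $dbh->selectrow_array(
                'SELECT COUNT(*) FROM records WHERE name = ?', undef, $name );
            return $count;
        }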


        TGI says moo