healingtao has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I have a CSV file that is about 100 MB, with 60k rows and 10 columns. The first 2 columns are the key. I need to load it into memory and, based on keys passed in externally, find the matching rows and extract some data from them to do some calculations. The external input will come in with about 5k keys. I'm curious what the most efficient method for getting this done is. Thanks in advance


Replies are listed 'Best First'.
Re: load a file in memory and extract parts
by aaron_baugher (Curate) on May 05, 2015 at 21:26 UTC

    In general, when you want to use keys from one file to look up values in another file, you load the keys from the first file into a hash, then loop through the other file, checking whether each line's key exists in the hash and doing something with the line if it does. Unless there's a reason to do otherwise, it's usually best to load the smaller file (in this case your 5k one) into the hash, then loop through the larger file. So in pseudo-code:

    open 5k file
    foreach line
        get key from line and put it in hash as key=1
    close 5k file

    open 100M file
    foreach line
        get key from line
        if key is in hash from other file
            do stuff with the line
    close 100M file
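    Fleshed out, that might look something like this minimal sketch (the file names keys.txt and data.csv, the comma separator, and the one-key-per-line layout of the 5k file are all assumptions):

    use strict;
    use warnings;

    my ($key_file, $big_file) = ('keys.txt', 'data.csv');

    # Pass 1: load the ~5k keys into a hash.
    my %wanted;
    open my $kfh, '<', $key_file or die "$key_file: $!";
    while (my $line = <$kfh>) {
        chomp $line;
        $wanted{$line} = 1;               # assumes one "col1,col2" key per line
    }
    close $kfh;

    # Pass 2: stream the big file and act only on rows whose key matches.
    open my $dfh, '<', $big_file or die "$big_file: $!";
    while (my $line = <$dfh>) {
        chomp $line;
        my ($c1, $c2, @rest) = split /,/, $line;
        if ($wanted{"$c1,$c2"}) {
            print "$c1,$c2: @rest\n";     # do your calculations here instead
        }
    }
    close $dfh;

    Because the big file is streamed line by line, only the 5k keys ever live in memory; at 100 MB you could just as well invert the pattern and hold the big file in the hash instead.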

    Once you have some code which attempts to do that, show it to us along with a few lines of sample input and output data, and we can guide you further if you need it.

    Aaron B.
    Available for small or large Perl jobs and *nix system administration; see my home node.

      Tux seems to be offline, so I'll link to Text::CSV_XS:

      • Text::CSV_XS takes care of reading and writing CSV files. Unlike most "five lines of perl" attempts, it handles most, if not all, of the nasty edge cases; see the short sketch after this list.
      • DBD::CSV sits on top of Text::CSV_XS and allows SQL access to CSV files. It may be slower than the SQLite approach proposed by sundialsvc4 below, but it avoids converting the CSV to SQLite.
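      A minimal Text::CSV_XS sketch of loading such a file into a hash keyed on the first two columns (the file name data.csv and the comma separator are assumptions, not anything from the OP):

      use strict;
      use warnings;
      use Text::CSV_XS;

      my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });
      open my $fh, '<', 'data.csv' or die "data.csv: $!";

      my %row_for;
      while (my $row = $csv->getline($fh)) {
          my $key = join ',', @{$row}[0, 1];    # first two columns form the key
          $row_for{$key} = $row;                # arrayref holding all 10 fields
      }
      close $fh;

      # Look up an externally supplied key pair later on.
      my $hit = $row_for{'KEY1,KEY2'};
      print "@$hit\n" if $hit;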

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re: load a file in memory and extract parts
by ww (Archbishop) on May 05, 2015 at 21:18 UTC

    You might want to clarify your initial quantities.

    I (perhaps carelessly) first read "100mb 60k rows" as 1 x 10^6 (aka '1.0e+6', aka one million) rows, each with 10 columns. Unless most of the columns contain only a single character (or something close thereto), you're looking at a lot to slurp (on which, see Super Search here).

    But a "mb" can mean 10242 bytes or 1,000,000 (decimal) bytes (often used by manufacturers of hard disks and computers to exagerate the size of their product)... and in any case, it's not conventionally used to [state|discuss|evaluate] a pure number.
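    Either way, a quick sanity check on the arithmetic: if the OP really does mean 100 MB over 60,000 rows, that is roughly 100,000,000 / 60,000 ≈ 1.7 KB per row, which is plausible for 10 columns and still small enough to slurp on most modern machines.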

Re: load a file in memory and extract parts
by GotToBTru (Prior) on May 05, 2015 at 22:03 UTC

    Build an index using the key as the hash key and the line number as the value. Use Tie::File to avoid having to slurp all 60k records at once.
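    A rough sketch of that idea, assuming a comma-separated data.csv with a header line and the key in the first two columns (the names are made up):

    use strict;
    use warnings;
    use Tie::File;

    tie my @lines, 'Tie::File', 'data.csv' or die "data.csv: $!";

    my %index;                          # "col1,col2" => line number
    for my $i (1 .. $#lines) {          # skip the header at line 0
        my ($k1, $k2) = split /,/, $lines[$i];
        $index{"$k1,$k2"} = $i;
    }

    # Fetch a matched row on demand, without slurping the whole file.
    if (defined(my $n = $index{'KEY1,KEY2'})) {
        my @fields = split /,/, $lines[$n];
        print "@fields\n";
    }

    Building the index still costs one pass over the file, but Tie::File keeps its memory use bounded rather than holding all 60k lines at once.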

    Dum Spiro Spero
Re: load a file in memory and extract parts
by sundialsvc4 (Abbot) on May 05, 2015 at 23:22 UTC

    Obviously, the first notion that pops into my mind is: “this would be a great application for an SQLite database file!”

    SQLite, as you know, is a powerful SQL system that stores everything in just one file. It already has facilities to import a CSV file. It can quite effortlessly import your file into a table, then create indexes on that table so that rows can be retrieved very fast by the key (which consists of two columns ...). You can, if you like, even import the incoming file into another table, and then your entire problem has basically been reduced to an INNER JOIN.
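    A sketch of how that could look from Perl via DBI and DBD::SQLite, assuming the CSV has already been imported into a table named big (for example with the sqlite3 shell's .mode csv and .import dot-commands) and the 5k keys into a table named wanted; the table, column, and file names are all made up:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=data.db', '', '',
                           { RaiseError => 1 });

    # Index the two-column key so lookups and the join are fast.
    $dbh->do('CREATE INDEX IF NOT EXISTS big_key ON big (col1, col2)');

    # The whole matching problem reduces to one INNER JOIN.
    my $rows = $dbh->selectall_arrayref(
        'SELECT big.* FROM big
         INNER JOIN wanted ON wanted.col1 = big.col1
                          AND wanted.col2 = big.col2'
    );
    print join('|', @$_), "\n" for @$rows;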

    Food for thought, anyway. Personally, I love to stumble upon situations where I don’t have to write a single line of “programming” in order to solve a problem. Maybe this will work for you; maybe not. But if it does, you could chomp this entire problem in a matter of minutes.

      I tried the following code:

      sub TestArray {
          my @file_array;
          my $InputFile = "/home/test/test_file.dat";   # '|' separator file
          my $column;
          my $row;
          tie @file_array, 'Tie::File', $InputFile or die $!;
          shift @file_array;
          foreach $row (0 .. @file_array - 1) {
              foreach $column (0 .. @{$file_array[$row]} - 1) {
                  $LOG->info("Element [$row][$column] = $file_array[$row][$column]\n");
              }
          }
      }

      At this point I'm trying to load the file into a multidimensional array. The file has 10 columns and 50,000 rows. How can I extract data from this array? When I run this, I get the following error: Can't use string ("DEAL_NAME|TRANCHE_NAME|GROUP_NO|") as an ARRAY ref while "strict refs" in use at ./test_feeds.pl

        Tie::File lets you access each line of your file as an array element. It does not load the file into a multidimensional array. The reason for the error is that you are trying to access an element of an Array-of-Arrays data structure, but you just have a simple array of strings (see perldsc). Maybe split will help:
        use warnings;
        use strict;

        my @file_array = qw(a|b|c d|e|f);
        foreach my $row (0 .. @file_array - 1) {
            foreach my $column (split /\|/, $file_array[$row]) {
                print "$row $column\n";
            }
        }

        __END__

        0 a
        0 b
        0 c
        1 d
        1 e
        1 f