alphavax has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I want to search a large CSV file (120,000 lines, 10 columns each) and extract the lines that match a hash of about 5,000 keys, based on both key and value. Right now I read the CSV file with DBI and run an SQL query against it for each key/value pair in the hash:
sub dataextractor {
    my $input  = shift;
    my $output = shift;
    my ( @colname, @db_result );

    open( OUT, ">" . $output ) or die "Can't open the file $output\n";
    open( IN,  "<" . $input )  or die "Can't open the file $input\n";
    while (<IN>) {
        print OUT $_;
        s/\r?\n$//;
        @colname = split( ',', $_ );
        last;
    }    #while
    close(IN);

    my $array_size = scalar(@colname);
    my $db_query;
    my $dbh = DBI->connect("DBI:CSV:f_dir=.;csv_eol=\n;")
        or die "Can't connect with the DB: $!\n";
    $dbh->{'csv_tables'}->{'csvtable'} = { 'file' => $input };

    foreach my $key ( keys %site_peakhr ) {
        $db_query = "SELECT " . join( ',', @colname )
            . " FROM csvtable where $colname[2] like $key"
            . " and $colname[1] = $site_peakhr{$key}";
        @db_result = $dbh->selectrow_array($db_query);
        if ( scalar(@db_result) < 1 ) {
            $db_result[0] = $date;
            $db_result[1] = $cell_peakhr{$key};
            $db_result[2] = $key;
            foreach my $count ( 3 .. $array_size - 1 ) {
                $db_result[$count] = "-";
            }    #foreach
        }    #if
        print OUT join( ',', @db_result ) . "\n";
    }    #foreach
    close(OUT);
    return ($output);
}

The problem with this approach is that the CSV file (40MB) is loaded into the DBI engine 5,000 times and it takes hours to process.

Is there a faster way to do this? The CSV file has the 5,000 sites with hourly data (hours 0 to 23); this is an example of the CSV file:

DATE,HOUR,SITE,SALES,ITEMS,ITEMS2,ITEMS3,ITEMS4,ITEMS5,ITEMS6
11/02/2011,20,NAW6546,51.0,124.0,82.0,50.0,0.0,0.0,366.0
11/02/2011,21,NAW6546,72.0,300.0,82.0,50.0,0.0,0.0,214.0
11/02/2011,22,NAW6546,5.0,254.0,82.0,50.0,0.0,0.0,985.0
11/02/2011,23,NAW6546,47.0,530.0,82.0,50.0,0.0,0.0,517.0

And the hash %site_peakhr holds the hour with the most sales for each site:

NAW6546 => 15,
NAW1457 => 22,
...

The other approach I was thinking of is to read the CSV line by line, look the key up in the hash and, if it is there, print the line to the output file; if not, go on to the next line. But I don't know if that would be faster.

Any suggestions?

Re: Best way to match a hash with large CSV file
by chromatic (Archbishop) on Nov 03, 2011 at 17:25 UTC
    Any suggestions?

    If you need to do something once, do it only once. That includes populating the database and preparing the query. You might even be able to load the CSV file into a database with a bulk loader program and skip that step in Perl.

    From a design point of view, your function does way too much, which is why it does the same work over and over, and that's one reason it's too slow.


      So what you are saying is to load the CSV into the DB and do the prepare once, and just execute the query for each value in the hash, using placeholders? Like:

      $sth = $dbh->prepare( "
          SELECT * FROM csvtable WHERE SITE LIKE ? AND HOUR = ?
      " );
      foreach my $key ( keys %cell_peakhr ) {
          $sth->execute( $key, $cell_peakhr{$key} );
          @db_result = $sth->fetchrow_array();
          if ( scalar(@db_result) < 1 ) {
              $db_result[0] = $date;
              $db_result[1] = $cell_peakhr{$key};
              $db_result[2] = $key;
              foreach my $count ( 3 .. $row_size - 1 ) {
                  $db_result[$count] = "-";
              }    #foreach
          }    #if
          print OUT join( ',', @db_result ) . "\n";
      }

        If you even need the database, yes. If your entire dataset fits into memory in a hash (as is likely), use a hash.
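
        For instance, an untested sketch (it reuses your $input, $output and %site_peakhr, and assumes there is one data line per site/hour pair): read the file once into a hash keyed on site and hour, then walk %site_peakhr:

        use strict;
        use warnings;

        # Index every data line by "SITE,HOUR" -- the whole 40MB fits in RAM easily.
        my %line_for;
        open my $in, '<', $input or die "Can't open $input: $!";
        my $header = <$in>;
        while ( my $line = <$in> ) {
            # field 1 = HOUR, field 2 = SITE (0-based), as in the sample data
            my ( undef, $hour, $site ) = split /,/, $line, 4;
            $line_for{"$site,$hour"} = $line;
        }
        close $in;

        # One hash lookup per site -- no database, no SQL.
        open my $out, '>', $output or die "Can't open $output: $!";
        print {$out} $header;
        for my $site ( keys %site_peakhr ) {
            my $key = "$site,$site_peakhr{$site}";
            print {$out} $line_for{$key} if exists $line_for{$key};
        }
        close $out;

        That is one pass over the file and one pass over the hash -- no database at all.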


Re: Best way to match a hash with large CSV file
by BrowserUk (Patriarch) on Nov 03, 2011 at 19:09 UTC

    A hash is an O(1) lookup data structure. Using SQL to look up each of its keys in a large list is ludicrous.

    Forget DBI & SQL.

    Read the lines from the file one at a time and look up the appropriate field(s) in the hash.
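
    Something along these lines (untested; it assumes HOUR is field 1 and SITE is field 2, 0-based, as in your sample, and reuses your $input, $output and %site_peakhr):

    open my $in,  '<', $input  or die "Can't open $input: $!";
    open my $out, '>', $output or die "Can't open $output: $!";

    print {$out} scalar <$in>;              # copy the header line as-is

    while( <$in> ) {
        # split just enough to get at HOUR and SITE
        my( undef, $hour, $site ) = split ',', $_, 4;
        print {$out} $_
            if exists $site_peakhr{ $site }
            and $site_peakhr{ $site } == $hour;
    }

    close $in;
    close $out;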

    The problem with this approach is that the CSV file (40MB) is loaded into the DBI engine 5,000 times and it takes hours to process.

    Looking up 120,000 items in a hash containing 5000 keys just took me 0.035816 seconds.

    $hash{ int rand( 120000 ) } = 1 for 1 .. 5000;;

    $t = time; exists $hash{ $_ } and ++$c for 1 .. 120000; print $c; printf "%.6f\n", time() - $t;;
    4647
    0.035816

    You'd be lucky to get an error message from DBI in that time.


      Thanks a lot BrowserUk!!!

      As you suggested, I read the CSV file line by line, looked the key up directly in the hash, and completed the task in 6 seconds!

      sub dataextractor {
          my $input  = shift;
          my $output = shift;
          my $order  = shift;
          my @data;

          open( OUT, ">" . $output ) or die "Can't open the file $output\n";
          open( IN,  "<" . $input )  or die "Can't open the file $input\n";
          while (<IN>) {
              if ( $. == 1 ) {
                  print OUT $_;
                  next;
              }
              s/\r?\n$//;
              @data = split( ',', $_ );
              if ( exists $site_peakhr{ $data[$order] }
                  and $data[1] eq $site_peakhr{ $data[$order] } )
              {
                  print OUT $_ . "\n";
              }
          }    #while
          close(IN);
          close(OUT);
          return ($output);
      }
Re: Best way to match a hash with large CSV file
by Marshall (Canon) on Nov 04, 2011 at 04:05 UTC
    The DB, if it is built with SQLite, will be so performant that it is great!

    Please say more about the relationship between input and the desired output...

    DATE,HOUR,SITE,SALES,ITEMS,ITEMS2,ITEMS3,ITEMS4,ITEMS5,ITEMS6
    11/02/2011,20,NAW6546,51.0,124.0,82.0,50.0,0.0,0.0,366.0
    11/02/2011,21,NAW6546,72.0,300.0,82.0,50.0,0.0,0.0,214.0
    11/02/2011,22,NAW6546,5.0,254.0,82.0,50.0,0.0,0.0,985.0
    11/02/2011,23,NAW6546,47.0,530.0,82.0,50.0,0.0,0.0,517.0
    So there is:
    A) date (11/02/2011),
    B) hour, like [ 20, 21, 22, 23 ],
    C) SITE, like NAW6546,
    D) SALES and ITEMS..ITEMS6
    6 seconds is a ridiculously long time; in that time you could do a linear search through 700 files with 300K+ lines.
      The DB, if it is built with SQLite, will be so performant that it is great!

      That is a very bold, some would say rash, claim, given that you appear not to understand the problem.

      At some point in the run of a program the OP has a hash containing 5000 key/value pairs. At that same point in time on disk, there is a CSV file containing 120,000 lines of 10 fields.

      The task is, given the hash and an integer (that identifies a field within each record), to produce an output file on disk containing the subset of the input file where the field identified by that integer exists as a key in the hash, with field 1 (0-based) of the record as its value.

      What you are suggesting is that in less than 6 seconds you can:

      1. Build an SQLite DB from the file of 120,000 records.
      2. Perform 5000 SQL queries -- against that 120,000 record DB -- of the form:
        select * from table where field1 eq ? and field? eq ?;
      3. Convert all the results records, from each of the 5000 queries, back to CSV records.
      4. Write them out to the new file.

        (For strict compliance, re-order the retrieved records into input file order before output.)

      5. (Delete the SQLite DB you created!)

      Good luck!


        1. Build an SQLite DB from the file of 120,000 records.
        No problem at all. This will take <1 second for a file with 120K lines.

        1.5) Index the DB. This step is not to be underestimated and was not mentioned.

        2. Perform 5000 SQL queries -- Ooooh... I did not say what kind of queries, nor how many... But essentially, yes, you can do a lot of these per second, provided that you have indexed the DB in step 1.5 correctly (a rough sketch follows after step 5). Some seconds will be required for this step.

        --Steps 3,4,5 are super trivial reformatting steps...

        3. Convert all the results records, from each of the 5000 queries, back to CSV records.

        4. Write them out to the new file.

        5. (Delete the SQLite DB you created!)
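
        Roughly what I have in mind -- an untested sketch (table and column names are invented; $input, $output and %site_peakhr are the OP's):

        use strict;
        use warnings;
        use DBI;

        # 1) Build the SQLite DB from the CSV, inside one transaction for speed.
        my $dbh = DBI->connect( 'dbi:SQLite:dbname=peak.db', '', '',
            { RaiseError => 1, AutoCommit => 1 } );
        $dbh->do( 'CREATE TABLE sales ( date TEXT, hour INTEGER, site TEXT,
                     sales REAL, i1 REAL, i2 REAL, i3 REAL, i4 REAL, i5 REAL, i6 REAL )' );

        open my $in, '<', $input or die "Can't open $input: $!";
        my $header = <$in>;                                   # skip the header row

        my $ins = $dbh->prepare( 'INSERT INTO sales VALUES (?,?,?,?,?,?,?,?,?,?)' );
        $dbh->begin_work;
        while ( <$in> ) {
            s/\r?\n\z//;
            $ins->execute( split /,/ );
        }
        $dbh->commit;
        close $in;

        # 1.5) Index the columns we query on -- this is the important bit.
        $dbh->do( 'CREATE INDEX site_hour ON sales ( site, hour )' );

        # 2..4) One prepared query, 5000 executes, results written back out as CSV.
        my $sel = $dbh->prepare( 'SELECT * FROM sales WHERE site = ? AND hour = ?' );
        open my $out, '>', $output or die "Can't open $output: $!";
        print {$out} $header;
        for my $site ( keys %site_peakhr ) {
            $sel->execute( $site, $site_peakhr{$site} );
            while ( my @row = $sel->fetchrow_array ) {
                print {$out} join( ',', @row ), "\n";
            }
        }
        close $out;

        # 5) Throw the DB away when done.
        $dbh->disconnect;
        unlink 'peak.db';

        The single transaction around the inserts and the ( site, hour ) index are what keep steps 1 and 2 fast.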

        Update:

        Creation of a DB with 266,551 lines from 600+ source files took 78 seconds.
        Fancy indexing took another 20 seconds.
        Basically 100 seconds and I'm "ready to roll" with 1/4 million lines.

        I have a wimpy "Prescott" machine that is years old.