Hi, I want to search a large CSV file (120,000 lines & 10 columns each) and extract lines that match a hash of about 5,000 keys in it, based on both key and value.Right now I read the CSV file with DBI to do a SQL query on it for each key/value in the hash:
sub dataextractor($) { my $input = shift; my $output = shift; my (@colname,@db_result); open(OUT,">".$output) or die "Can't open the file $output\n"; open(IN,"<".$input) or die "Can't open the file $input\n"; while (<IN>) { print OUT $_; s/\r?\n$//; @colname = split (',',$_); last; }#while close(IN); my $array_size = scalar(@colname); my $db_query; my $dbh = DBI->connect("DBI:CSV:f_dir=.;csv_eol=\n;") or die "Can' +t connect with the DB: $!\n"; $dbh->{'csv_tables'}->{'csvtable'} = { 'file' => $input}; foreach my $key (keys %site_peakhr) { $db_query = "SELECT ".join(',',@colname)." FROM csvtable where $co +lname[2] like $key and $colname[1] = $site_peakhr{$key}" ; @db_result = $dbh->selectrow_array($db_query); if (scalar(@db_result) < 1) { $db_result[0] = $date; $db_result[1] = $cell_peakhr{$key}; $db_result[2] = $key; foreach my $count (3..$array_size-1) { $db_result[$count] = "-"; }#foreach }#if print OUT join(',',@db_result)."\n"; }#foreach close(OUT); return ($output); }

The problem with this approach is that the CSV file (40MB) is loaded to DBI engine 5,000 times and takes hours to process.

Is there a fastest way to do this? the CSV file has the 5,000 sites with hourly data (0 to 23), this in an example of the CSV file:

DATE,HOUR,SITE,SALES,ITEMS,ITEMS2,ITEMS3,ITEMS4,ITEMS5,ITEMS6
11/02/2011,20,NAW6546,51.0,124.0,82.0,50.0,0.0,0.0,366.0
11/02/2011,21,NAW6546,72.0,300.0,82.0,50.0,0.0,0.0,214.0
11/02/2011,22,NAW6546,5.0,254.0,82.0,50.0,0.0,0.0,985.0
11/02/2011,23,NAW6546,47.0,530.0,82.0,50.0,0.0,0.0,517.0

And the hash %site_peakhr has the hour of more sales on each site NAW6546 => 15 NAW1457 => 22 ...

The other approach I was thinking of is to read the CSV line by line and search for the key value in the hash and if is in it, print it to the output file, if not, go to the next, but don't know if it will be faster.

Any suggestions?

In reply to Best way to match a hash with large CSV file by alphavax

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.