josephjohn has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,
I need to parse a CSV file with 50,000 lines, get the computer name from each line, look up the associated data in another inventory CSV file (over 50,000 lines), process it, and generate output files.
My problem is that the search operation takes too long (a couple of hours to complete). I changed from a linear to a binary search, but with little success. I have also tried reading the whole file into arrays and processing it line by line.
Following is the code I use. The logic, or the way I implemented it, may be bad. Please provide your valuable suggestions.
Joseph
open DataFile, "Inventory.csv" or die ("Inventory.csv $!"); my @Inventory = <DataFile>; close DataFile; open DataFile, "clients.txt" or die ("clients.txt $!"); while (my $line = <DataFile>) { my @fields = split(/,/, $line); my $username; my @sys; my $beg = 0; my $end = $#Inventory; my $mid = int(($beg+$end)/2); while ($beg <= $end) { @sys = split(/,/,$Inventory[$mid]); if ($fields[3] eq $sys[0]) { $company = $sys[3]; $username = $sys[1]; #remove the matched line from inventory @Inventory = (@Inventory[0..$mid-1],@Inventory[$mid+1..$#I +nventory]); last; } elsif ($fields[3] lt $sys[0]) { $end = $mid - 1; } else { $beg = $mid + 1; } $mid = int(($beg+$end)/2); } push @Data, $fields[3].",".$company.",".$fields[6].",".$fields[7]. +",".$fields[4].",".$fields[11].",".$username.",".$sys[4]; } #processing @Data after this

Replies are listed 'Best First'.
Re: Data parsing - takes too long
by Limbic~Region (Chancellor) on Jan 13, 2006 at 18:50 UTC
    josephjohn,
    Unfortunately, you haven't given enough information for anyone to offer a working solution. See this for an example of well-defined requirements and assumptions. Here are some ideas, though.

    Instead of slurping the first file into an array, consider a hash. This will improve your search time, and using delete will also get rid of your terribly inefficient way of removing the matched record from the array (see splice as an alternative to your current method); a short sketch of the difference follows. Additionally, using a hash will give meaningful names to columns. If you are forced to use an array, consider the constant pragma to make the indices more meaningful.
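    For illustration only (this sketch is not from the original reply; @Inventory and $mid echo the posted code, while %inventory and $computer_name are stand-ins for a hash keyed on the lookup column):

    # 1. Rebuilding the array from two slices (what the posted code does):
    @Inventory = (@Inventory[0 .. $mid - 1], @Inventory[$mid + 1 .. $#Inventory]);

    # 2. splice removes the element in place, without copying the whole array:
    splice(@Inventory, $mid, 1);

    # 3. With a hash keyed on the computer name, removal is a single delete:
    delete $inventory{$computer_name};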

    Cheers - L~R

    Update: Here is a proof of concept

    use constant COMP => 0;
    use constant USER => 1;
    use constant UNKN => 2;

    my %inventory;
    open(my $fh_inv, '<', 'Inventory.csv')
        or die "Unable to open 'Inventory.csv' for reading: $!";
    while ( <$fh_inv> ) {
        chomp;
        my ($key, $company, $user, $unkn) = (split /,/)[0, 3, 1, 4];
        $inventory{$key} = [$company, $user, $unkn];
    }

    my @Data;
    open(my $fh_clients, '<', 'clients.txt')
        or die "Unable to open 'clients.txt' for reading: $!";
    while ( <$fh_clients> ) {
        chomp;
        my @field = split /,/;
        my $iref = $inventory{$field[3]};
        if ( $iref ) {    # skip clients with no matching inventory record
            push @Data, (join ',', $field[3], $iref->[COMP], @field[6, 7, 4, 11], @{$iref}[USER, UNKN]);
        }
    }
    # processing @Data after this
    You should also consider using Text::CSV_XS
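    As a rough, untested sketch (the file name and key column are carried over from the code above; the binary option is simply a sensible default, not something from this thread), building the lookup hash with Text::CSV_XS could look like this:

    use Text::CSV_XS;

    my $csv = Text::CSV_XS->new({ binary => 1 })
        or die "Cannot use Text::CSV_XS: " . Text::CSV_XS->error_diag();

    open my $fh, '<', 'Inventory.csv'
        or die "Unable to open 'Inventory.csv' for reading: $!";

    my %inventory;
    while (my $row = $csv->getline($fh)) {
        # key on the first column, keep the remaining columns for later lookups
        $inventory{ $row->[0] } = [ @{$row}[1 .. $#$row] ];
    }
    close $fh;

    Unlike a plain split /,/, this handles quoted fields that contain embedded commas.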
Re: Data parsing - takes too long
by ptum (Priest) on Jan 13, 2006 at 18:50 UTC

    It seems to me that this problem cries out for an associative array (hash). Try splitting your inventory file and storing the array of elements you need in a lookup hash (keyed by the first token). Then when you split your client file, you can check to see if a hash entry matching the 4th token exists, and if so, grab the necessary element array from that hash entry. You can use delete() to get rid of unwanted hash entries once you're done with them.

    Update: Added (untested) code sample:

    my %inventory_hash = ();

    open DataFile, "Inventory.csv" or die ("Inventory.csv $!");
    while (my $inventory_line = <DataFile>) {
        my @inventory_fields = split /,/, $inventory_line;
        my $inventory_key = shift @inventory_fields;
        if (exists($inventory_hash{$inventory_key})) {
            print "Warning! Duplicate inventory key detected: $inventory_key\n";
            next;
        }
        $inventory_hash{$inventory_key} = \@inventory_fields;
    }
    close DataFile;

    open DataFile, "clients.txt" or die ("clients.txt $!");
    while (my $line = <DataFile>) {
        my @fields = split(/,/, $line);
        if (exists($inventory_hash{$fields[3]})) {
            my @sys = @{$inventory_hash{$fields[3]}};
            # do stuff with the contents of @sys, print, whatever
            delete $inventory_hash{$fields[3]};
        }
    }

    Hope this rather abbreviated example helps.


    No good deed goes unpunished. -- (attributed to) Oscar Wilde
      Hi ptum and L~R,
      Thanks a ton for the help. A hash saved me. I'm learning the power of hashes day by day.
      Joseph
Re: Data parsing - takes too long
by glasswalk3r (Friar) on Jan 13, 2006 at 19:18 UTC

    Well, you should preprocess the data as much as you can before getting into the while loop. As the other monks have said, you should use a hash for faster lookups. If creating the hash from the entire file consumes too much memory, try breaking the input file into pieces, if possible. You should try it for the second file, too, if possible. Reading the entire file into memory is faster than reading it line by line (if memory is cheap for you).

    There are some other improvements you can try. I recommend this IBM article if you want to improve the performance of your program: http://www-128.ibm.com/developerworks/library-combined/l-optperl.html.

    You should also profile your program (with less input data, of course) using Devel::DProf, even before trying to optimize the code. Doing that will show you which areas of your code spend the most time executing. Here is a good link for that: http://www.perl.com/pub/a/2004/06/25/profiling.html
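    In practice (the script name here is just a placeholder for your own program), that amounts to running the script under the profiler and then summarizing the collected data with dprofpp:

    # run the script under Devel::DProf; profiling data is written to tmon.out
    perl -d:DProf parse_clients.pl

    # report the most expensive subroutines from the collected data
    dprofpp tmon.out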


    Alceu Rodrigues de Freitas Junior
    ---------------------------------
    "You have enemies? Good. That means you've stood up for something, sometime in your life." - Sir Winston Churchill
      If creating the hash from the entire file consumes too much memory, try breaking the input file into pieces, if possible

      or, better, use DB_File or similar to create an on-disk hash/tree.
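      A minimal sketch of that idea (the database file name and the key/value variables are placeholders, not taken from this thread):

      use DB_File;
      use Fcntl;    # for the O_CREAT and O_RDWR flags

      # tie %inventory to a B-tree file on disk instead of holding it all in RAM
      tie my %inventory, 'DB_File', 'inventory.db', O_CREAT | O_RDWR, 0644, $DB_BTREE
          or die "Cannot tie 'inventory.db': $!";

      # then use it exactly like an ordinary hash; lookups go to disk
      $inventory{$computer_name} = join ',', @inventory_fields;
      # ... lookups, deletes, etc. ...

      untie %inventory;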

Re: Data parsing - takes too long
by Fletch (Bishop) on Jan 13, 2006 at 19:00 UTC

    Consider also loading the data into an RDBMS (even just something like SQLite) and letting it do the heavy lifting.
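    As a rough sketch of that approach (the table layout and column names are assumptions, not something from this thread), using DBI with DBD::SQLite might look like:

    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=inventory.db', '', '',
                           { RaiseError => 1, AutoCommit => 1 });

    # assumes Inventory.csv has already been loaded into an "inventory" table
    my $sth = $dbh->prepare(
        'SELECT company, username FROM inventory WHERE computer_name = ?'
    );

    # one indexed lookup per client line instead of a scan of 50,000 rows
    $sth->execute($fields[3]);
    if (my ($company, $username) = $sth->fetchrow_array) {
        # build the output record here
    }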