in reply to Need to process a tab delimted file *FAST*

Thanks for your help everyone...
Here's where I am at now:
#1. Text::CSV_XS is extremely slow for some reason.
#2. SQLite, etc. are not feasible here...
#3. I would like to see a 10-fold performance increase over the figures I describe below....
#4. So far Storable seems like the best option, but this leaves me with another problem... Please allow me to explain further:

This is essentially a "log file" containing something which might be represented this way:
hash->{KEY1}=VALUE1
hash->{KEY2}=VALUE2
hash->{KEY3}->{SUBKEY1}=VALUE3
hash->{KEY3}->{SUBKEY2}=VALUE4
hash->{KEY4}=VALUE5


So each "record" currently contains 16 keys (and that number is expected to grow).
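
To make this concrete, here is a simplified sketch of one record and the kind of "split"-based parse I'm describing. The key names and the "KEY3::SUBKEY1" way of encoding the nested values are placeholders rather than the real layout (the real records carry 16 keys):

use strict;

# placeholder record: a timestamp, then tab-separated KEY=VALUE pairs
my $line = "1078324156\tKEY1=10\tKEY2=3\tKEY3::SUBKEY1=7\tKEY3::SUBKEY2=2\tKEY4=5";
chomp $line;

my ($tstamp, @pairs) = split /\t/, $line;

my $hash = {};
for my $pair (@pairs) {
    my ($key, $value) = split /=/, $pair, 2;
    if ($key =~ /^(.+)::(.+)$/) {       # placeholder encoding for the nested SUBKEYs
        $hash->{$1}->{$2} = $value;
    }
    else {
        $hash->{$key} = $value;
    }
}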

That being said, here's the rest of the story: I need to compute TOTAL, MAXIMUM, and LAST values for each element of the hash (including the SUBKEYs) over several periods of time. So for a time period of, say, an hour, I would be looking for the total, maximum, and last value of each of those elements. The way I did this before was to build a second hash, modeled after the first, with the keys changed to "TOT::(key)", "MAX::(key)", and "LAST::(key)".
Benchmarks show that the "split/map" method plus computing these values takes somewhere around .0008 seconds (CPU time) per record. Using the Storable method I can read the original hash in very quickly, but iterating through it still benchmarks at .0007 seconds per record. There are enough records that at .0008 seconds apiece the whole run takes more than 60 seconds of machine time. While that may not seem enormous, it pushes me beyond the window I have to work in.
Here is the snippet I am using to iterate through the hash (note that I iterate through the records in reverse order, newest to oldest). I am typing this from memory, so it may not be exactly correct:
my @keys = keys(%{$hash});
my $junk;
foreach my $key (@keys) {
    if (ref($hash->{$key})) {
        my @keys2 = keys(%{$hash->{$key}});
        foreach my $key2 (@keys2) {
            $junk->{'TOT::'.$key}->{$key2} += $hash->{$key}->{$key2};
            if ($hash->{$key}->{$key2} > $junk->{'MAX::'.$key}->{$key2}) {
                $junk->{'MAX::'.$key}->{$key2} = $hash->{$key}->{$key2};
            }
            if (!defined($junk->{'LAST::'.$key}->{$key2})) {
                $junk->{'LAST::'.$key}->{$key2} = $hash->{$key}->{$key2};
            }
        }
    }
    else {
        $junk->{'TOT::'.$key} += $hash->{$key};
        if ($hash->{$key} > $junk->{'MAX::'.$key}) {
            $junk->{'MAX::'.$key} = $hash->{$key};
        }
        if (!defined($junk->{'LAST::'.$key})) {
            $junk->{'LAST::'.$key} = $hash->{$key};
        }
    }
}

Thanks, again, for all the helpful replies!

You folks are great!

- Greg

Re: Re: Need to process a tab delimted file *FAST*
by tilly (Archbishop) on Mar 03, 2004 at 15:46 UTC
    An obvious performance problem is that you keep accessing $hash->{$key}. Perl has to do the work of a hash lookup every time. Insert my $value = $hash->{$key}; and access $value repeatedly instead.

    Another small speedup: don't use keys to dump data into an array that exists only for you to run through it. Just iterate over the hash directly.

    If you can figure out how to push both the data and this work to a decent relational database, you will see bigger speedups still.

    Moving to a faster machine always helps. If your code is I/O bound and you can parallelize it, then that can speed things up. The same happens if you are CPU bound and have multiple CPUs or computers to take advantage of.

    But at 0.0008 seconds per record, if it takes over 60 seconds, then you have over 75,000 records. At some point you have to understand that computers are not magic. Doing work takes time. You aren't going to hit a particular performance level just because the powers that be said that you must.

      Specifically, tilly's advice can be implemented with judicious use of each. Doing this transforms your example into:
      my $junk;
      while (my ($outerkey, $outerval) = each %$hash) {
          if (ref($outerval)) {
              while (my ($innerkey, $innerval) = each %$outerval) {
                  $junk->{'TOT::'.$outerkey}->{$innerkey} += $innerval;
                  if ($innerval > $junk->{'MAX::'.$outerkey}->{$innerkey}) {
                      $junk->{'MAX::'.$outerkey}->{$innerkey} = $innerval;
                  }
                  if (!defined($junk->{'LAST::'.$outerkey}->{$innerkey})) {
                      $junk->{'LAST::'.$outerkey}->{$innerkey} = $innerval;
                  }
              }
          }
          else {
              $junk->{'TOT::'.$outerkey} += $outerval;
              if ($outerval > $junk->{'MAX::'.$outerkey}) {
                  $junk->{'MAX::'.$outerkey} = $outerval;
              }
              if (!defined($junk->{'LAST::'.$outerkey})) {
                  $junk->{'LAST::'.$outerkey} = $outerval;
              }
          }
      }
      Now, you've still got repeated recalculations of $junk->{'MAX::'.$outerkey} and $junk->{'LAST::'.$outerkey} (each involving a string concatenation and a hash lookup) per loop, whereas it should be possible to do each only once. I'll let you look at this first, though, to see if it gives an acceptable speed improvement (not likely, given that this is squeezing out the last few extra percent, and you want 90% of the time gone).
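
      If you do want to chase those remaining lookups, here's a rough, untested sketch of the idea: build each 'TOT::'/'MAX::'/'LAST::' sub-hash reference once per outer key and reuse it for every inner key. The sample $hash is made up just so the snippet runs on its own:

      use strict;

      my $hash = { KEY1 => 10, KEY3 => { SUBKEY1 => 7, SUBKEY2 => 2 } };   # made-up sample record
      my $junk;

      while (my ($outerkey, $outerval) = each %$hash) {
          if (ref($outerval)) {
              # do the string concatenation and top-level lookup once per outer key
              my $tot  = $junk->{'TOT::'.$outerkey}  ||= {};
              my $max  = $junk->{'MAX::'.$outerkey}  ||= {};
              my $last = $junk->{'LAST::'.$outerkey} ||= {};
              while (my ($innerkey, $innerval) = each %$outerval) {
                  $tot->{$innerkey} += $innerval;
                  if ($innerval > $max->{$innerkey}) {
                      $max->{$innerkey} = $innerval;
                  }
                  if (!defined($last->{$innerkey})) {
                      $last->{$innerkey} = $innerval;
                  }
              }
          }
          else {
              $junk->{'TOT::'.$outerkey} += $outerval;
              if ($outerval > $junk->{'MAX::'.$outerkey}) {
                  $junk->{'MAX::'.$outerkey} = $outerval;
              }
              if (!defined($junk->{'LAST::'.$outerkey})) {
                  $junk->{'LAST::'.$outerkey} = $outerval;
              }
          }
      }

      This trades three concatenations and top-level hash lookups per inner key for one set per outer key; how much it buys depends on how many SUBKEYs each key carries.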

      One test I'd do, if I were you, to make certain that the kind of speed you want is even vaguely possible: time the unix wc command against the input file. I often use wc as an absolute lower bound when looking at file processing speed issues, and figure I've done about as well as I can if I can get the runtime down to within twice the time the wc executable takes.

      Also, on your initial text processing question, is this a faster way to split up the file?

      my %hash;
      while ($line = <INPUTFILE>) {
          chomp($line);
          while ($line =~ /\t([^=]*)=([^\t]*)/g) {
              $hash{$1} = $2;
          }
      }
      This assumes that you're discarding $tstamp, which you seem to be.
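
      If you want to check whether it really is faster, a quick Benchmark comparison along these lines would tell you. The sample line and the split-based alternative are only my guesses at what the current code does, so adjust both to match:

      use strict;
      use Benchmark qw(cmpthese);

      # made-up record line: a timestamp, then tab-separated KEY=VALUE pairs
      my $line = "1078324156\tKEY1=10\tKEY2=3\tKEY3=7\tKEY4=5";

      cmpthese(-3, {
          split_map => sub {
              my @fields = split /\t/, $line;
              shift @fields;                           # drop the timestamp
              my %hash = map { split /=/, $_, 2 } @fields;
          },
          regex_g => sub {
              my %hash;
              while ($line =~ /\t([^=]*)=([^\t]*)/g) {
                  $hash{$1} = $2;
              }
          },
      });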
Re: Re: Need to process a tab delimted file *FAST*
by clintp (Curate) on Mar 03, 2004 at 13:56 UTC
    If the I/O isn't your problem then consider the following:

    * Don't use Perl. Re-write this in C, keep your keys in a tree of some kind (determined by the distribution of the keys) and the values (tot, max, last) in the tree. It's always faster to use a specifically designed tool (a single-purpose, well-written C program) than a general-purpose multitool (like Perl).

    * If you have any control over the input stream: instead of tab-delimited data, use space/column-delimited data. It's (probably) faster to extract two substrings of known width than to split. Leave the whitespace on the keys when storing, remove it later when redisplaying (if necessary).
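
    In Perl terms (if you do stay with Perl), the fixed-width idea might look like this rough sketch with unpack. The layout here - a 10-character timestamp followed by 16 twelve-character value fields - is invented purely for illustration, as is the file name, so the template has to be adjusted to the real columns:

    use strict;

    # invented layout: 10-character timestamp, then 16 fixed 12-character fields
    my $template = 'a10' . ('a12' x 16);

    open my $in, '<', 'input.log' or die "open: $!";   # hypothetical file name
    while (my $line = <$in>) {
        my ($tstamp, @fields) = unpack $template, $line;
        # the "a" template leaves the padding on each field; strip it later,
        # only where you actually need to display the value
        # ... accumulate TOT/MAX/LAST from @fields here ...
    }

    substr with known offsets works the same way; the win, if there is one, comes from skipping the character-by-character delimiter scan that split and the regex both have to do.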

Re: Re: Need to process a tab delimted file *FAST*
by jfroebe (Parson) on Mar 03, 2004 at 16:30 UTC
    Hi,

    I just wonder if splitting the workload between two or more instances of your script would help, whether that means updating a shared memory structure at the end of processing or something similar (please forgive me if this doesn't make a lot of sense - I have the flu and am at work).

    For example, script #1 processes the first 10,000 lines, script #2 processes the second 10,000 lines, and so on; each then updates a common structure at the end of its run, or pipes its results in parallel to a third script that merges the data. A rough sketch of the idea is below.
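
    Something like this rough sketch is what I have in mind, using plain fork and Storable files to hand the partial results back. The input file name, the number of workers, and the round-robin split of lines are all made up for illustration, and the per-line parsing/aggregation is left as a comment:

    use strict;
    use Storable qw(store);

    my $file    = 'input.log';   # hypothetical input file
    my $workers = 4;             # number of child processes

    for my $child (0 .. $workers - 1) {
        my $pid = fork;
        die "fork failed: $!" unless defined $pid;
        next if $pid;                                   # parent: start the next child

        open my $in, '<', $file or die "open $file: $!";
        my %partial;
        while (my $line = <$in>) {
            next unless ($. - 1) % $workers == $child;  # this child's share of the lines
            chomp $line;
            # ... parse $line and accumulate TOT/MAX/LAST into %partial,
            #     exactly as in the single-process version ...
        }
        store \%partial, "partial.$child";              # hand the partial result back
        exit 0;
    }

    wait for 1 .. $workers;                             # let every child finish

    # merge pass: retrieve("partial.$_") for 0 .. $workers - 1, then combine the
    # TOT (sum), MAX (max), and LAST entries into one final hash

    Note that every child still reads the whole file, so this only wins if the job is CPU-bound rather than I/O-bound.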

    jason