in reply to Re: Need to process a tab delimited file *FAST*
in thread Need to process a tab delimited file *FAST*

An obvious performance problem is that you keep on accessing $hash->{$key}, so Perl has to do the work of a hash lookup every time. Insert my $value = $hash->{$key}; once and access $value repeatedly instead.
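For example, a minimal before-and-after sketch (the hash layout and names here are placeholders, not your actual code):

    use strict;
    use warnings;

    my $hash = { widget => { count => 7 } };   # placeholder data
    my $key  = 'widget';
    my ($total, $max) = (0, 0);

    # Before: every access repeats the same nested hash lookup.
    #   $total += $hash->{$key}{count};
    #   $max = $hash->{$key}{count} if $hash->{$key}{count} > $max;

    # After: one lookup, then reuse the cheap lexical.
    my $count = $hash->{$key}{count};
    $total += $count;
    $max = $count if $count > $max;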

Another small speedup: don't use keys to dump the hash's contents into an array that exists only so you can loop over it. Just run directly through the hash.
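In practice that means each (as fizbin's reply below spells out); a sketch with placeholder data:

    use strict;
    use warnings;

    my %hash = (a => 1, b => 2);   # placeholder data

    # Instead of:  my @keys = keys %hash;  for my $key (@keys) { ... }
    # walk the hash one pair at a time, with no temporary key list:
    while (my ($key, $value) = each %hash) {
        print "$key=$value\n";
    }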

If you can figure out how to push both the data and this work to a decent relational database, you will see bigger speedups still.
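A minimal sketch of that idea using DBI with SQLite (the module choice, file names, and table layout are my assumptions, not anything from the thread; the parse mirrors the tab-separated key=value format discussed here):

    use strict;
    use warnings;
    use DBI;

    # Hypothetical input file and database names.
    open(INPUTFILE, '<', 'input.tab') or die "open: $!";

    my $dbh = DBI->connect('dbi:SQLite:dbname=records.db', '', '',
                           { RaiseError => 1, AutoCommit => 0 });
    $dbh->do('CREATE TABLE records (k TEXT, v REAL)');

    # Bulk-load the parsed pairs, then let the database do the math.
    my $ins = $dbh->prepare('INSERT INTO records (k, v) VALUES (?, ?)');
    while (my $line = <INPUTFILE>) {
        chomp $line;
        while ($line =~ /\t([^=]*)=([^\t]*)/g) {
            $ins->execute($1, $2);
        }
    }
    $dbh->commit;

    # Totals and maxima computed in one pass inside the engine.
    my $rows = $dbh->selectall_arrayref(
        'SELECT k, SUM(v), MAX(v) FROM records GROUP BY k');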

Moving to a faster machine always helps. If your code is I/O bound and you can parallelize it, then that can speed things up. The same happens if you are CPU bound and have multiple CPUs or computers to take advantage of.
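If the work really can be split, a plain fork fan-out is enough. A minimal sketch (the chunk file names and the process_file worker are hypothetical stand-ins for your real code):

    use strict;
    use warnings;

    # Hypothetical pre-split chunks; how you split the input is up to you.
    my @files = ('part1.tab', 'part2.tab', 'part3.tab');

    sub process_file {
        my ($file) = @_;
        # ... your per-chunk parsing and summarizing goes here ...
    }

    my @pids;
    for my $file (@files) {
        my $pid = fork;
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {           # child: handle one chunk, then exit
            process_file($file);
            exit 0;
        }
        push @pids, $pid;          # parent: remember the child
    }
    waitpid($_, 0) for @pids;      # wait for all the children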

But at 0.0008 seconds per record, if it takes over 60 seconds, then you have over 75,000 records. At some point you have to understand that computers are not magic. Doing work takes time. You aren't going to hit a particular performance level just because the powers that be said that you must.

Re: Re: Re: Need to process a tab delimited file *FAST*
by fizbin (Chaplain) on Mar 04, 2004 at 01:59 UTC
    Specifically, tilly's advice can be implemented with judicious use of each. Doing this transforms your example into:
    my $junk;
    while (my ($outerkey, $outerval) = each %$hash) {
        if (ref($outerval)) {
            while (my ($innerkey, $innerval) = each %$outerval) {
                $junk->{'TOT::'.$outerkey}->{$innerkey} += $innerval;
                if ($innerval > $junk->{'MAX::'.$outerkey}->{$innerkey}) {
                    $junk->{'MAX::'.$outerkey}->{$innerkey} = $innerval;
                }
                if (!defined($junk->{'LAST::'.$outerkey}->{$innerkey})) {
                    $junk->{'LAST::'.$outerkey}->{$innerkey} = $innerval;
                }
            }
        } else {
            $junk->{'TOT::'.$outerkey} += $outerval;
            if ($outerval > $junk->{'MAX::'.$outerkey}) {
                $junk->{'MAX::'.$outerkey} = $outerval;
            }
            if (!defined($junk->{'LAST::'.$outerkey})) {
                $junk->{'LAST::'.$outerkey} = $outerval;
            }
        }
    }
    Now, you've still got two recalculations of $junk->{'MAX::'.$outerkey} and $junk->{'LAST::'.$outerkey} (each involving a string concatenation and a hash lookup) per pass through the inner loop, where one might do; a sketch of the one-lookup version follows. But look at this first, to see if it gives acceptable speed improvements. That's not likely, given that this is squeezing out the last few extra percent and you want 90% of the time gone.
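    One way to get down to a single computation per key is to cache the three subhash references up front. A sketch covering just the nested-hash branch (untested; the variable names follow the code above):

    while (my ($outerkey, $outerval) = each %$hash) {
        next unless ref($outerval);
        # Build each 'TAG::key' string, and do its hash lookup, exactly once.
        my $tot  = $junk->{'TOT::'.$outerkey}  ||= {};
        my $max  = $junk->{'MAX::'.$outerkey}  ||= {};
        my $last = $junk->{'LAST::'.$outerkey} ||= {};
        while (my ($innerkey, $innerval) = each %$outerval) {
            $tot->{$innerkey}  += $innerval;
            $max->{$innerkey}   = $innerval if $innerval > $max->{$innerkey};
            $last->{$innerkey}  = $innerval unless defined $last->{$innerkey};
        }
    }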

    One test I'd run, if I were you, to make certain that the kind of speed you want is even vaguely possible: time the unix wc command against the input file (e.g., time wc -l inputfile). I often use wc as an absolute lower bound when looking at file-processing speed issues, and figure that I've done as well as I can if I can get within twice the time the wc executable takes.

    Also, on your initial text-processing question: is this a faster way to split up the file?

    my %hash;
    while (my $line = <INPUTFILE>) {
        chomp($line);
        while ($line =~ /\t([^=]*)=([^\t]*)/g) {
            $hash{$1} = $2;
        }
    }
    This assumes that you're discarding $tstamp, which you seem to be.
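    For example, on a hypothetical input line (your real field names will differ):

    # "1078358400\tuser=fred\tcount=42\n"
    #
    # The leading timestamp field is never preceded by a tab, so
    # /\t([^=]*)=([^\t]*)/g skips it, and %hash ends up as
    #   (user => 'fred', count => 42).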