Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
Hello monks,
I have created a script that manipulates text files around 80 MB in size (roughly 1.5 million lines), and the following block of code seems to be the bottleneck. The current code takes around 19 minutes to execute on a Sun box with a single 450 MHz processor (64-bit):
while (<SSLOG>)
{
    chomp (($date,$atime,$etime,$mth,$port,$attrID,$value) = split(/;/,$_));
    # Convert the log time to UTC time
    ($HH, $MM, $SS) = split(/:/, $etime);
    ($year, $month, $day) = $date =~ m/^(.{4})(.{2})(.*)/;
    $month -= 1;
    $Time = timelocal($SS, $MM, $HH, $day, $month, $year);
    # Collect non-zero data
    if ($Conf[7] =~ /NO/i) {
        if ($value !~ /^0/) {
            # Build a hash of MTypeHandles with their unique ports
            $Hash{$mth}{$port}++;
            my @entry = ();
            $attrID =~ s/\s*//g;
            $port =~ s/\s*//g;
            if ($port =~ /\-/) {
                $port =~ s/\-//;
            }
            $key = $attrID;
            $val = $DS_Info{$key};
            $val = join("", $val, $port);
            @entry = ($Time, $mth, $val, $value);
            # Populate the data array
            push @Data, [@entry];
        }
    }
    # Collect zero data
    elsif ($Conf[7] =~ /YES/i) {
        # Build a hash of MTypeHandles with their unique ports
        $Hash{$mth}{$port}++;
        my @entry = ();
        $attrID =~ s/\s*//g;
        $port =~ s/\s*//g;
        if ($port =~ /\-/) {
            $port =~ s/\-//;
        }
        $key = $attrID;
        $val = $DS_Info{$key};
        $val = join("", $val, $port);
        @entry = ($Time, $mth, $val, $value);
        # Populate the data array
        push @Data, [@entry];
    }
}
Any suggestions for improving performance?
Thanks
Re: Performance revision needed
by pfaut (Priest) on Jan 02, 2003 at 16:32 UTC
if ($Conf[7] =~ /NO/i) {
...
elsif($Conf[7] =~ /YES/i) {
@Conf appears to be set up outside of this loop. Evaluate this once and save the result instead of running a regex match every time through the loop.
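That point can be sketched like this minimal stand-alone example (the config value is invented; only the hoisting pattern is the point):

```perl
use strict;
use warnings;

# Hypothetical sketch of hoisting the check: @Conf never changes inside
# the loop, so run the regex once and keep the result in a cheap flag.
my @Conf;
$Conf[7] = 'NO';                                # stand-in config value
my $keep_zero = ($Conf[7] =~ /YES/i) ? 1 : 0;   # evaluated once

# inside the 1.5-million-line loop the test is then a plain scalar:
# next if !$keep_zero && $value =~ /^0/;
```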
$attrID =~ s/\s*//g;
$port =~ s/\s*//g;
I think replacing the '*' with '+' might help here.
$val = join("", $val, $port);
How about $val .= $port?
push @Data, [@entry];
Since @entry is a lexical variable, you should be able to use \@entry instead of [@entry] which needs to make a copy of the list.
Unless I'm missing something, both blocks of code do the same thing. It might be better from a maintenance standpoint to check if either condition is true and only have the block once.
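A minimal sketch of that merge, with made-up stand-in variables for $Conf[7] and the parsed field:

```perl
use strict;
use warnings;

# Hypothetical merge of the two identical blocks into one: the condition
# decides whether the record is wanted, and the body exists only once.
my ($conf7, $value) = ('NO', '1000');   # invented sample values
my $wanted = ($conf7 =~ /YES/i)
          || ($conf7 =~ /NO/i && $value !~ /^0/);
if ($wanted) {
    # ... single copy of the record-building code ...
}
```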
---
print map { my ($m)=1<<hex($_)&11?' ':'';
$m.=substr('AHJPacehklnorstu',hex($_),1) }
split //,'2fde0abe76c36c914586c';
Re: Performance revision needed
by gjb (Vicar) on Jan 02, 2003 at 16:37 UTC
Great ideas... you wouldn't happen to have a suggestion for some revised conditional statements, would you? I've been staring at this code too long...
Thanks again
Re: Performance revision needed
by talexb (Chancellor) on Jan 02, 2003 at 16:47 UTC
As I understand it, using the 'i' modifier in Perl slows down the regex code. You might benchmark using /YES/ || /yes/ instead.
You appear to be using the s/// operator to remove any spaces. My guess is that you can replace that with tr/ //d and get a speed improvement.
Finally, it looks like the two blocks of code are identical. Can you instead check for (NO and $value not zero) or YES, and keep only one copy of the block?
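The s/// vs tr///d claim is easy to measure with the core Benchmark module; here is a minimal harness (the sample string is invented, and tr cannot use the \s class, so the whitespace characters are listed explicitly):

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $sample = '  attid 123  ';   # invented sample field

# Compare the original substitution against tr///d on a copy of the
# sample; cmpthese prints rates and a relative-speed table.
cmpthese(100_000, {
    subst => sub { (my $s = $sample) =~ s/\s+//g;  },
    trd   => sub { (my $s = $sample) =~ tr/ \t//d; },
});
```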
--t. alex
Life is short: get busy!
Re: Performance revision needed
by rir (Vicar) on Jan 02, 2003 at 17:41 UTC
#!/usr/bin/perl
use warnings;
use strict;
use Time::Local;

# @Conf, %Hash, @Data and %DS_Info are assumed to be declared elsewhere
#
# move declarations out of loop
#
my ( $date, $atime, $etime, $mth, $port, $attrID, $value );
my ( $HH, $MM, $SS );
my ( $year, $month, $day );
my ( $Time, $val );
while (<SSLOG>) {
    # chomp one item not seven
    chomp;
    ($date, $atime, $etime, $mth, $port, $attrID, $value) =
        split /;/, $_;
    # combine conditions into one
    # could be optimized by knowing the data
    # and short circuiting
    #
    if ( ($Conf[7] =~ /NO/i) && ($value !~ /^0/)
         or ($Conf[7] =~ /YES/i)
       ) {
        # Move time computation inside conditional
        #
        # Convert the log time to UTC time
        ($HH, $MM, $SS) = split /:/, $etime;
        ($year, $month, $day) =
            $date =~ m/^(.{4})(.{2})(.*)/;
        $month -= 1;
        $Time = timelocal($SS, $MM, $HH, $day, $month, $year);
        # Build a hash of MTypeHandles with their unique ports
        $Hash{$mth}{$port}++;
        $attrID =~ s/\s*//g;
        $port =~ s/\s*//g;
        # Remove unnecessary condition
        #
        # if ( $port =~ /\-/ ) {
        #     $port =~ s/\-//;
        # }
        $port =~ s/\-//;
        # remove unnecessary variable
        #
        # $key = $attrID;
        # $val = $DS_Info{$key};
        #
        $val = $DS_Info{$attrID};
        # Remove unnecessary join.
        # $val = join ( "", $val, $port );
        #
        $val .= $port;
        # Remove unnecessary variable
        # @entry = ( $Time, $mth, $val, $value );
        #
        # This is a big array you are making
        # What do you do with it?
        # Is there any other choice?
        #
        # Populate the data array
        # push @Data, [@entry];
        #
        push @Data, [ $Time, $mth, $val, $value ];
    }
}
Thank you rir,
Your suggestions and a combination of the others have improved performance greatly.
Thanks again rir, and others who posted suggestions.
Re: Performance revision needed
by Thelonius (Priest) on Jan 02, 2003 at 18:45 UTC
Many of these suggestions are good, but they are all micro-optimizations. I suspect that your problem is that you are using a lot of memory. Try commenting out the lines that save the data, specifically, $Hash{$mth}{$port}++;
# and especially:
push @Data, [@entry];
Then run it and see how fast it is. If it is fast, then you should look into getting rid of @Data and maybe $Hash. It may be that $Hash isn't that large--it's hard to tell from your code. (Also, don't you want to edit $port before you increment $Hash{$mth}{$port}?)
The easiest thing to do is probably write ($Time, $mth, $val, $value) out to a new file and process that in a second pass just as you now process @Data. Or, possibly, you can do it in one pass and not save the data at all.
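A sketch of that two-pass idea (filename and sample record are invented; the inner loop stands in for the original while loop):

```perl
use strict;
use warnings;

our @last;   # captures the last record read back, for illustration

# First pass: stream each record to disk instead of growing @Data.
open my $tmp, '>', 'pass1.tmp' or die "pass1.tmp: $!";
for my $rec ([1041520320, 12, 'ds234', 1000]) {   # stand-in for the real loop
    print {$tmp} join(';', @$rec), "\n";
}
close $tmp or die $!;

# Second pass: re-read the intermediate file and process it the way
# the original script processed @Data.
open my $in, '<', 'pass1.tmp' or die "pass1.tmp: $!";
while (my $line = <$in>) {
    chomp $line;
    my ($Time, $mth, $val, $value) = split /;/, $line;
    @last = ($Time, $mth, $val, $value);
    # ... aggregation that used @Data goes here ...
}
close $in;
unlink 'pass1.tmp';
```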
Re: Performance revision needed
by Nitrox (Chaplain) on Jan 02, 2003 at 16:19 UTC
Update:
Thanks to IlyaM for pointing out my mis-choice of modules and explaining how Devel::SmallProf is a better choice than Devel::DProf.
-Nitrox
Re: Performance revision needed
by poj (Abbot) on Jan 02, 2003 at 19:33 UTC
Taking a guess at what your records look like:
20020202;10:10:10;12:12:12;12;234;attid ;1000
I created a 500k line file and ran the code suggested by rir on my winXP 1.1GHz 256Mb laptop and it took 57 seconds. I realise that because all the records are the same this is not a very scientific test.
Without the timelocal call it was 26 seconds.
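Given how much of the runtime that measurement attributes to timelocal, one possibility (a sketch, not from the thread; the helper name is invented) is to cache conversions keyed on the raw date/time strings, since log files usually repeat the same second across many lines:

```perl
use strict;
use warnings;
use Time::Local qw(timelocal);

# Hypothetical memoization of the conversion: repeated timestamps pay
# the timelocal cost only once per distinct second.
my %time_cache;

sub log_time {
    my ($date, $etime) = @_;
    my $key = "$date $etime";
    unless (exists $time_cache{$key}) {
        my ($HH, $MM, $SS) = split /:/, $etime;
        my ($year, $month, $day) = $date =~ /^(.{4})(.{2})(.*)/;
        $time_cache{$key} =
            timelocal($SS, $MM, $HH, $day, $month - 1, $year);
    }
    return $time_cache{$key};
}
```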
However, 19 minutes does seem a long time. Maybe memory or disk I/O is having a significant effect.
poj