vxp has asked for the wisdom of the Perl Monks concerning the following question:

What I am trying to do with this script is parse through a lot of data and generate SQL queries to update the DB with values that I'm getting while the script is running. However, it's very slow at the moment. I need suggestions on how to speed this thing up :)

#!/usr/bin/perl -w
use strict;

sub weekday {
    my ($day, $month, $year) = @_;
    if ($month < 3) {
        $month += 12;
        --$year;
    }
    my $tmp = $day + int((13 * $month - 27)/5) + $year + int($year/4)
            - int($year/100) + int($year/400);
    return ($tmp % 7);
}

my $days_back = shift || 120;
my ($start, $end);
my @days = ("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat");
my @days_month = (31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31);
my ($day, $month, $year) = (localtime)[3,4,5];

# Adjust year and month to readable stuff
$year += 1900;
$month++;
$end = sprintf("%d-%02d-%02d", $year,$month,$day);
$day -= $days_back;

# If we went below 1, then we're into prev month
# we while() here for when $days_back is larger than 1 month
while ($day < 1) {
    # Don't forget to pad Feb for leap years
    if ($month == 3 && ($year % 4) == 0) { $days_month[1]++ }
    # decrement the month and calc the day
    $day += $days_month[--$month - 1];
    if ($month < 1) {
        $month = 12;
        $year--;
    }
}
$start = sprintf("%d-%02d-%02d", $year,$month,$day);
print"$start\n$end\n\n";

system("parsecache -f nlclick -s $start -e $end > nlclick.cache");
system("parsecache -f nlimage -s $start -e $end > nlopen.cache");
#print "$start\n$end\n\n";

for (my $i = 0; $i < $days_back; $i++) {
    my $wday = weekday($day, $month, $year);
    if ($wday > 0 && $wday < 6) {
        my $date = sprintf("%d%02d%02d", $year,$month,$day);
        #print "$date\n";
        $_ = `report pattern --regex='nl=$date' --filter=nlimage --sumonly < nlopen.cache`;
        my ($opens) = /(.*)/;
        if (defined($opens)) {
            $_ = `report pattern --regex='nl=$date' --filter=standard --sumonly < nlclick.cache`;
            my ($transfers) = /(.*)/;
            if (!defined($transfers)) { $transfers = 0 }
            printf("update msnNewsletter set open='%d', click='%d' where dateCreated like '%d-%02d-%02d%%';\n", $opens, $transfers,$year,$month,$day);
        }
    }
    if (++$day > $days_month[$month - 1]) {
        $day = 1;
        if (++$month > 12) {
            $month = 1;
            $year++;
        }
    }
}

unlink "nlclick.cache";
unlink "nlopen.cache";

If you need to look at the report tool that I'm using there, tell me and I will post that as well.

Replies are listed 'Best First'.
Re: Parsing a lot of data. Very slow. Need suggestions
by dws (Chancellor) on Aug 15, 2002 at 20:31 UTC
    it's very slow at the moment. I need suggestions on how to speed this thing up :)

    I count 16 process invocations (17, including the script), and from the look of things, the script is contributing noise to the overall time. It should be trivial to add some timing code to measure how much time each subprocess is taking.

    Then you can figure out how to optimize or combine your subprocess scripts. For example, do you really need to run parsecache twice to generate separate files? Or could that script be changed to spit out two files from a single invocation? Ditto for report. That would cut subprocess invocation in half, which would have to be a win.
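
The timing measurement suggested above could look roughly like the sketch below. It uses the core Time::HiRes module; the helper name timed() is made up for illustration, and the demo loop merely stands in for a real parsecache or report invocation.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Hypothetical helper: run a code reference and return
# (elapsed wall-clock seconds, the code's return value).
sub timed {
    my ($code) = @_;
    my $t0     = [gettimeofday];
    my $result = $code->();
    return (tv_interval($t0), $result);
}

# With the real tools, the call might look like:
#   my ($secs) = timed(sub {
#       system("parsecache -f nlclick -s $start -e $end > nlclick.cache")
#   });
# Here a plain loop stands in for the subprocess.
my ($secs, $value) = timed(sub { my $n = 0; $n += $_ for 1 .. 100_000; $n });
printf STDERR "%-12s %.4f s\n", 'demo loop', $secs;
```

Wrapping each system() and backtick call this way and printing the totals to STDERR should show quickly whether parsecache or the repeated report runs dominate.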

Re: Parsing a lot of data. Very slow. Need suggestions
by FoxtrotUniform (Prior) on Aug 15, 2002 at 19:48 UTC
      I need suggestions on how to speed this thing up

    Start by profiling it, to figure out which bits are slowing you down. Don't waste effort optimizing code that's already running fast enough.

    --
    F o x t r o t U n i f o r m
    Found a typo in this node? /msg me
    The hell with paco, vote for Erudil!

Re: Parsing a lot of data. Very slow. Need suggestions
by DamnDirtyApe (Curate) on Aug 15, 2002 at 19:53 UTC

    Well, off the top of my head, I'd say you could make all of your date calculations a whole lot simpler with Date::Calc.

    Update: Here's the script, sans hand-rolled date processing:

    #! /usr/bin/perl
    use strict ;
    use warnings ;
    $|++ ;
    use Date::Calc qw( Add_Delta_Days Day_of_Week Today ) ;

    my $days_back = shift || 120;
    my @start_ymd = Add_Delta_Days( Today, -$days_back ) ;
    my @end_ymd   = Today ;
    my $start = sprintf( "%d-%02d-%02d", @start_ymd ) ;
    my $end   = sprintf( "%d-%02d-%02d", @end_ymd ) ;
    print"$start\n$end\n\n";

    system("parsecache -f nlclick -s $start -e $end > nlclick.cache");
    system("parsecache -f nlimage -s $start -e $end > nlopen.cache");

    for my $i ( 0 .. $days_back - 1 ) {
        # Day_of_Week returns 1 (Monday) through 7 (Sunday)
        my $wday = Day_of_Week ( @start_ymd ) ;
        if ( $wday > 0 && $wday < 6 ) {
            my $date = sprintf("%d%02d%02d", @start_ymd );
            #print "$date\n";
            # declared with my so the script compiles under strict
            my $opens = `report pattern --regex='nl=$date' --filter=nlimage --sumonly < nlopen.cache`;
            if ( defined( $opens ) ) {
                my $transfers = `report pattern --regex='nl=$date' --filter=standard --sumonly < nlclick.cache`;
                if ( !defined( $transfers ) ) { $transfers = 0 }
                printf("update msnNewsletter set open='%d', click='%d' where dateCreated like '%d-%02d-%02d%%';\n", $opens, $transfers, @start_ymd );
            }
        }
        # step to the next day; otherwise every pass reuses the same date
        @start_ymd = Add_Delta_Days( @start_ymd, 1 ) ;
    }

    unlink "nlclick.cache";
    unlink "nlopen.cache";
    __END__

    Your original handling of the system calls is rather odd, assigning to $_ and then capturing it all to a variable. The only thing that comes to mind is taint checking, which you're not doing here, so that section can also be simplified some.

    Note that none of these changes necessarily has anything to do with the speed of your program, but they may affect its accuracy and certainly improve its maintainability.
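
One way to simplify the capture step mentioned above, as a sketch only: take the backtick output directly and reduce it to its first line, falling back to a default when the command printed nothing. The helper name first_line() is hypothetical, and the assumption that report prints its sum on the first line is not confirmed anywhere in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first line of captured command output,
# or $default when the output is missing or its first line is empty.
sub first_line {
    my ($output, $default) = @_;
    return $default unless defined $output && length $output;
    my ($line) = split /\n/, $output;
    return $default unless defined $line && length $line;
    return $line;
}

# With the real tool, the call might look like:
#   my $opens = first_line(scalar `report pattern ... < nlopen.cache`, 0);
print first_line("42\nextra\n", 0), "\n";
```

This replaces the assign-to-$_-then-match idiom with a single expression per command and makes the "no output" case explicit.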


    _______________
    DamnDirtyApe
    Those who know that they are profound strive for clarity. Those who
    would like to seem profound to the crowd strive for obscurity.
                --Friedrich Nietzsche
Re: Parsing a lot of data. Very slow. Need suggestions
by mephit (Scribe) on Aug 15, 2002 at 21:06 UTC
    Hmm, I haven't gone over this with a fine-tooth comb, but do you really need to have those `report pattern...` lines inside a for loop? The only variable inside those two lines is $date, which appears to be a function of other variables that are set outside the for loop. In other words, are you executing the exact same system call in those backticks each time through the loop? (Maybe the debugger or profiler would be able to tell you? I don't really know.) If that's the case, you might want to move those lines outside of the loop? Or maybe you have a reason for calling the same exact command several times. Or maybe I'm just missing the point completely. *shrug*

    You also mentioned "SQL queries" and updating a database. There's no SQL in this script at all. Perhaps it's in one of the external programs? Regardless, if you're not using placeholders when updating your database, you may want to consider doing so. That'll save time when interacting with the database.

    --

    There are 10 kinds of people -- those that understand binary, and those that don't.

      There's no SQL in this script at all.

      I wonder what

      printf("update msnNewsletter set open='%d', click='%d' where dateCreated like '%d-%02d-%02d%%';\n", $opens, $transfers,$year,$month,$day);
      is, then? :)
      --
      Mike
      It's not the same command, because the $date variable changes each time the process is called. In this case, it's 120 dates, so 120 different commands, and then 120 more for another similar command. They need to be in a loop. :)
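
If the per-date counts could be computed inside the script, the 240 report invocations might collapse into one pass over each cache file. The sketch below rests on loud assumptions: that each cache record is a single line containing nl=YYYYMMDD, and that --sumonly amounts to counting matching lines. Neither the cache format nor the report tool is shown in the thread, so treat count_by_date() as an illustration, not a drop-in replacement.

```perl
use strict;
use warnings;

# Hypothetical helper: read a filehandle once and count the lines that
# mention each nl=YYYYMMDD value.
# ASSUMPTION: one record per line; the real format is not in the thread.
sub count_by_date {
    my ($fh) = @_;
    my %count;
    while (my $line = <$fh>) {
        $count{$1}++ if $line =~ /nl=(\d{8})/;
    }
    return \%count;
}

# Demo on in-memory data standing in for nlopen.cache.
my $sample = "GET /img?nl=20020814\nGET /img?nl=20020814\nGET /img?nl=20020815\n";
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
my $opens = count_by_date($fh);
close $fh;
print "$_ $opens->{$_}\n" for sort keys %$opens;
```

The date loop would then look up $opens->{$date} in the hash instead of spawning a process per date.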
        Use placeholders. There is a lot of overhead in your RDBMS reparsing an SQL statement...
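
A minimal sketch of the placeholder approach, assuming the updates are run from Perl through DBI and not printed as text. The function name update_counts() and the row layout are invented for illustration; $dbh is assumed to be a handle obtained from DBI->connect, whose connection details are not part of the thread.

```perl
use strict;
use warnings;

# Hypothetical helper. $dbh is assumed to be a connected DBI database
# handle; $rows is an array ref of [opens, clicks, 'YYYY-MM-DD'] triples.
sub update_counts {
    my ($dbh, $rows) = @_;
    # Prepare once so the statement text does not change between rows.
    my $sth = $dbh->prepare(
        'update msnNewsletter set open = ?, click = ? where dateCreated like ?'
    );
    my $executed = 0;
    for my $row (@$rows) {
        my ($opens, $clicks, $date) = @$row;
        # Bind the values for this row; "$date%" keeps the original
        # LIKE prefix match on dateCreated.
        $sth->execute($opens, $clicks, "$date%");
        $executed++;
    }
    return $executed;    # number of execute() calls made
}
```

Only the bound values differ between rows, so a driver that supports server-side prepared statements can reuse the parsed statement across all 120 dates.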