PerlMonks  

SAS log scanner

by nop (Hermit)
on Sep 01, 2000 at 23:31 UTC ( [id://30791]=sourcecode )
Category: utilities
Author/Contact Info nop
Description: Large SAS programs generate large log files. This simple Perl script scans a log for signs of trouble and reports the line number of each one.
# scan SAS logs for problems 
# aprk
# 7/20/98

@problem = (
"0  OBS",
"AT LEAST",
"EXTRANEOUS",
"CARTESIAN",
"CLOSING",
"CONVERT",
"DIVISION BY ZERO",
"DOES NOT EXIST",
"DUE TO LOOPING",
"END OF MACRO",
"ENDING EXECUTION",
"ERROR",
"ERRORABEND",
"ERRORCHECK=STRICT",
"EXCEED",
"HANGING",
"HAS 0 OBSERVATIONS",
"ILLEGAL",
"INCOMPLETE",
"INVALID",
"LOST CARD",
"MATHEMAT",
"MERGE STATEMENT",
"MISSING",
"MULTIPLE",
"NOT FOUND",
"NOT RESOLVED",
"OBS=0",
"REFERENCE",
"REPEAT",
"SAS CAMPUS DRIVE",
"SAS SET OPTION OBS=0",
"SAS WENT",
"SHIFTED",
"STOP",
"TOO SMALL",
"UNBALANCED",
"UNCLOSED",
"UNINITIALIZED",
"UNREF",
"UNRESOLVED",
"WARNING"
);
$numproblem = @problem;

while(<>) {
  $line++;  
  for ($i=0; $i<$numproblem; $i++) {
    $p = $problem[$i];
    if (/$p/i) {print "line $line: problem: $problem[$i]\n$_\n";}
  }
}
Replies are listed 'Best First'.
RE (tilly) 1: SAS log scanner
by tilly (Archbishop) on Sep 02, 2000 at 01:57 UTC
    OK, I am going to have everything from trivial style to obscure performance issues. :-)

    First, use strict is a very good habit.

    Secondly, there is no need to keep track of the line number yourself; it is already in $. (the current input line number). Likewise, the name of the file you are reading from is in $ARGV. So the extra info you want is all available.

    Thirdly, the explicit C-style for loop is slower than a native Perl-style foreach loop. Plus, foreach lets you get rid of $numproblem.
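    A minimal sketch of the first three points, assuming a shortened problem list: strict, $. and $ARGV in place of a hand-kept counter, and foreach in place of the C-style loop. The helper name problems_in is made up for illustration, and the scan only runs when log files are named on the command line.

```perl
use strict;
use warnings;

# Shortened, illustrative subset of the original @problem list.
my @problem = ('ERROR', 'WARNING', 'UNINITIALIZED', 'NOT FOUND');

# Hypothetical helper: build the report entries for one line of input.
sub problems_in {
    my ($text, $file, $lineno) = @_;
    my @hits;
    foreach my $p (@problem) {    # no index variable, no $numproblem
        push @hits, "$file line $lineno: problem: $p\n$text\n"
            if $text =~ /\Q$p\E/i;
    }
    return @hits;
}

# Only scan when log files are named on the command line.
if (@ARGV) {
    while (my $text = <>) {
        # $ARGV is the file being read, $. the current line number.
        print problems_in($text, $ARGV, $.);
    }
    continue {
        close ARGV if eof;        # makes $. restart at 1 for each file
    }
}
```

    Without the close ARGV line, $. keeps counting across all the files given to <>, which may or may not be what you want in the report.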

    Fourth, interpolating $p into the RE on every pass forces the RE to be recompiled each time the string changes, which will be slow. You can and should push the looping logic down into the RE engine, which can do it much faster than you can. Just quotemeta the strings, join them with pipes, and then use qr to produce your RE.
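    Roughly what that looks like, as a sketch rather than a drop-in replacement. The list is a shortened subset, first_problem is a made-up helper name, and sorting the strings longest first is my own assumption so that "ERRORABEND" is reported in preference to its prefix "ERROR". Note that a single alternation reports only the first problem found on a line, where the original reports every matching string.

```perl
use strict;
use warnings;

# Shortened, illustrative subset of the original @problem list.
my @problem = ('ERROR', 'ERRORABEND', 'HAS 0 OBSERVATIONS', 'OBS=0', 'WARNING');

# Escape each string, put longer strings first, and join with pipes.
my $alternatives = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } @problem;

# Compile once, outside the read loop.
my $problem_re = qr/($alternatives)/i;

# Hypothetical helper: the first problem string found in $text, or undef.
sub first_problem {
    my ($text) = @_;
    return uc $1 if $text =~ $problem_re;
    return;
}

# Only scan when log files are named on the command line.
if (@ARGV) {
    while (my $text = <>) {
        my $hit = first_problem($text);
        print "line $.: problem: $hit\n$text\n" if defined $hit;
    }
}
```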

    Finally, it is possible to speed up the RE by building a trie out of the strings. Someone threatened a CPAN module to do that, but I don't know if it happened. In any case the optimization should eventually show up in the RE engine. But I will go home and code it up for fun anyways...expect more from me here over the weekend. :-)

      Doing a great big regex with lots of /this|that|the other|etc/ is documented as probably being slower than /this/ || /that/ || /the other/ || /etc/ (or at least was documented -- this may have been dropped when perlre.pod was created, perhaps because it is no longer true). Of course, this will vary by case so a benchmark will tell you for sure.

      So, based on my old bias, I'd code it one of these two ways:

      @problem = (
        "0 OBS", "AT LEAST", "EXTRANEOUS", "CARTESIAN", "CLOSING", "CONVERT",
        "DIVISION BY ZERO", "DOES NOT EXIST", "DUE TO LOOPING", "END OF MACRO",
        "ENDING EXECUTION", "ERROR", "ERRORABEND", "ERRORCHECK=STRICT", "EXCEED",
        "HANGING", "HAS 0 OBSERVATIONS", "ILLEGAL", "INCOMPLETE", "INVALID",
        "LOST CARD", "MATHEMAT", "MERGE STATEMENT", "MISSING", "MULTIPLE",
        "NOT FOUND", "NOT RESOLVED", "OBS=0", "REFERENCE", "REPEAT",
        "SAS CAMPUS DRIVE", "SAS SET OPTION OBS=0", "SAS WENT", "SHIFTED", "STOP",
        "TOO SMALL", "UNBALANCED", "UNCLOSED", "UNINITIALIZED", "UNREF",
        "UNRESOLVED", "WARNING"
      );

      # First way: build one sub that tries each string as its own regex.
      # $1 is only visible inside the block that matched, so the sub
      # returns it (or undef when nothing matched).
      my $code= "sub { /(" . join ")/i || /(", map {"\Q$_\E"} @problem;
      $code .= ")/i ? \$1 : undef }";
      my $match= eval $code;
      die "$@" unless ref($match) && UNIVERSAL::isa($match,"CODE");
      while(<>) {
        my $hit= $match->();
        if( defined $hit ) {
          print "line $.: problem: $hit\n$_\n";
        }
      }

      # Second way: upper-case the line once and search it with index().
      while(<>) {
        my $up= uc $_;
        foreach my $p ( @problem ) {
          if( 0 <= index($up,$p) ) {
            print "line $.: problem: $p\n$_\n";
            last;
          }
        }
      }

      Both of my solutions differ from the original in that they only report one problem per line. My "second way" can be made like the original by simply removing the "last;" line. I see no simple way to make my "first way" like the original.

      I apologize, but I didn't feel up to creating a test input file so I didn't test nor benchmark. I'd be interested in seeing test and benchmark results on real data.
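      For anyone who wants to run that comparison, here is a sketch of a harness using the core Benchmark module. The sample log lines are made up and the problem list is shortened, so any numbers it prints only mean something once @log is filled from a real SAS log. Each candidate counts the lines that contain at least one problem string, and the benchmark only runs when a number of CPU seconds per candidate is given on the command line (the script name bench.pl below is hypothetical).

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Shortened, illustrative subset of the original @problem list.
my @problem = ('ERROR', 'WARNING', 'UNINITIALIZED', 'NOT FOUND', 'HAS 0 OBSERVATIONS');

# Made-up sample lines standing in for a real SAS log.
my @log = (
    "NOTE: The data set WORK.A has 120 observations and 4 variables.\n",
    "WARNING: Variable X is uninitialized.\n",
    "NOTE: DATA statement used 0.03 seconds.\n",
    "ERROR: File WORK.B.DATA does not exist.\n",
) x 250;

# Precompiled regexes: one per string, and one big alternation.
my @each_re = map { qr/\Q$_\E/i } @problem;
my $any_re  = do { my $alt = join '|', map { quotemeta($_) } @problem; qr/$alt/i };

sub by_separate_regexes {
    my $n = 0;
    LINE: for my $line (@log) {
        for my $re (@each_re) {
            if ($line =~ $re) { $n++; next LINE }
        }
    }
    return $n;
}

sub by_alternation {
    return scalar grep { $_ =~ $any_re } @log;
}

sub by_index {
    my $n = 0;
    LINE: for my $line (@log) {
        my $up = uc $line;
        for my $p (@problem) {
            if (index($up, $p) >= 0) { $n++; next LINE }
        }
    }
    return $n;
}

# A negative count tells cmpthese to run each candidate for at least
# that many CPU seconds.
my $seconds = shift @ARGV;
cmpthese(-$seconds, {
    separate    => \&by_separate_regexes,
    alternation => \&by_alternation,
    index       => \&by_index,
}) if $seconds;
```

      Run it as, say, perl bench.pl 2 to give each candidate at least two CPU seconds.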

              - tye (but my friends call me "Tye")
        Gosh dang, but you are right.

        I just finished creating my sophisticated REs, saw your response, and tried both of your versions. Your first one crashed, but your second ran just fine, and significantly faster than mine. Mine outran the original by a good factor as well. (I used the example code as a logfile to test. :-)

        I guess the constant string search really is a nice win. Thanks for that lesson! :-)

        I don't doubt that index() is faster than regexes; but it seems that, for regexes, the coderef approach would be slower than a qr// method:
        @problem = map [$_, qr/\Q$_/i], @problem;
        while (<>) {
          for my $p (@problem) {
            print "line: $. problem: $p->[0]\n" if /$p->[1]/;
          }
        }
        But I'm also too lazy to generate a good input file for benchmarking.