Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

In the middle of a Perl script, I'm grepping a logfile for a URL path that begins with "/acm/". When I try to use "/^\/$dir\//" I don't get the right result. When I just use "/$dir/", I definitely don't get what I want. What would be the best method for checking a log file for each URL that comes from the acm directory? JJ

Replies are listed 'Best First'.
Re: grepping for
by lhoward (Vicar) on May 12, 2000 at 18:42 UTC
    I think you are on the right path with your first example. You probably just need to strip the protocol and host information out of the URL first. You could use the URI module to parse the URL apart and reg-match on the path part.
    use URI;
    my $dir = 'acm';
    # ... looping through the file ...
    # $foo contains the URL from the logfile
    my $u = URI->new($foo);
    if ($u->path() =~ /^\/$dir\//) {
        # .. do stuff in here 'cause we got an "/acm/" line
    }
    That probably isn't the fastest or most efficient way to do it, so you may want to tune the code if this is something that you will do often.
Re: grepping for
by chromatic (Archbishop) on May 12, 2000 at 19:07 UTC
    Does $dir contain only "acm"? What does your regex look like in the code? What do you think it looks like when $dir is expanded?

    I might write code something like this:

    my $dir = "/acm";
    # open file
    while (<LOG>) {
        next unless /^\Q$dir\E/;
        # do something with $_ because it matches
    }
    If you slurp all of the lines into an array, you could also use the grep command.
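A minimal sketch of the grep approach, assuming the lines are already in an array and that the request path is the seventh whitespace-separated field, as in Apache's common log format. The sample lines and the @lines and @matches names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample lines in Apache common log format.
my @lines = (
    '10.0.0.1 - - [12/May/2000:18:42:00 -0400] "GET /acm/index.html HTTP/1.0" 200 1234',
    '10.0.0.2 - - [12/May/2000:18:42:01 -0400] "GET /other/acm/x.html HTTP/1.0" 200 99',
    '10.0.0.3 - - [12/May/2000:18:42:02 -0400] "GET /acm/missing.html HTTP/1.0" 404 0',
);

my $dir = 'acm';

# Keep only the lines whose request path starts with "/$dir/".
# \Q...\E protects any regex metacharacters that $dir might contain.
my @matches = grep {
    my $path = (split ' ', $_)[6];
    defined $path && $path =~ m{^/\Q$dir\E/};
} @lines;

print "$_\n" for @matches;
```

Slurping trades memory for convenience, so for a very large log the line-by-line while loop is probably the safer choice.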
Re: grepping for
by mikfire (Deacon) on May 12, 2000 at 18:48 UTC
    Try using m#$dir# instead of /$dir/. Do not hold me to this, but I believe the slashes in the regex are confusing perl.

    Perl first does double-quotish expansion ( how does this phrase seem to appear in all my posts? ) before passing anything to the regex engine. In this case, the regex engine is seeing something like /^\//acm\//. Which makes me think it is returning any line beginning with a '/'.

    You could also try using the quotemeta modifiers \Q and \E, which cause perl to protect all special characters with backslashes. To do this, your regex would look like /\Q$dir\E/. I had mixed luck with the quotemeta stuff a long time ago. It may work better now, or your mileage may vary.
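One way to settle questions like this is to try each form on a known string. The sketch below is only an illustration, assuming $dir holds the bare directory name "acm" and using two made-up paths. It compares escaped slashes, an alternate delimiter, and \Q...\E.

```perl
use strict;
use warnings;

my $dir  = 'acm';
my $path = '/acm/index.html';

# Three ways to anchor on "/acm/" at the start of the path.
my $escaped   = $path =~ /^\/$dir\//    ? 1 : 0;  # escaped slashes
my $alt_delim = $path =~ m#^/$dir/#     ? 1 : 0;  # different delimiter
my $quoted    = $path =~ m#^/\Q$dir\E/# ? 1 : 0;  # metacharacters in $dir escaped

# A path that only contains "/acm/" further along should not match.
my $nested = '/other/acm/x.html' =~ m#^/\Q$dir\E/# ? 1 : 0;

print "escaped=$escaped alt_delim=$alt_delim quoted=$quoted nested=$nested\n";
```

Perl locates the closing delimiter before it interpolates variables, so a slash that arrives via $dir does not end the pattern early.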

    Mik
    Mik Firestone ( perlus bigotus maximus )

RE: grepping for
by Maqs (Deacon) on May 12, 2000 at 19:11 UTC
    Why not use simply
    if ($url =~ /\/$dir\//) {
        # ... some stuff ...
    }

    IMHO, if you have a long path, your "^" in regexp might not work properly
    --
    With best regards
    Maqs.
Re: grepping for
by jjhorner (Hermit) on May 12, 2000 at 19:50 UTC
    Well, I got it working, although it is amazingly slow, using if (/^\/$dir\// && /$error/) {..do something..}.
    #!/usr/bin/perl -w
    #
    # error-report.pl
    #
    # usage:
    #   error-report.pl <dir> <error>
    # where:
    #   <dir> is the directory of the application
    #   <error> is the number of the error code
    #
    # requires a filtered copy of a log file to exist.
    #
    # v0.1, jh8@ornl.gov, 5/12/2000
    #
    use strict;

    # my $file = shift;
    my $dir = shift;
    my $error = shift;
    $error = " ".$error." ";

    open (LOG, "/usr/local/apache/logs/access_log") || die "Can't open log file: $!";
    my (@entries, @log, %report, @list);
    while (<LOG>) {
        my $url = (split())[6];
        if ($url =~ /^\/$dir\// && /$error/) {
            if (exists $report{$url}) {
                $report{$url}++;
            } else {
                $report{$url} = 1;
            }
        }
    }
    close LOG;
    @list = sort {$report{$b} <=> $report{$a}} keys %report;
    foreach (@list) {
        print "$_: $report{$_}\n";
    }
    I would like to find a way to do it quicker, possibly using "grep" and slurping the file into memory, but since the file is over 200MB, I'm not too optimistic. Thanks for your help.
    Linux, Perl, Apache, Stronghold, Unix
    jhorner@knoxlug.org http://www.knoxlug.org
      Instead of your current while loop try this one which should be somewhat faster. I replaced your first regular expressions with a "substr eq" combination.
      my $dir = '/acm/';
      my $strlen = length($dir);
      while (<LOG>) {
          my $url = (split())[6];
          if ((substr($url, 0, $strlen) eq $dir) && ($url =~ /$error/)) {
              if (exists $report{$url}) {
                  $report{$url}++;
              } else {
                  $report{$url} = 1;
              }
          }
      }
      I benchmarked this using some sample data I made up and it is about 35% faster than the original version. Your performance may vary.

      Regular expressions are slower than straight string operations. So if you can accomplish what you want with string operations and performance matters, you can tune your code by replacing some regular expressions with string operations (only feasible for simple regular expressions).
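A claim like this is easy to check on your own data with the core Benchmark module. The following is a rough sketch, not the benchmark used above: the @urls list is made up, and the sub names count_regex and count_substr are hypothetical.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $dir = '/acm/';
my $len = length $dir;

# Made-up paths: one in three matches the prefix.
my @urls = ('/acm/index.html', '/images/logo.gif', '/docs/acm/old.html') x 100;

# Count prefix matches with an anchored regular expression.
sub count_regex {
    my $n = 0;
    for my $url (@urls) { $n++ if $url =~ /^\Q$dir\E/ }
    return $n;
}

# Count prefix matches with a plain string comparison.
sub count_substr {
    my $n = 0;
    for my $url (@urls) { $n++ if substr($url, 0, $len) eq $dir }
    return $n;
}

# Both versions must agree before their timings mean anything.
die "results differ" unless count_regex() == count_substr();

# Run each for at least one CPU second and print a comparison table.
cmpthese(-1, { regex => \&count_regex, substr => \&count_substr });
```

The numbers depend heavily on the Perl version and on how often the prefix matches, so treat any single result as indicative only.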

        Your code looks good, but in order for me to get the right error codes, the error code needs to be checked against the entire line (or against $_[-1]), not against $url. Thanks again, JJ
      index(), in turn, may be a little faster than substr(). And replacing the regex on the error code should help, too. [code removed to protect the innocent]

      Update: index() is faster when the match occurs at the beginning of the string, but substr() is much better when there is no match at all, which happens quite a lot in this application.

      $dir = "/$dir/";
      my $strlen = length $dir;
      while (<LOG>) {
          my ($url, $code) = (split)[6,8];
          if (substr($url, 0, $strlen) eq $dir && $code == $error) {
              $report{$url}++;
          }
      }
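To make the index() versus substr() trade-off concrete, here is a small sketch of both prefix tests side by side. The helper names starts_with_substr and starts_with_index are hypothetical, and the three paths are made up.

```perl
use strict;
use warnings;

my $dir = '/acm/';
my $len = length $dir;

# Prefix test with substr: looks only at the first $len characters.
sub starts_with_substr { return substr($_[0], 0, $len) eq $dir ? 1 : 0 }

# Prefix test with index: a result of 0 means $dir was found at the start.
# When there is no match, index has to search the rest of the string
# before it can return -1.
sub starts_with_index { return index($_[0], $dir) == 0 ? 1 : 0 }

for my $url ('/acm/index.html', '/docs/acm/old.html', '/images/logo.gif') {
    printf "%-20s substr=%d index=%d\n",
        $url, starts_with_substr($url), starts_with_index($url);
}
```

Both tests give the same answer for every string. They differ only in how much of a non-matching string gets examined.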
      Put the /o modifier on the end of those regexes. Since $dir and $error don't appear to change at all during the loop, you can optimize the regexp by only building it once. Could save some time.
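A related technique, swapped in here for the /o flag itself, is to precompile the patterns with qr// before the loop. This is only a sketch under assumptions: the sample lines are invented, the names $dir_re and $error_re are hypothetical, and the error pattern is matched against the whole line with spaces around the status code.

```perl
use strict;
use warnings;

my $dir   = 'acm';
my $error = 404;

# Compile each pattern once, outside the loop.
my $dir_re   = qr{^/\Q$dir\E/};
my $error_re = qr{ \Q$error\E };   # status code surrounded by literal spaces

# Hypothetical sample lines in Apache common log format.
my @lines = (
    '10.0.0.1 - - [12/May/2000:18:42:00 -0400] "GET /acm/index.html HTTP/1.0" 200 1234',
    '10.0.0.3 - - [12/May/2000:18:42:02 -0400] "GET /acm/missing.html HTTP/1.0" 404 0',
    '10.0.0.4 - - [12/May/2000:18:42:03 -0400] "GET /other/gone.html HTTP/1.0" 404 0',
);

my %report;
for my $line (@lines) {
    my $url = (split ' ', $line)[6];
    next unless defined $url;
    # Path is tested against the URL field, error code against the whole line.
    $report{$url}++ if $url =~ $dir_re && $line =~ $error_re;
}

print "$_: $report{$_}\n" for sort keys %report;
```

Unlike /o, a qr// object can be rebuilt whenever $dir or $error does need to change.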
RE: grepping for
by turnstep (Parson) on May 12, 2000 at 19:22 UTC
    If it is a standard log file, it will not have host and protocol information. Try something like this:
    $dir = "/acm/"; ## avoid worrying about escaping slashes
    if (m/($dir.*) /) {
        print "The path is $1!\n";
    }
    Your first method should work, by the way. Perhaps give us an example line from the log file and what result you get?