Shamaeso has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to take a log file and match it against a file of spiders/searchbots to weed them out.

I want to delete the matching strings before writing out to a file. I tried using index and now grep. I am pretty new to this type of use for Perl. I normally just do CGI stuff. Thanks in advance for your time.

Here is my code:

#!perl
use strict;
use Text::CSV_XS;
use warnings;
use FileHandle;

# declarations
my($log_file_path)="\\\\dt00mx84\\LogArchive\\www.ksdot.org\\dt00mh77\\";
my($robot_file)="\\\\dt00mx84\\LogArchive\\Webrobots.txt";
my($second, $minute, $hour, $dayOfMonth, $month, $yearOffset,
   $dayOfWeek, $dayOfYear, $daylightSavings) = localtime();
my$x="ex";
my($extension)=".log";
my($year)=1900+$yearOffset;
my($month_new)=1+$month;
my $day=$dayOfMonth-1;
my @logmessages;

# build filename format
# this is how log files are named using the yy/mm/dd format ex040101.log
if (length($month_new)< 2) { $month_new="0".$month_new };
if (length($dayOfMonth)< 2) { $dayOfMonth="0".$dayOfMonth };

# build input file name and file path to read from
my($filename)=$x.substr($year,2).$month_new.$day;
my($log_file)=$log_file_path.$filename.$extension;

# build output file name and path to write file
my($file_name)=substr($year,2).$month_new.$day;
my($out_file)=$log_file_path.$file_name.$extension;

# Declare the FileHandles and open the input and output files
my $fh= new FileHandle;
open(LOG, "<$log_file") or die "Could not open file";
@logmessages=<LOG>;
close(LOG);
open(ROBOT, "<$robot_file") or die "Could not open file:";
my @robots =<ROBOT>;
close(ROBOT);
my $outfile= new FileHandle;
open(OUTFILE, ">$out_file") or die "Could not open file";
#if (index(@logmessages,@robots)==-1){
if (grep(/@logmessages/, @robots)==0){
    print OUTFILE @logmessages;}
close(OUTFILE);

Replies are listed 'Best First'.
Re: Deleting a matching string in an array
by moritz (Cardinal) on Oct 30, 2007 at 15:08 UTC
    You can't match an array, as you try with /@logmessages/.

    I think what you want is this:

    # if you want @robots to be interpreted as
    # regexes, delete the map { ... }
    my $regex = join '|', map { quotemeta $_ } @robots;
    for (grep { ! /$regex/ } @logmessages){
        print OUTFILE $_;
    }

    Note that grep doesn't modify its arguments (normally...), so you have to continue working with the result from grep.

    (Update: added the ! in grep, ... ;-)
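    A minimal, self-contained sketch of the same idea, assuming the robot names are plain strings rather than regexes. The sample data is made up for illustration; in the real script both arrays would come from the files. It also chomps @robots first, since lines read from a file keep their newline.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the two files in the thread.
my @robots      = ("Googlebot\n", "Slurp\n", "msnbot.htm\n");
my @logmessages = (
    "10.0.0.1 GET /index.html Mozilla/4.0\n",
    "10.0.0.2 GET /index.html Googlebot/2.1\n",
    "10.0.0.3 GET /about.html Yahoo! Slurp\n",
    "10.0.0.4 GET /msnbotXhtm Mozilla/5.0\n",
);

# Lines read from a file keep their trailing newline; strip it first,
# otherwise a name would only match when a newline follows it directly.
chomp @robots;

# quotemeta makes every robot name match literally ('.' stays a dot).
my $alternation = join '|', map { quotemeta $_ } @robots;
my $regex       = qr/$alternation/;

# grep returns the lines that do NOT match; @logmessages is untouched.
my @kept = grep { !/$regex/ } @logmessages;

print @kept;
```

    Here only the first and last sample lines survive: the last one contains "msnbotXhtm", which the quoted pattern "msnbot\.htm" does not match.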

      A little piece of code to illustrate naikonta's point.

      $ perl -le '
      -> @arr = qw{ abC def gHi JKl };
      -> print for grep { s{[A-Z]}{%}g } @arr;
      -> print q{-} x 25;
      -> print qq{@arr};'
      ab%
      g%i
      %%l
      -------------------------
      ab% def g%i %%l
      $

      Cheers,

      JohnGG

      "Note that grep doesn't modify its arguments (normally...)"
      Are you against "Note that $_ is an alias to the list value, so it can be used to modify the elements of the LIST", or do you mean something else?

      Open source softwares? Share and enjoy. Make profit from them if you can. Yet, share and enjoy!

        Shamaeso seemed to use grep as if he expected it to modify the array, although he didn't modify $_ in grep's block.

        So I thought it was noteworthy that grep doesn't do in-place grepping.
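        To make that concrete, here is a small sketch with made-up data: calling grep leaves the array alone, and the usual way to "delete" matching elements is to assign the result back.

```perl
use strict;
use warnings;

# Hypothetical sample data.
my @lines = ('keep me', 'Googlebot visit', 'keep me too');

# grep only returns a filtered list; on its own it leaves @lines as it was.
my @filtered = grep { !/Googlebot/ } @lines;
print scalar(@lines), "\n";     # still 3 elements

# To "delete" the matching entries, assign the result back to the array.
@lines = grep { !/Googlebot/ } @lines;
print scalar(@lines), "\n";     # now 2 elements
```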

      I appreciate the responses and thanks for giving me an example to work through.
      I had a similar solution, but yours is far more elegant. The regex assignment is very nice.
      -- I used to drive a Heisenbergmobile, but every time I looked at the speedometer, I got lost.
Re: Deleting a matching string in an array
by thezip (Vicar) on Oct 30, 2007 at 17:35 UTC

    If I may, I have a couple of other constructive points to mention that don't really pertain to the question at hand, but rather to coding style.


    There are escaped backslashes in the declarations section, and this is not necessary.

    my($log_file_path)="\\\\dt00mx84\\LogArchive\\www.ksdot.org\\dt00mh77\\";

    can be rewritten as:

    my($log_file_path)="/dt00mx84/LogArchive/www.ksdot.org/dt00mh77/";

    Also, it is preferable to use single quotes instead of double quotes, but only when you don't need to interpolate any variables within the string.

    Examples:
    # This:
    # my $x="ex";

    # Becomes:
    my $x= 'ex';    # ... since there are no variables to interpolate

    # This is wrong:
    # my $foo = '/foo/$bar/baz';  # $bar is a variable and does not get interpolated
    #                             # instead, it's literally the string '$bar'

    # It must be written with the double-quotes to interpolate:
    # my $foo = "/foo/$bar/baz";

    Where do you want *them* to go today?

      Thanks, I did the double slashes since I am using a unc path and did it the way I learned like 9 years ago in a PERL class.
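      For reference, a short sketch of how the quoting rules apply to a UNC path like the one in the script. The forward-slash variant at the end rests on the assumption that the Windows file APIs accept forward slashes, which is generally the case but worth testing against the actual share.

```perl
use strict;
use warnings;

# In double quotes every literal backslash must be doubled.
my $double = "\\\\dt00mx84\\LogArchive\\Webrobots.txt";

# In single quotes a backslash is literal unless it precedes another
# backslash or the closing quote, so only the leading pair needs doubling.
my $single = '\\\\dt00mx84\LogArchive\Webrobots.txt';

print "$double\n";
print "$single\n";
print $double eq $single ? "same string\n" : "different strings\n";

# Assumption: Perl on Windows generally also accepts forward slashes,
# in which case a UNC path keeps its two leading slashes.
my $forward = '//dt00mx84/LogArchive/Webrobots.txt';
print "$forward\n";
```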

      I got the code to remove the unwanted lines. It is not elegant, but I wanted to post it for more comments and advice on how to improve it.

      I did try a switch statement but could not get it to work, so I just built the ugly if statement.

      Thanks again for the help and direction.


      #!perl
      #use strict;
      use Text::CSV_XS;
      use warnings;
      use FileHandle;

      # declarations
      my($log_file_path)="\\\\dt00mx84\\LogArchive\\www.ksdot.org\\dt00mh77\\";
      my($robot_file)="\\\\dt00mx84\\LogArchive\\Webrobots.txt";
      my($second, $minute, $hour, $dayOfMonth, $month, $yearOffset,
         $dayOfWeek, $dayOfYear, $daylightSavings) = localtime();
      my$x="ex";
      my($extension)=".log";
      my($year)=1900+$yearOffset;
      my($month_new)=1+$month;
      my $day=$dayOfMonth-1;
      my @logmessages;

      # build filename format
      # this is how log files are named using the yy/mm/dd format ex040101.log
      if (length($month_new)< 2) { $month_new="0".$month_new };
      if (length($dayOfMonth)< 2) { $dayOfMonth="0".$dayOfMonth };

      # build input file name and file path to read from
      my($filename)=$x.substr($year,2).$month_new.$day;
      my($log_file)=$log_file_path.$filename.$extension;

      # build output file name and path to write file
      my($file_name)=substr($year,2).$month_new.$day;
      my($out_file)=$log_file_path.$file_name.$extension;

      # Declare the FileHandles and open the input and output files
      my $fh= new FileHandle;
      open(LOG, "<$log_file") or die "Could not open file";
      @logmessages=<LOG>;
      close(LOG);
      my $outfile= new FileHandle;
      open(OUTFILE, ">$out_file") or die "Could not open file";

      foreach $LogLine (@logmessages){
          if(($LogLine!~/Slurp/) && ($LogLine!~/Jeeves/)&&($LogLine!~/Googlebot/)
             &&($LogLine!~/FunWebProducts/)&&($LogLine!~/msnbot.htm/)
             &&($LogLine!~/PeoplePal/)&&($LogLine!~/ventura5/)&&($LogLine!~/Speedy/)
             &&($LogLine!~/GovDelivery/)&&($LogLine !~ /gif/)&&($LogLine !~ /jpg/)
             &&($LogLine !~ /ico/)&&($LogLine !~ /css/)&&($LogLine !~ /js/)
             &&($LogLine !~ /archive/)&&($LogLine !~ /CazoodleBot/)
             &&($LogLine !~ /WebTrends/)&&($LogLine !~ /ShopWiki/)
             &&($LogLine !~ /Ultraseek/)&&($LogLine !~ /msrbot/)
             &&($LogLine !~ /Moskow/)&&($LogLine !~ /Gigabot/)) {
              print OUTFILE $LogLine;
          }
      }
        I wanted to post it for more comments and advice on how to improve ;)

        Well, I can see several areas for improvement, notably the huge if block at the end.

        (BTW, why do you declare a variable $robot_file that you never subsequently use? And why have you commented out use strict; at the beginning of your program?)

        Immediately, however, one thing that springs to mind is this line:

        my $day=$dayOfMonth-1;

        It seems to me, in the context, that you're trying here to use this to get the date of the previous day. If I'm wrong, please excuse me, but if not, what do you think will happen, for example on November 1st (not to mention Jan 1st, or March 1st, when the previous day might be either Feb 28 or Feb 29)?

        In fact, 'How do I find yesterday's date?' is a FAQ. With Perl (not PERL, by the way), there are modules out there that deal with heaps of similar problems, and save you from reinventing the proverbial wheel :).

        Here's one way, based on the answers to the faq, to create the sort of string that you seem to require, using the DateTime module (Don't hesitate, BTW, to use whitespace in your code, to make it more human-legible):

        use DateTime;

        my $yesterday = DateTime->today->subtract( days => 1 )->ymd( '' );
        my ( $prefix, $extension ) = ( 'ex', '.log' );
        my $log_file = $prefix . substr( $yesterday, 2 ) . $extension;
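        If installing DateTime is not an option, something similar can be sketched with the core POSIX module. This is an alternative technique, not the FAQ's answer verbatim: it relies on mktime normalising an out-of-range day of the month, and the helper name yymmdd_yesterday is made up for the example.

```perl
use strict;
use warnings;
use POSIX qw(mktime strftime);

# Return the YYMMDD string for the day before the given epoch time.
# yymmdd_yesterday is a hypothetical helper name, not from the thread.
sub yymmdd_yesterday {
    my ($epoch) = @_;
    my ($mday, $mon, $year) = (localtime $epoch)[3, 4, 5];

    # mktime normalises out-of-range fields, so day 0 of a month becomes
    # the last day of the previous month. Noon avoids DST edge cases.
    my $yesterday = mktime(0, 0, 12, $mday - 1, $mon, $year);

    return strftime('%y%m%d', localtime($yesterday));
}

# Nov 1st 2007 at noon local time (months are 0-based, years since 1900).
my $nov_first = mktime(0, 0, 12, 1, 10, 107);
print 'ex', yymmdd_yesterday($nov_first), ".log\n";
```

        For November 1st this yields October 31st rather than a day numbered 0, which is the case the plain $dayOfMonth-1 gets wrong.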
        Why not use grep as Moritz pointed out? If you loaded the robot file into the @robots array, you would need to remove the newlines, e.g. with chomp(@robots), before you used it in the grep. Your date string could be built like this:

        my ($day, $month, $year) = (localtime)[3..5];

        # $date is YYMMDD format - you may want $day - 1?
        my $date = sprintf "%02d%02d%02d", $year % 100, $month + 1, $day;

        my $log_file_path = '/dt00mx84/LogArchive/www.ksdot.org/dt00mh77/';
        my $log_file = $log_file_path . 'ex' . $date. '.log';
        my $out_file = $log_file_path . $date. '.log';

        Chris

        Update: Not_a_Number nailed it. Didn't think about day-1 being yesterday.

Re: Deleting a matching string in an array
by Jenda (Abbot) on Oct 31, 2007 at 21:15 UTC

    You keep on declaring and initializing lexical filehandles, but never use them!

    my $fh= new FileHandle;
    open(LOG, "<$log_file") or die "Could not open file";

    You should decide: either the old-style filehandles or the lexical ones. Also, there is no need to pre-create the filehandles with new FileHandle; just use

    open(my $LOG, "<$log_file") or die "Could not open file";
    @logmessages=<$LOG>;
    close($LOG);
    and if you do not use a perl that's more than 5 (or something like that) years old you will be just fine.
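
    A runnable sketch of the lexical-filehandle style, for illustration only. It goes a little beyond the snippet above by using the three-argument form of open and putting $! in the error message. The temporary file is a stand-in for the real log so the example can run anywhere.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical stand-in for the real log: a temporary file with two lines.
my ($tmp_fh, $log_file) = tempfile(UNLINK => 1);
print {$tmp_fh} "first line\n", "second line\n";
close $tmp_fh or die "Could not close $log_file: $!";

# Three-argument open with a lexical handle; $! says why a failure happened.
open(my $LOG, '<', $log_file) or die "Could not open $log_file: $!";
my @logmessages = <$LOG>;
close($LOG);

print scalar(@logmessages), " lines read\n";
```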