Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I need to search my web pages and find all pages that do NOT have two specific phrases:
'alpha beta' and 'charlie'
If the page doesn't have 'alpha beta' AND 'charlie', then add it to the counter. Here is my attempt; it is not working because I am not fetching the correct pages:
use strict;
use File::Find;

my $dir = '\webdirectory';
my ($line, $name);
my $nameOne = 'alpha beta';
my $nameTwo = 'charlie';
my $ct = 0;

sub mySub {
    $name = $File::Find::name;
    open ( DAT, $name ) || warn "Can't open File $name: $!\n";
    while ($line = <DAT>) {
        if ($line != /$nameOne/i) {
            if ($line != /$nameTwo/i) {
                print "$name\n";
                $ct++;
                last;
            }
        }
    }
    close DAT;
}

find( \&mySub, $dir );
print "Total pages without the 'alpha beta' AND 'charlie' = $ct\n";

Replies are listed 'Best First'.
Re: Finding pages without specific words
by Jaap (Curate) on Mar 08, 2004 at 12:52 UTC
    You are checking the phrases on a line-by-line basis. Try File::Slurp like so:
    use File::Slurp;
    my $text = read_file( 'filename' );
    and use something like:
    if (($text !~ m/$nameOne/) && ($text !~ m/$nameTwo/)) {
        print "$name\n";
        $ct++;
    }
      You are right.

      It could be a good idea to quote the search strings, in case you ever put something into them with a special meaning in regular expressions.

      just use

      if ($text !~ m/\Q$nameOne\E/ and $text !~ m/\Q$nameTwo\E/) {
          print "$name\n";
          ++$ct;
      }


      (I know, you don't really need the \E, but IMHO it's nicer with them.)
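      For instance, a quick standalone illustration of what \Q buys you (the phrase and text here are made up):

```perl
use strict;
use warnings;

# Hypothetical phrase containing regex metacharacters (+ and parentheses)
my $phrase = 'C++ (beta)';
my $text   = 'testing C++ (beta) support';

# With \Q...\E the metacharacters are escaped, so the phrase is matched
# literally; without it, 'C++' is not even a valid pattern (nested quantifier).
if ($text =~ /\Q$phrase\E/) {
    print "found\n";    # prints "found"
}
```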
        If speed is important, then a more complex single pre-evaluated regex may be faster for you. But it wasn't on this machine:
        my $regex = qr{(?:(?:\Q$nameOne\E).*(?:\Q$nameTwo\E)|(?:\Q$nameTwo\E).*(?:\Q$nameOne\E))};
        if ($text !~ $regex) {
            print "$name\n";
            ++$ct;
        }
        An example benchmark script (ugly, but it compares the two methods):
        use Benchmark;

        $n1 = 'h';
        $n2 = 'e';
        $r  = qr{(?:(?:\Q$n1\E).*(?:\Q$n2\E)|(?:\Q$n2\E).*(?:\Q$n1\E))};
        @i  = qw(hello how are you doing are you going to exit now? each time?);

        timethese(10000, {
            var => sub {
                for $w (@i) { $ct1++ if ($w !~ $r); }
            },
            and => sub {
                for $w (@i) { $ct2++ if ($w !~ m/\Q$n1\E/ or $w !~ m/\Q$n2\E/); }
            },
        });
        print "$ct1, $ct2\n";
        Benchmark: timing 10000 iterations of and, var...
        and: 2 wallclock secs ( 0.73 usr + 0.00 sys = 0.73 CPU) @ 13698.63/s (n=10000)
        var: 1 wallclock secs ( 1.08 usr + 0.00 sys = 1.08 CPU) @ 9259.26/s (n=10000)
        110000, 110000
        Of course this benchmark figure is not based on real data, so it is better to run it on the actual data to get a real indication.
        Hope it helps
        UnderMine
Re: Finding pages without specific words
by grinder (Bishop) on Mar 08, 2004 at 13:32 UTC

    The slurp technique shown is okay if you have small files. I know I have servers where slurping the biggest disk file would make a serious dent on the RAM, enough to start it swapping into oblivion.

    The other thing I'm not sure of is whether these patterns occur on the same line or on different lines. The former case is easy to solve; the latter involves checking whether you've seen one pattern after having seen the other:

    Assuming you have a variable named $lacking_both that keeps count, your file read loop would look like:

    my $saw_alpha   = 0;
    my $saw_charlie = 0;
    while( <DAT> ) {
        if( /alpha/ )   { ++$saw_alpha;   last if $saw_charlie; }
        if( /charlie/ ) { ++$saw_charlie; last if $saw_alpha; }
    }
    ++$lacking_both unless $saw_alpha and $saw_charlie;

    For a once-off this will do, but it might be worth generalising this to use a hash instead of discrete scalars if you need to look for an arbitrary number of patterns.
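    A sketch of that hash-based generalisation (my own illustration, not tested against real data; it uses an in-memory filehandle so it runs standalone, and the sample text and pattern list are made up):

```perl
use strict;
use warnings;

my @patterns = ('alpha beta', 'charlie');   # add as many as you need

# Demo input: an in-memory filehandle stands in for a real file.
my $content = "first line\nthis one has alpha beta\nlast line\n";
open my $fh, '<', \$content or die $!;

my %seen;
while ( my $line = <$fh> ) {
    for my $p (@patterns) {
        $seen{$p} = 1 if $line =~ /\Q$p\E/i;
    }
    last if keys %seen == @patterns;   # all patterns found: stop reading
}
close $fh;

my $lacking = ( keys %seen == @patterns ) ? 0 : 1;
print "files lacking a pattern: $lacking\n";   # prints 1: no 'charlie' here
```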

    I'd probably also be tempted to

    push @lacking_both, $name

    So that @lacking_both in a scalar context gives you the number of files lacking both patterns, but also the names of the files themselves, because you are probably going to need that information somewhere down the track anyway.

    update: another thing I just thought of: a big strike against slurping is that you might read the whole file only to find that you hit both patterns in the first 10 lines. In that case a short-circuiting last, as shown above, will be a big win.

Re: Finding pages without specific words
by Wonko the sane (Curate) on Mar 08, 2004 at 12:57 UTC
    Hello

    Your if tests are also using the wrong operator.

    if ( $line !~ /$nameOne/i )

    '!~' rather than '!='

    Running with Warnings enabled would have shown this error. :)
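    A standalone demonstration of both points (the sample line and phrase are made up): the buggy form matches against $_, not $line, and '!=' then compares $line to the match result numerically, which warnings flags at runtime.

```perl
use strict;
use warnings;

my $line    = 'this line has alpha beta in it';
my $nameOne = 'alpha beta';

# Collect warnings so we can show that the buggy form triggers them.
my @warned;
local $SIG{__WARN__} = sub { push @warned, $_[0] };

# Buggy: /$nameOne/i matches against $_ (undef here), and '!='
# compares $line to the match result *numerically*.
my $buggy = ($line != /$nameOne/i);

# Correct: '!~' tests whether $line itself fails to match.
my $right = ($line !~ /$nameOne/i);

print "buggy form raised ", scalar(@warned), " warning(s)\n";
print "correct form says the phrase is ",
      ($right ? "missing" : "present"), "\n";   # prints "present"
```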

    Wonko

Re: Finding pages without specific words
by Crian (Curate) on Mar 08, 2004 at 13:35 UTC
    I have played around with your program a little bit; here is my current version, perhaps it is useful for you:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Find;

    my $dir     = '\webdirectory';
    my $nameOne = 'alpha beta';
    my $nameTwo = 'charlie';
    my @pages;

    sub mySub {
        return if -d $File::Find::name;
        return if $File::Find::name !~ /\.html$/;
        my $text;
        open (IN, $File::Find::name)
            or die "Can't open '$File::Find::name': $!\n";
        {
            local $/;    # Slurp-Mode
            $text = <IN>;
        }
        close IN or die $!;
        if ($text !~ m/\Q$nameOne\E/ and $text !~ m/\Q$nameTwo\E/) {
            push @pages, $File::Find::name;
        }
    }

    find( \&mySub, $dir );
    print "$_\n" for @pages;


    I'm sorry for not using File::Slurp here as suggested, but it isn't installed here yet.
      Thanks to everyone! What is local $/ doing?
      open (IN, $File::Find::name)
          or die "Can't open '$File::Find::name': $!\n";
      {
          local $/;    # Slurp-Mode means what, and what is local $/ doing??
          $text = <IN>;
      }

        In perldoc perlvar you can find out more about $/ and "slurp mode".
        Look for "input_record_separator".
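        A small standalone illustration of what it does (an in-memory filehandle stands in for a real file):

```perl
use strict;
use warnings;

my $content = "line one\nline two\nline three\n";
open my $fh, '<', \$content or die $!;

my $text;
{
    local $/;          # undef $/ (the input record separator) in this block only
    $text = <$fh>;     # so a single read returns the whole file at once
}
close $fh;

my $newlines = () = $text =~ /\n/g;
print "slurped $newlines lines in one read\n";   # prints "slurped 3 lines in one read"
```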

        Sören

Re: Finding pages without specific words
by UnderMine (Friar) on Mar 08, 2004 at 14:13 UTC
    while ($line = <DAT>) {
        if ($line != /$nameOne/i) {
            if ($line != /$nameTwo/i) {
                print "$name\n";
                $ct++;
                last;
            }
        }
    }
    This makes the assumption that $nameOne and $nameTwo are on the same line. Is this what you really wanted?

    Thanks
    UnderMine

      No, I didn't want to always check if $nameOne and $nameTwo are on the same line.
Re: Finding pages without specific words
by Crian (Curate) on Mar 08, 2004 at 13:08 UTC
    Another point: if speed is important, don't put variables from outside into the regular expressions.
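    One way to follow that advice is to compile the patterns once with qr// outside the loop, so the regex engine does not have to re-examine the interpolated variables for every file. A standalone sketch (the sample strings are made up):

```perl
use strict;
use warnings;

my $nameOne = 'alpha beta';
my $nameTwo = 'charlie';

# Compile once, outside the loop / outside the File::Find callback.
my $reOne = qr/\Q$nameOne\E/i;
my $reTwo = qr/\Q$nameTwo\E/i;

for my $text ('has alpha beta and charlie', 'only charlie here') {
    print "missing a phrase: $text\n"
        if $text !~ $reOne or $text !~ $reTwo;
}
# prints: missing a phrase: only charlie here
```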