Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I need to search my web pages and find all pages that do NOT have two specific phrases:
'alpha beta' and 'charlie'
If the page doesn't have 'alpha beta' AND 'charlie', then add it to the counter. Here is my attempt; it is not working because I am not fetching the correct pages:
use strict;
use File::Find;

my $dir = '\webdirectory';
my ($line, $name);
my $nameOne = 'alpha beta';
my $nameTwo = 'charlie';
my $ct = 0;

sub mySub {
    $name = $File::Find::name;
    open ( DAT, $name ) || warn "Can't open File $name: $!\n";
    while ($line = <DAT>) {
        if ($line != /$nameOne/i) {
            if ($line != /$nameTwo/i) {
                print "$name\n";
                $ct++;
                last;
            }
        }
    }
    close DAT;
}

find( \&mySub, $dir );
print "Total pages without the 'alpha beta' AND 'charlie' = $ct\n";

Replies are listed 'Best First'.
Re: Finding pages without specific words
by Jaap (Curate) on Mar 08, 2004 at 12:52 UTC
    You are checking the phrases on a line-by-line basis. Try File::Slurp like so:
    use File::Slurp;
    my $text = read_file( 'filename' );
    and use something like:
    if (($text !~ m/$nameOne/) && ($text !~ m/$nameTwo/)) {
        print "$name\n";
        $ct++;
    }
      You are right.

      It could be a good idea to quote the search strings, in case you ever put something into them with a special meaning in regular expressions.

      just use

      if ($text !~ m/\Q$nameOne\E/ and $text !~ m/\Q$nameTwo\E/) {
          print "$name\n";
          ++$ct;
      }


      (I know, you don't really need the \E, but IMHO it's nicer with them.)
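      For instance, a quick standalone illustration of what \Q buys you (the phrase and text here are made up):

```perl
use strict;
use warnings;

# Hypothetical phrase containing regex metacharacters (+ and parentheses)
my $phrase = 'C++ (beta)';
my $text   = 'testing C++ (beta) support';

# With \Q...\E the metacharacters are escaped, so the phrase is matched
# literally; without it, 'C++' is not even a valid pattern (nested quantifier).
if ($text =~ /\Q$phrase\E/) {
    print "found\n";    # prints "found"
}
```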
        If speed is important, then a more complex single pre-evaluated regex may be faster for you. But it wasn't on this machine:
        my $regex = qr{(?:(?:\Q$nameOne\E).*(?:\Q$nameTwo\E)|(?:\Q$nameTwo\E).*(?:\Q$nameOne\E))};
        if ($text !~ $regex) {
            print "$name\n";
            ++$ct;
        }
        An example benchmark script (ugly, but it compares the two methods):
        use Benchmark;

        $n1 = 'h';
        $n2 = 'e';
        $r  = qr{(?:(?:\Q$n1\E).*(?:\Q$n2\E)|(?:\Q$n2\E).*(?:\Q$n1\E))};
        @i  = qw(hello how are you doing are you going to exit now? each time?);

        timethese(10000, {
            var => sub {
                for $w (@i) { $ct1++ if ($w !~ $r); }
            },
            and => sub {
                for $w (@i) { $ct2++ if ($w !~ m/\Q$n1\E/ or $w !~ m/\Q$n2\E/); }
            },
        });
        print "$ct1, $ct2\n";
        Benchmark: timing 10000 iterations of and, var...
        and: 2 wallclock secs ( 0.73 usr + 0.00 sys = 0.73 CPU) @ 13698.63/s (n=10000)
        var: 1 wallclock secs ( 1.08 usr + 0.00 sys = 1.08 CPU) @ 9259.26/s (n=10000)
        110000, 110000
        Of course this benchmark figure is not based on real data, so it is better to run it on the actual data to get a real indication.
        Hope it helps
        UnderMine
Re: Finding pages without specific words
by grinder (Bishop) on Mar 08, 2004 at 13:32 UTC

    The slurp technique shown is okay if you have small files. I know I have servers where slurping the biggest disk file would make a serious dent on the RAM, enough to start it swapping into oblivion.

    The other thing I'm not sure of is whether these patterns occur on the same line or on different lines. The former case is easy to solve; the latter involves checking whether you've seen one pattern after having seen the other:

    Assuming you have a variable named $lacking_both that keeps count, your file read loop would look like:

    my $saw_alpha   = 0;
    my $saw_charlie = 0;
    while( <DAT> ) {
        if( /alpha/ )   { ++$saw_alpha;   last if $saw_charlie; }
        if( /charlie/ ) { ++$saw_charlie; last if $saw_alpha; }
    }
    ++$lacking_both unless $saw_alpha and $saw_charlie;

    For a once-off this will do, but it might be worth generalising this to use a hash instead of discrete scalars if you need to look for an arbitrary number of patterns.
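    A sketch of that hash-based generalisation (my own illustration, not tested against real data; it uses an in-memory filehandle so it runs standalone, and the sample text and pattern list are made up):

```perl
use strict;
use warnings;

my @patterns = ('alpha beta', 'charlie');   # add as many as you need

# Demo input: an in-memory filehandle stands in for a real file.
my $content = "first line\nthis one has alpha beta\nlast line\n";
open my $fh, '<', \$content or die $!;

my %seen;
while ( my $line = <$fh> ) {
    for my $p (@patterns) {
        $seen{$p} = 1 if $line =~ /\Q$p\E/i;
    }
    last if keys %seen == @patterns;   # all patterns found: stop reading
}
close $fh;

my $lacking = ( keys %seen == @patterns ) ? 0 : 1;
print "files lacking a pattern: $lacking\n";   # prints 1: no 'charlie' here
```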

    I'd probably also be tempted to

    push @lacking_both, $name

    So that @lacking_both in a scalar context gives you the number of files lacking both patterns, but also the names of the files themselves, because you are probably going to need that information somewhere down the track anyway.

    update: another thing I just thought of: a big strike against slurping is that you might read the whole file only to find that you hit both patterns in the first 10 lines. In that case a short-circuiting last, as shown above, will be a big win.

Re: Finding pages without specific words
by Wonko the sane (Curate) on Mar 08, 2004 at 12:57 UTC
    Hello

    Your if tests are also using the wrong operator.

    if ( $line !~ /$nameOne/i )

    '!~' rather than '!='

    Running with Warnings enabled would have shown this error. :)
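    A standalone demonstration of both points (the sample line and phrase are made up): the buggy form matches against $_, not $line, and '!=' then compares $line to the match result numerically, which warnings flags at runtime.

```perl
use strict;
use warnings;

my $line    = 'this line has alpha beta in it';
my $nameOne = 'alpha beta';

# Collect warnings so we can show that the buggy form triggers them.
my @warned;
local $SIG{__WARN__} = sub { push @warned, $_[0] };

# Buggy: /$nameOne/i matches against $_ (undef here), and '!='
# compares $line to the match result *numerically*.
my $buggy = ($line != /$nameOne/i);

# Correct: '!~' tests whether $line itself fails to match.
my $right = ($line !~ /$nameOne/i);

print "buggy form raised ", scalar(@warned), " warning(s)\n";
print "correct form says the phrase is ",
      ($right ? "missing" : "present"), "\n";   # prints "present"
```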

    Wonko

Re: Finding pages without specific words
by Crian (Curate) on Mar 08, 2004 at 13:35 UTC
    I have played around with your program a little bit; here is my current version, perhaps it is useful for you:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Find;

    my $dir     = '\webdirectory';
    my $nameOne = 'alpha beta';
    my $nameTwo = 'charlie';
    my @pages;

    sub mySub {
        return if -d $File::Find::name;
        return if $File::Find::name !~ /\.html$/;
        my $text;
        open (IN, $File::Find::name)
            or die "Can't open '$File::Find::name': $!\n";
        {
            local $/;    # Slurp-Mode
            $text = <IN>;
        }
        close IN or die $!;
        if ($text !~ m/\Q$nameOne\E/ and $text !~ m/\Q$nameTwo\E/) {
            push @pages, $File::Find::name;
        }
    }

    find( \&mySub, $dir );
    print "$_\n" for @pages;


    I'm sorry for not using File::Slurp here as suggested, but it isn't installed here yet.
      Thanks to everyone! What is local $/ doing?
      open (IN, $File::Find::name)
          or die "Can't open '$File::Find::name': $!\n";
      {
          local $/;    # Slurp-Mode means what, and what is local $/ doing??
          $text = <IN>;
      }

        In perldoc perlvar you can find out more about $/ and "slurp mode".
        Look for "input_record_separator".
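        A small standalone illustration of what it does (an in-memory filehandle stands in for a real file):

```perl
use strict;
use warnings;

my $content = "line one\nline two\nline three\n";
open my $fh, '<', \$content or die $!;

my $text;
{
    local $/;          # undef $/ (the input record separator) in this block only
    $text = <$fh>;     # so a single read returns the whole file at once
}
close $fh;

my $newlines = () = $text =~ /\n/g;
print "slurped $newlines lines in one read\n";   # prints "slurped 3 lines in one read"
```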

        Sören

Re: Finding pages without specific words
by UnderMine (Friar) on Mar 08, 2004 at 14:13 UTC
    while ($line = <DAT>) {
        if ($line != /$nameOne/i) {
            if ($line != /$nameTwo/i) {
                print "$name\n";
                $ct++;
                last;
            }
        }
    }
    This makes the assumption that $nameOne and $nameTwo are on the same line. Is this what you really wanted?

    Thanks
    UnderMine

      No, I didn't want to always check if $nameOne and $nameTwo are on the same line.
Re: Finding pages without specific words
by Crian (Curate) on Mar 08, 2004 at 13:08 UTC
    Another point: if speed is important, don't put variables from outside into the regular expressions.
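    One way to follow that advice is to compile the patterns once with qr// outside the loop, so the regex engine does not have to re-examine the interpolated variables for every file. A standalone sketch (the sample strings are made up):

```perl
use strict;
use warnings;

my $nameOne = 'alpha beta';
my $nameTwo = 'charlie';

# Compile once, outside the loop / outside the File::Find callback.
my $reOne = qr/\Q$nameOne\E/i;
my $reTwo = qr/\Q$nameTwo\E/i;

for my $text ('has alpha beta and charlie', 'only charlie here') {
    print "missing a phrase: $text\n"
        if $text !~ $reOne or $text !~ $reTwo;
}
# prints: missing a phrase: only charlie here
```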