Re: Finding pages without specific words

You are checking for the phrases on a line-by-line basis. Try reading the whole file at once with File::Slurp, like so:

use File::Slurp;
my $text = read_file( 'filename' ) ;

and use something like:

if (($text !~ m/$nameOne/) && ($text !~ m/$nameTwo/))
{
  print "$name\n";
  $ct++;
}
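
If File::Slurp is not installed, the same whole-file check can be sketched with core Perl only. This is a minimal sketch, not the original poster's script: `lacks_both`, the temporary file, and the sample names are all made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: returns 1 when the file matches neither pattern, else 0.
sub lacks_both {
    my ($path, $name_one, $name_two) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my $text = do { local $/; <$fh> };    # $/ undefined: read the whole file in one go
    close $fh;
    $text = '' unless defined $text;
    return ($text !~ m/$name_one/ && $text !~ m/$name_two/) ? 1 : 0;
}

# Demo file (made up): the first line mentions Alice, the second mentions nobody.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
print {$tmp_fh} "Alice was here\nnothing to see\n";
close $tmp_fh;

print "Alice/Bob absent: ", lacks_both($tmp_path, 'Alice', 'Bob'), "\n";
print "Carol/Bob absent: ", lacks_both($tmp_path, 'Carol', 'Bob'), "\n";
```

Because the test runs once against the whole text, the second line of the demo file (which mentions neither name) cannot cause a false report on its own.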

Re: Re: Finding pages without specific words by Crian (Curate) on Mar 08, 2004 at 13:00 UTC
You are right. It could also be a good idea to quote the search strings, in case you ever put something into them that has a special meaning in regular expressions. Just use

if ($text !~ m/\Q$nameOne\E/ and $text !~ m/\Q$nameTwo\E/) {
  print "$name\n";
  ++$ct;
}

(I know you don't really need the \E, but IMHO it's nicer with them.)
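
To illustrate why the quoting matters, here is a small sketch with made-up data: the search string contains a dot, which an unquoted pattern treats as a wildcard.

```perl
use strict;
use warnings;

my $text = "Signed by Mrs X";    # made-up sample text
my $name = 'Mr. X';              # made-up search string containing a metacharacter

# Unquoted: the '.' matches any character, so it matches the "s" in "Mrs".
my $unquoted_hit = ($text =~ m/$name/)     ? 1 : 0;

# Quoted: \Q...\E escapes the dot, so only the literal string "Mr. X" matches.
my $quoted_hit   = ($text =~ m/\Q$name\E/) ? 1 : 0;

print "unquoted: $unquoted_hit, quoted: $quoted_hit\n";
```

The built-in `quotemeta` function does the same escaping if you would rather prepare the string once outside the pattern.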
Re: Re: Re: Finding pages without specific words by UnderMine (Friar) on Mar 08, 2004 at 13:48 UTC
If speed is important, then a more complex single precompiled regex may be faster for you. On this machine, though, it wasn't:

my $regex = qr{(?:(?:\Q$nameOne\E).*(?:\Q$nameTwo\E)|(?:\Q$nameTwo\E).*(?:\Q$nameOne\E))};
if ($text !~ $regex) {
  print "$name\n";
  ++$ct;
}

An example benchmark script (ugly, but it compares the two methods):

use Benchmark;
$n1 = 'h';
$n2 = 'e';
$r = qr{(?:(?:\Q$n1\E).*(?:\Q$n2\E)|(?:\Q$n2\E).*(?:\Q$n1\E))};
@i = qw(hello how are you doing are you going to exit now? each time?);
timethese(10000, {
  var => sub { for $w (@i) { $ct1++ if ($w !~ $r); } },
  and => sub { for $w (@i) { $ct2++ if ($w !~ m/\Q$n1\E/ or $w !~ m/\Q$n2\E/); } },
});
print "$ct1, $ct2\n";

Benchmark: timing 10000 iterations of and, var...
and: 2 wallclock secs ( 0.73 usr + 0.00 sys = 0.73 CPU) @ 13698.63/s (n=10000)
var: 1 wallclock secs ( 1.08 usr + 0.00 sys = 1.08 CPU) @ 9259.26/s (n=10000)
110000, 110000

Of course, this benchmark is not based on real data, so it is better to run it on the actual data to get a real indication.

Hope it helps
UnderMine
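
A note on what the two benchmarked conditions test: the single regex matches only when both names occur, so negating it is true whenever at least one name is missing. That corresponds to the `or` form used in the benchmark, whereas the `and` form from the earlier replies is true only when both names are missing. A small sketch with a made-up word list, in case it helps decide which one is wanted:

```perl
use strict;
use warnings;

my ($n1, $n2) = ('h', 'e');
# Matches only when both names occur, in either order.
my $both = qr{(?:\Q$n1\E.*\Q$n2\E|\Q$n2\E.*\Q$n1\E)};

my %count = (regex => 0, or => 0, and => 0);
for my $w (qw(hello how are you each)) {    # made-up sample words
    # Negated single regex: at least one name is missing.
    $count{regex}++ if ($w !~ $both);
    # Same meaning as the negated regex.
    $count{or}++    if ($w !~ m/\Q$n1\E/ or  $w !~ m/\Q$n2\E/);
    # Stricter: both names are missing.
    $count{and}++   if ($w !~ m/\Q$n1\E/ and $w !~ m/\Q$n2\E/);
}

print "regex: $count{regex}, or: $count{or}, and: $count{and}\n";
```

For slurped multi-line text the combined pattern would also need the `/s` modifier (for example `qr{...}s`), because `.` does not match a newline by default.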