Re: Comparing pattern
by bv (Friar) on Sep 18, 2009 at 14:44 UTC
Look at study, especially if you have a lot of patterns you are matching against. Also, use the 3-argument form of open whenever possible (though that won't speed up your program any.) If you plan to extend this to multiple files, you should precompile your regexps with qr//. Also, take a look at setting local $/ for file slurping, which will be faster than reading lines and joining.
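A minimal sketch of the last two suggestions, slurping with local $/ and precompiling with qr//. The in-memory filehandle, the helper names slurp and compile_patterns, and the sample patterns are assumptions made so the snippet is self-contained; a real script would open actual files.

```perl
use strict;
use warnings;

# Slurp an entire filehandle into one string by undefining $/ locally.
sub slurp {
    my ($fh) = @_;
    local $/;                 # undef record separator: read to end of file
    my $text = <$fh>;
    return $text;
}

# Compile each pattern once; the qr// objects can be reused for every file.
sub compile_patterns {
    my (@patterns) = @_;
    return map { qr/$_/is } @patterns;
}

# Illustrative data (assumed): an in-memory "file" and two sample patterns.
my $content = "first line\nsecond line with POKER\n";
open( my $fh, '<', \$content ) or die "in-memory open: $!\n";
my $text = slurp($fh);
close $fh;

my @compiled = compile_patterns( 'poker', 'casino' );
for my $re (@compiled) {
    print "matched: $re\n" if $text =~ $re;
}
```

The compiled list would be built once at startup and then reused for each file that is scanned.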
print pack("A25",pack("V*",map{1919242272+$_}(34481450,-49737472,6228,0,-285028276,6979,-1380265972)))
Thanks for the reply! I want to use this subroutine to scan many files.
With only a few patterns, the speed increases considerably. I need to find a way to load all the patterns in one shot, without that loop. I'm still learning.
@bv
Look at study, especially if you have a lot of patterns you are matching against.
I have added study; in both places, but I can't see any difference when scanning multiple files with this subroutine. Please advise.
#!/usr/bin/perl -w
use strict;
my $patterns = "/path/to/patterns.txt";
my $arg1 = shift;
open (PAT, '<', $patterns) or die "$patterns: $!\n";
my @patterns = <PAT>;
study;
close(PAT);
chomp @patterns;
my $regex_string = join '|', @patterns;
open( FILE, "<", "$arg1") or die "$arg1: $!\n";
$_ = do { local $/; <FILE> };
study;
close(FILE);
if ( /($regex_string)/is ) {print "\n$arg1\n$1\n";}
Did you read the documentation on study?
study attempts to make matches against a string more efficient, but incurs a one-time penalty for the time spent studying the string. It is most beneficial when you are doing many matches against a single string. You should benchmark to determine if you are getting any benefit from study. The first study in your code (line 10) is unnecessary, since you don't have a string in $_ to match against.
You keep saying "subroutine." Is this really in a sub? If so, are you reading in your patterns every time the sub is run? There's a major inefficiency. And once you solve that one, you can look at precompiling your expressions like I originally suggested.
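As a rough sketch of that advice: build the regex once, outside the sub that scans, and pass the compiled object in. The names build_regex and scan_text and the sample patterns are hypothetical; they are not from the original script.

```perl
use strict;
use warnings;

# Build one compiled regex from a list of pattern strings.
# Meant to be called once at startup, not once per scanned file.
sub build_regex {
    my (@patterns) = @_;
    my $alternation = join '|', map { "(?:$_)" } @patterns;
    return qr/($alternation)/is;
}

# Scan one chunk of text with the precompiled regex.
# Returns the matched text, or undef if nothing matched.
sub scan_text {
    my ( $regex, $text ) = @_;
    return $text =~ $regex ? $1 : undef;
}

my $regex = build_regex( 'poker', 'cheap\s+pills' );    # sample patterns (assumed)
for my $text ( "nothing here", "Buy CHEAP   pills now" ) {
    my $hit = scan_text( $regex, $text );
    print defined $hit ? "hit: $hit\n" : "clean\n";
}
```

Wrapping each pattern in (?:...) keeps an alternation inside one pattern from leaking into its neighbours when they are joined.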
print pack("A25",pack("V*",map{1919242272+$_}(34481450,-49737472,6228,0,-285028276,6979,-1380265972)))
Re: Comparing pattern
by graff (Chancellor) on Sep 19, 2009 at 02:39 UTC
Have you tried it like this?
my $patterns = "/path/to/file.txt";
my $arg1 = shift;
open( PATTERNS, "<", $patterns ) or die "$patterns: $!\n";
my @list_patterns = <PATTERNS>;
close PATTERNS;
chomp @list_patterns;
my $list_regex = join '|', @list_patterns;
open( FILE, "<", $arg1 ) or die "$arg1: $!\n";
while (<FILE>) {
    if ( /($list_regex)/ ) {
        print "\n$arg1\n$1\n";
    }
}
If your list of patterns does not include anything that tries to match across a line-break (i.e.: "...\n..."), then you don't need to slurp your whole "arg1" file content into memory at one time. Depending on the files that you are searching through, that can save time by avoiding memory swaps, and depending on what sort of patterns you are looking for, applying the regex to a small string (one line at a time) could be a lot faster than applying it to a whole file.
If your patterns do involve matching across line breaks, loading them all into a single regex (joining them together with "|") will probably speed things up anyway, because you only do one regex match against the whole string.
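To illustrate the difference between one match per pattern and a single joined regex, here is a small sketch. The sample patterns and the two helper names are assumptions for illustration only.

```perl
use strict;
use warnings;

my @patterns = ( 'viagra', 'poker', 'casino' );    # sample patterns (assumed)
my $combined = join '|', @patterns;

# Approach 1: one regex match per pattern.
sub first_hit_loop {
    my ($text) = @_;
    for my $p (@patterns) {
        return $1 if $text =~ /($p)/i;
    }
    return;
}

# Approach 2: a single alternation, one regex match in total.
sub first_hit_combined {
    my ($text) = @_;
    return unless $text =~ /($combined)/i;
    return $1;
}

for my $text ( "come play Poker tonight", "plain text" ) {
    my $hit = first_hit_combined($text);
    print defined $hit ? "hit: $hit\n" : "clean\n";
}
```

The two approaches can report different hits when several patterns match: the loop returns the first pattern in list order, while the alternation returns whichever match starts leftmost in the text.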
Thanks a lot, guys!!! Both of your suggestions helped me solve this problem.
Based on bv's suggestion regarding qr//, I also found a great post: http://www.perlmonks.org/?node_id=661292.
Regexp::Assemble does a great job.
Regexp::Assemble does not track the original pattern correctly for me. I give up, it's too hard for me :(
use Regexp::Assemble;
my $patterns = "/path/to/file.txt";
my $list_regex = Regexp::Assemble->new(file => $patterns);
$list_regex->track( 1 );
open( FILE, "<", "$arg1") or die "$arg1: $!\n";
while (<FILE>) {
if (/$list_regex/) {print "\n$arg1\n$list_regex->matched\n";}
}
close(FILE);
}
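One thing that may contribute to the tracking trouble above (an assumption, not confirmed in the thread): Perl does not interpolate method calls inside double-quoted strings, so "$list_regex->matched" prints the stringified object followed by the literal text ->matched. The Tracker class below is a hypothetical stand-in, used only to show the effect without needing Regexp::Assemble installed.

```perl
use strict;
use warnings;

# Tiny stand-in class (hypothetical) with a method that returns a string.
package Tracker;
sub new     { my ( $class, %args ) = @_; return bless {%args}, $class }
sub matched { my ($self) = @_; return $self->{matched} }

package main;

# The method call is NOT interpolated: the object is stringified and
# the text "->matched" is appended literally.
sub wrong_message {
    my ($obj) = @_;
    return "pattern: $obj->matched";
}

# Calling the method outside the string gives the intended value.
sub right_message {
    my ($obj) = @_;
    return "pattern: " . $obj->matched;
}

my $tracker = Tracker->new( matched => 'part1.*part2' );
print wrong_message($tracker), "\n";
print right_message($tracker), "\n";
```

It is also worth checking the module's documentation for how a match has to be performed when tracking is switched on.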
Now I have this code, but I'm facing a new problem.
In my first example I used both flags, /is.
I need /s so that . matches newlines as well.
This is what I have:
my $patterns = "/path/to/file.txt";
my $arg1 = shift;
open( PATTERNS, "<", $patterns ) or die "$patterns: $!\n";
my @list_patterns = <PATTERNS>;
close PATTERNS;
chomp @list_patterns;
my $regexStr = "(" . join("|", @list_patterns) . ")";
my $list_regex = qr{$regexStr}i;
open( FILE, "<", "$arg1") or die "$arg1: $!\n";
while (<FILE>) {
if (/$list_regex/) {print "\n$arg1\n$1\n";}
}
close(FILE);
Adding s to either
my $list_regex = qr{$regexStr}is;
or
if (/$list_regex/is)
does not solve the problem for a pattern such as:
part1.*part2
This pattern works with my original script, where .* also matches newlines, against input such as:
fggffgfg
part1
hghggh
ghhggh
hggh
part2
ytyty
This is the last problem; otherwise the script works perfectly and is much faster thanks to your advice. Next I will take a look at local $/.
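A short sketch of why the /s flag alone may not help here: when the file is read one line at a time, each match only ever sees a single line, so part1 and part2 are never in the same string. The sample data is shortened from the post; the helper names are hypothetical.

```perl
use strict;
use warnings;

my $pattern = qr/part1.*part2/is;
my $content = "fggffgfg\npart1\nhghggh\npart2\nytyty\n";    # sample data, shortened

# Line by line: each line is matched on its own, so the pattern never
# sees "part1" and "part2" together.
sub match_by_line {
    my ($data) = @_;
    open( my $fh, '<', \$data ) or die "in-memory open: $!\n";
    while ( my $line = <$fh> ) {
        return 1 if $line =~ $pattern;
    }
    return 0;
}

# Slurped: the whole content is one string, so /s lets .* cross newlines.
sub match_slurped {
    my ($data) = @_;
    open( my $fh, '<', \$data ) or die "in-memory open: $!\n";
    my $text = do { local $/; <$fh> };
    return $text =~ $pattern ? 1 : 0;
}

print "by line: ", match_by_line($content), "\n";    # prints 0
print "slurped: ", match_slurped($content), "\n";    # prints 1
```

So patterns that span line breaks need the slurping approach (or some other way of reading multi-line chunks), while single-line patterns can keep the cheaper line-by-line loop.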
graff, based on your suggestion my script is working pretty well now.
Thank you once again!
However, I'm not able to track down the original pattern, as $1 no longer works as you suggested in the first example.
print "\n$arg1\n$1\n";
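One possible way to recover the original pattern without relying on $1 is sketched below: use the combined regex only as a quick yes/no test, then check the individually compiled patterns to find the one responsible. The sample patterns and the name matching_source are assumptions for illustration.

```perl
use strict;
use warnings;
use List::Util qw(first);

my @sources  = ( 'poker', 'page.{0,10}html', 'cheap\s+pills' );    # sample patterns (assumed)
my @compiled = map { qr/$_/is } @sources;
my $combined = do {
    my $alt = join '|', map { "(?:$_)" } @sources;
    qr/$alt/is;
};

# Returns the original pattern text responsible for a hit, or undef.
sub matching_source {
    my ($text) = @_;
    return undef unless $text =~ $combined;    # fast rejection of clean text
    my $i = first { $text =~ $compiled[$_] } 0 .. $#compiled;
    return defined $i ? $sources[$i] : undef;
}

my $source = matching_source("see page_1.html for details");
print defined $source ? "matched by: $source\n" : "clean\n";
```

The extra per-pattern pass only runs for text that already matched, so clean files still cost a single regex match.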
I have 1000+ patterns and I will add more.
I need this scanner to fight those bloody spammers.
Here is an example file; the real ones contain thousands of links and similar lines.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://test.com/index.html</loc>
<lastmod>2009-08-21</lastmod>
<changefreq>monthly</changefreq>
<priority>0.7</priority>
</url>
<url>
<loc>http://test.com/page_1.html</loc>
<lastmod>2009-08-06</lastmod>
<changefreq>monthly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>http://test.com/page_2.html</loc>
<lastmod>2009-08-10</lastmod>
<changefreq>monthly</changefreq>
<priority>0.4</priority>
</url>
I don't know why patterns like these (one per line) fail against this file. The script enters an endless loop, using an entire CPU core.
<a href=.*<a href=.*poker
page.{0,10}.html.*page.{0,10}.html.*page.{0,10}.html.*map
When I change them to
<a href=.*<a href
page.{0,10}.html.*page.{0,10}.html.*page.{0,10}.html
they work without a problem.
Graff, you're an awesome guy!! The final version of the script works exactly the way I want, except for one small problem.
Let me elaborate.
I have this file to scan: example.txt (used by scammers/abusers)
http://uploading.com/files/get/cm3364a5/
This pattern
page.{0,10}html.*?page.{0,10}html.*?</changefreq
works because example.txt contains all the elements of the pattern.
If I change the pattern to
page.{0,10}html.*?page.{0,10}html.*?kkk
(example.txt doesn't contain kkk), the script enters a loop and CPU usage becomes very high.
I think it's related to the large number of repetitions of the same text in the file, because reducing example.txt to only a few lines solves the problem.
Any idea how to solve this bug?
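A likely explanation (an assumption, since the thread does not confirm it) is backtracking: when the final piece never matches, the engine may retry many combinations of the earlier page...html positions before giving up, and the work grows quickly with the number of repetitions in the file. One way to sidestep that, sketched below, is to search for the pieces one after another, each starting where the previous one ended. The helper name matches_in_order and the sample text are hypothetical.

```perl
use strict;
use warnings;

# True if every part matches, in order, each one starting after the
# end of the previous match. $text is a private copy, so its pos()
# does not affect the caller's string.
sub matches_in_order {
    my ( $text, @parts ) = @_;
    pos($text) = 0;
    for my $part (@parts) {
        # Scalar-context /g continues from pos($text).
        return 0 unless $text =~ /$part/gis;
    }
    return 1;
}

my $text = "page_1.html\n<changefreq>monthly</changefreq>\npage_2.html\n"
         . "<changefreq>monthly</changefreq>\n";

print matches_in_order( $text, 'page.{0,10}html', 'page.{0,10}html', '</changefreq' ), "\n";
print matches_in_order( $text, 'page.{0,10}html', 'page.{0,10}html', 'kkk' ), "\n";
```

This is not an exact replacement for a single page.{0,10}html.*?page.{0,10}html.*?kkk regex: each piece is matched only once, with no backtracking into earlier pieces, so rare edge cases could differ. In exchange, a missing final piece costs one failed search instead of a retry for every earlier combination.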