Re: Optimizing a regex (at a tangent)
by Malkavian (Friar) on Jan 29, 2001 at 22:09 UTC
|
Another little hint, though not directly to do with the regex itself:
A lot of the overhead on a large document comes from reading it line by line in a while construct.
You can cut that down by writing a routine that uses 'read' to pull in blocks of data at a time (set the size to something meaningful for your document size; I use about 30k for a log reader I wrote).
This does mean keeping track of lines split across block boundaries and recombining them between reads, but that's happily resolved by using rindex to find the last newline character in the block and buffering whatever follows it for inclusion in the next read.
However, once you get round this extra bit of coding, you end up being able to do your search in a multiline regex, without a lot of the iteration overhead. Using this technique, along with pre-compiled regexes, a log reader here has been optimised from around a 5 min run time on a set of data down to 1 min 20 secs.
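A minimal sketch of the block-read idea described above, not the actual log reader: the helper name count_matches_blockwise, its arguments, and the 30k default are assumptions made for illustration. It reads fixed-size blocks, uses rindex to find the last newline, carries the trailing partial line into the next pass, and runs a pre-compiled multiline regex over the complete lines.

```perl
use strict;
use warnings;

# Hypothetical helper: count matches of a pre-compiled regex in a file,
# reading $block_size bytes at a time instead of line by line.
sub count_matches_blockwise {
    my ($path, $re, $block_size) = @_;
    $block_size ||= 30 * 1024;    # assumed default, roughly the 30k mentioned above

    open my $fh, '<', $path or die "Cannot open $path: $!";
    my ($carry, $count) = ('', 0);

    while (read($fh, my $chunk, $block_size)) {
        my $buf = $carry . $chunk;

        # Find the last newline; everything after it is a partial line.
        my $cut = rindex($buf, "\n");
        if ($cut < 0) {           # no complete line yet, keep buffering
            $carry = $buf;
            next;
        }
        $carry = substr($buf, $cut + 1);
        my $complete = substr($buf, 0, $cut + 1);

        # One regex pass over many lines at once.
        $count++ while $complete =~ /$re/g;
    }

    # Whatever is left is a final line without a trailing newline.
    $count++ while $carry =~ /$re/g;

    close $fh;
    return $count;
}
```

It could be called as count_matches_blockwise($file, qr/^ERROR\b/m), where the /m flag lets ^ match at the start of each line inside the block.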
Anyhow, this is just a little addendum to other comments here, and although indirect, it may help a little in the long run.
Cheers,
Malk
Re: Optimizing a regex
by KM (Priest) on Jan 29, 2001 at 21:46 UTC
|
$pt == "" and $pc == "" should use eq, not ==.
== is for numerical tests.
What exactly are you looking to match? Would all things look a certain way, like:
pagetitle "My title"
or
$pagetitle "My title"
A good place to see a use of qr// is the Untaint.pm module, which uses it. But, it is basically like so:
my $qr = qr!\d{3}\s+\w!;
if ($str =~ /$qr/) {
.. etc...
}
Also, see the Regex Quote-Like Operators sections in perlop.
When you say you can't get it to work properly, what is not working? What have
you tried?
Cheers,
KM
Re: Optimizing a regex
by lemming (Priest) on Jan 29, 2001 at 21:47 UTC
|
Well, one way to optimise it would be to add
last if ($pt ne "" && $pc ne "");
That will get you out of the loop once you've found
what you're looking for. (cue U2)
You also want to use eq instead of == if you're testing
for "". This code prints hello.
$ha = "hi";
if ($ha == "") { print "hello\n"; }
In a numeric comparison, a string that doesn't start with a number
is treated as 0. So $ha equals 0, as does "". If you turn on
warnings this will be pointed out.
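To make the difference concrete, here is a small sketch that runs both comparisons side by side. The variable names $string_equal and $numeric_equal are made up for the demo, and the 'numeric' warning is switched off locally only so the output stays clean.

```perl
use strict;
use warnings;

my $ha = "hi";

# String comparison: "hi" is not the empty string, so this is false.
my $string_equal = ($ha eq "") ? 1 : 0;

# Numeric comparison: both sides numify to 0, so this is true.
my $numeric_equal;
{
    no warnings 'numeric';    # hide the "isn't numeric" warnings for the demo
    $numeric_equal = ($ha == "") ? 1 : 0;
}

print "eq: $string_equal, ==: $numeric_equal\n";    # prints "eq: 0, ==: 1"
```

With warnings left on, the == test would also report that the arguments aren't numeric, which is the hint mentioned above.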
Now if no-one else has done your regex, I'll look at that
more closely.
Re: Optimizing a regex
by runrig (Abbot) on Jan 29, 2001 at 22:07 UTC
|
qr// won't really help in this case because your regexes are constant, i.e., they have no variables in them. You might want to optimize the process, though, by adding a terminating condition so you don't have to process the whole file (if this is all the processing you need, that is). BTW, '==' is for numerical comparisons, 'eq' is for string comparisons:
my ($pt, $pc);
while (<DATAFILE>){
$pt = $1 if !defined $pt and /pagetitle.*?"(.*?)"/i;
$pc = $1 if !defined $pc and /category.*?"(.*?)"/i;
last if defined $pt and defined $pc;
}
Re: Optimizing a regex
by ZydecoSue (Scribe) on Jan 29, 2001 at 22:07 UTC
|
No, that's not the entire loop, just a hastily edited version. What you're not seeing is an edited form of the article I mentioned. It walks through all source documents, checks their file dates, opens them into DATAFILE, does this, and then closes them. Since we're talking several thousand documents, optimization is a concern.
As far as what I want to match, the second, though the first works, too. The documents assign values for page title, site category, and so on. The actual content is a here document. It's a crude form of ASP.
I'm ignoring the leading $ because the variable names don't appear in the content.
Can the expression assigned to your $qr contain an interpolated reference?
I would be happy to post my current attempts, but they don't compile and I'm sure that if I see an example, I'll understand why they're not working.
And thanks to lemming for catching the eq problem and for seeing what I was trying to accomplish. Once I've found a match, I don't want to look for more.
Thank you for replying so nicely. It's nice to see that not everyone is a jerk.
Update #1 - Just saw runrig's reply.
I think it can help because I want one regex that I call twice, where the variable portion is the name of the variable I'm searching for.
Update #2 - Just saw what probably prompted runrig's reply. There's a mistake in the code I posted. This should be clearer:
while (<DATAFILE>){
if ($pt eq "")
{if (/pagetitle.*?"(.*?)"/i){$pt = $1;}}
if ($pc eq "")
{if ( /category.*?"(.*?)"/i){$pc = $1;}}
}
Okay, I cheated...just to show that I was listening. :)
Update #3 - I'm not too worried about the size of the data file, since the first several lines are variable declarations that ensure the right HTML snippets are used and brand the page. Once I have the values I'm after, I bail out of the while loop and move on to the next file. But I'll file the suggestion for later use. :)
|
|
An example of using qr might be (BTW, you don't need to escape quotes in a regex):
my (%search, %found);
for (qw(pagetitle category)) {
$search{$_} = qr/$_.*?"(.*?)"/i;
}
my ($pt, $pc);
while (<DATAFILE>){
my $line = $_;
for (keys %search) {
$found{$_} = $1 if !exists $found{$_} and $line =~ $search{$_};
}
}
Re: Optimizing a regex
by stefan k (Curate) on Jan 29, 2001 at 21:46 UTC
|
Hi,
if that construct does exactly what it needs to, I think you're probably
fine using it anyway (uhm, OK, I posted
the question concerning a Profiler an hour ago *grin*.).
I very rarely get to the point where I have to think about performance, and thus
prefer rapid development and 'saying what I mean'.
Are you in need of good performance here, or is this
point reached just once or twice (or at least fewer than, say, a hundred times) during
your run?
Regards Stefan K
$dom = "skamphausen.de"; ## May The Open Source Be With You!
$Mail = "mail\@$dom"; $Url = "http://www.$dom";
Re: Optimizing a regex
by dws (Chancellor) on Jan 30, 2001 at 00:01 UTC
|
Use one regex instead of two, and stop once you have both.
while ( <DATAFILE> ) {
m/\$(pagetitle|category)\s*=\s*\"(.*?)\"/i or next;
my $which = lc($1);
$pt = $2 if $pt eq "" and $which eq "pagetitle";
$pc = $2 if $pc eq "" and $which eq "category";
last if $pt ne "" and $pc ne "";
}
Re: Optimizing a regex
by petral (Curate) on Jan 30, 2001 at 01:12 UTC
|
If you know the stuff is all in the top few lines and you _know_ that the top few lines are less than (say) 16k, then Malkavian's suggestion will help most:
read(DATAFILE, $_, 16384);
($pt) = /pagetitle.*?"(.*?)"/mi;
($pc) = /category.*?"(.*?)"/im;
This gets you the first of each, if there is one, with no loop at all. (Your code tests each var on each loop to make sure it's not the second
entry for that item. Do you need to do that?)
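For walking several thousand files, the same idea could be wrapped up roughly as follows. This is only a sketch: the helper name header_fields and its $limit argument are hypothetical, and it assumes both values really do sit within the first 16k of each file.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first page title and first category
# found in the first $limit bytes of $path (undef for any that are missing).
sub header_fields {
    my ($path, $limit) = @_;
    $limit ||= 16384;

    open my $fh, '<', $path or die "Cannot open $path: $!";
    my $head = '';
    read($fh, $head, $limit);    # one read, no line loop
    close $fh;

    # A match in list context returns the captures of the first match only.
    # Without /s, "." never crosses a newline, so each match stays on one line.
    my ($pt) = $head =~ /pagetitle.*?"(.*?)"/i;
    my ($pc) = $head =~ /category.*?"(.*?)"/i;
    return ($pt, $pc);
}
```

A caller would then write my ($pt, $pc) = header_fields($file); and check each value with defined before using it.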
p