Optimizing a regex

ZydecoSue has asked for the wisdom of the Perl Monks concerning the following question:

Re: Optimizing a regex (at a tangent) by Malkavian (Friar) on Jan 29, 2001 at 22:09 UTC
Another little hint, though not directly to do with the regex itself: a lot of the overhead in a large document comes from reading it line by line in a while loop. You can avoid much of that by writing a routine that calls read to pull in a block of data at a time (set the size to something meaningful for your document size; I use about 30k for a log reader I wrote). This does mean keeping track of lines that are split across block boundaries and recombining them on the next pass, but that's happily resolved by using rindex to find the last newline character in the block and buffering whatever follows it for inclusion in the next read. Once you get round this extra bit of coding, you can do your search with a multiline regex, without a lot of the iteration overhead. Using this technique, along with pre-compiled regexes, a log reader here has been optimised from around a 5 min run time on a set of data down to 1 min 20 secs. Anyhow, this is just a little addendum to other comments here, and although indirect, it may help a little in the long run. Cheers, Malk
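A minimal sketch of the block-reading idea described above, not Malkavian's actual log reader. The helper name scan_in_blocks, the sample log text, and the tiny 7-byte block size are assumptions chosen so that lines really do get split across reads; it also assumes the pattern has one capture group and never needs to match across a newline.

```perl
use strict;
use warnings;

# Hypothetical helper: read $fh in blocks of $block_size bytes and return
# every capture of $re found in the complete lines seen so far.
sub scan_in_blocks {
    my ($fh, $re, $block_size) = @_;
    my @hits;
    my $carry = '';    # partial last line held over from the previous block

    while (read($fh, my $chunk, $block_size)) {
        my $buf = $carry . $chunk;

        # Everything after the last newline is an incomplete line.
        my $cut = rindex($buf, "\n");
        if ($cut < 0) {    # no newline yet: keep accumulating
            $carry = $buf;
            next;
        }
        $carry = substr($buf, $cut + 1);
        my $complete = substr($buf, 0, $cut + 1);

        push @hits, $complete =~ /$re/g;    # one regex pass over many lines
    }

    # A final line with no trailing newline is still in $carry.
    push @hits, $carry =~ /$re/g if length $carry;
    return @hits;
}

# Usage with an in-memory filehandle standing in for a real log file.
my $log = "ok 1\nERROR disk full\nok 2\nERROR timeout\n";
open my $fh, '<', \$log or die "cannot open in-memory file: $!";
my @errors = scan_in_blocks($fh, qr/^ERROR (.*)$/m, 7);
close $fh;

print "$_\n" for @errors;
```

For real files a block size in the tens of kilobytes, as suggested above, is the more sensible choice; the small value here only exercises the carry-over logic.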
Re: Optimizing a regex by KM (Priest) on Jan 29, 2001 at 21:46 UTC
$pt == "" and $pc == "" should use eq, not ==. == is for numerical tests. What exactly are you looking to match? Would all things look a certain way, like: pagetitle "My title" or $pagetitle "My title"? A good place to see a use of qr// is the Untaint.pm module, which uses it. But it is basically like so: `my $qr = qr!\d{3}\s+\w!; if ($str =~ /$qr/) { .. etc... }` Also, see the Regexp Quote-Like Operators section in perlop. When you say you can't get it to work properly, what is not working? What have you tried? Cheers, KM
Re: Optimizing a regex by lemming (Priest) on Jan 29, 2001 at 21:47 UTC
Well, one way to optimise it would be to add `last if ($pt ne "" && $pc ne "");` That will get you out of the loop once you've found what you're looking for. (cue U2) You also want to use eq instead of == if you're testing for "". This code prints hello. `$ha = "hi"; if ($ha == "") { print "hello\n"; }` In a numeric comparison, a string with no leading digits is treated as 0, so $ha equals 0, as does "". If you turn on warnings this will be pointed out. Now if no-one else has done your regex, I'll look at that more closely.
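To see the difference side by side, here is a small sketch that runs both comparisons with warnings enabled and collects whatever Perl complains about. The variable names other than $ha are assumptions made for the illustration.

```perl
use strict;
use warnings;

my $ha = "hi";

# Collect warnings instead of printing them, so they can be inspected.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# Numeric comparison: both strings are converted to 0 first.
my $numeric_equal = ($ha == "") ? 1 : 0;

# String comparison: compares the characters, which is what was intended.
my $string_equal = ($ha eq "") ? 1 : 0;

print "== says equal: $numeric_equal\n";
print "eq says equal: $string_equal\n";
print "warning: $_" for @warnings;
```

The numeric test reports the two strings as equal while the string test does not, and the collected messages are the "isn't numeric" warnings mentioned above.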
Re: Optimizing a regex by runrig (Abbot) on Jan 29, 2001 at 22:07 UTC
qr// won't really help in this case because your regexes are constant, i.e., they have no variables in them. You might want to optimize the process, though, by adding a terminating condition so you don't have to process the whole file (if this is all the processing you need, that is). BTW, '==' is for numerical comparisons, 'eq' is for character comparisons: `my ($pt, $pc); while (<DATAFILE>){ $pt = $1 if !defined $pt and /pagetitle.*?"(.*?)"/i; $pc = $1 if !defined $pc and /category.*?"(.*?)"/i; last if defined $pt and defined $pc; }`
Re: Optimizing a regex by ZydecoSue (Scribe) on Jan 29, 2001 at 22:07 UTC
No, that's not the entire loop, just a hastily edited version. What you're not seeing is an edited form of the article I mentioned. It walks through all source documents, checks their file dates, opens them into DATAFILE, does this, and then closes them. Since we're talking several thousand documents, optimization is a concern. As far as what I want to match, the second, though the first works, too. The documents assign values for page title, site category, and so on. The actual content is a here document. It's a crude form of ASP. I'm ignoring the leading $ because the variable names don't appear in the content. Can the expression assigned to your $qr contain an interpolated reference? I would be happy to post my current attempts, but they don't compile and I'm sure that if I see an example, I'll understand why they're not working. And thanks to lemming for catching the eq problem and for seeing what I was trying to accomplish. Once I've found a match, I don't want to look for more. Thank you for replying so nicely. It's nice to see that not everyone is a jerk.

Update #1 - Just saw runrig's reply. I think it can help because I want one regex that I call twice, where the variable portion is the name of the variable I'm searching for.

Update #2 - Just saw what probably prompted runrig's reply. There's a mistake in the code I posted. This should be clearer: `while (<DATAFILE>){ if ($pt eq "") {if (/pagetitle.*?"(.*?)"/i){$pt = $1;}} if ($pc eq "") {if ( /category.*?"(.*?)"/i){$pc = $1;}} }` Okay, I cheated...just to show that I was listening. :)

Update #3 - I'm not too worried about the size of the data file, since the first several lines are variable declarations that ensure the right HTML snippets are used and brand the page. Once I have the values I'm after, I bail out of the while loop and move on to the next file. But I'll file the suggestion for later use. :)
Re: Re: Optimizing a regex by runrig (Abbot) on Jan 29, 2001 at 22:21 UTC
An example of using qr might be (BTW, you don't need to escape quotes in a regex): `my (%search, %found); for (qw(pagetitle category)) { $search{$_} = qr/$_.*?"(.*?)"/i; } while (<DATAFILE>){ my $line = $_; for (keys %search) { $found{$_} = $1 if !exists $found{$_} and $line =~ $search{$_}; } }`
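For anyone who wants to run this outside the original script, here is a self-contained sketch of the same idea: one precompiled pattern per variable name, with the name interpolated into qr//. The sample document text and the in-memory filehandle are assumptions standing in for DATAFILE; the \Q...\E quoting and the early exit are additions not present in the snippet above.

```perl
use strict;
use warnings;

# Assumed sample of the first lines of a source document.
my $doc = <<'END';
$pagetitle = "Cajun Recipes";
$category = "Food";
$pagetitle = "ignored second title";
END

# Build one precompiled pattern per variable name. \Q...\E quotes the
# interpolated name in case it ever contains regex metacharacters.
my (%search, %found);
$search{$_} = qr/\Q$_\E.*?"(.*?)"/i for qw(pagetitle category);

open my $fh, '<', \$doc or die "cannot open in-memory file: $!";
while (my $line = <$fh>) {
    for my $name (keys %search) {
        next if exists $found{$name};    # keep only the first match
        $found{$name} = $1 if $line =~ $search{$name};
    }
    # Stop reading as soon as every name has a value.
    last if keys %found == keys %search;
}
close $fh;

print "$_: $found{$_}\n" for sort keys %found;
```

Adding another variable to look for then only means adding its name to the qw() list.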
Re: Optimizing a regex by stefan k (Curate) on Jan 29, 2001 at 21:46 UTC
Hi, if that construct does exactly what it needs to, I think you're probably fine using it anyway (uhm, OK, I posted the question concerning a Profiler an hour ago *grin*). I very rarely get to the point where I have to think about performance, and thus prefer rapid development and 'saying what I mean'. Are you in need of good performance here, or is this point reached just once or twice (or at least fewer than, say, a hundred times) during your run? Regards Stefan K `$dom = "skamphausen.de"; ## May The Open Source Be With You! $Mail = "mail@$dom"; $Url = "http://www.$dom";`
Re: Optimizing a regex by dws (Chancellor) on Jan 30, 2001 at 00:01 UTC
Use one regex instead of two, and stop once you have both. `while ( <DATAFILE> ) { m/\$(pagetitle|category)\s*=\s*\"(.*?)\"/i or next; my $which = lc($1); $pt = $2 if $pt eq "" and $which eq "pagetitle"; $pc = $2 if $pc eq "" and $which eq "category"; last if $pt ne "" and $pc ne ""; }`
Re: Optimizing a regex by petral (Curate) on Jan 30, 2001 at 01:12 UTC
If you know the stuff is all in the top few lines and you _know_ that the top few lines are less than (say) 16k, then Malkavian's suggestion will help most: `read(DATAFILE, $_, 16384); ($pt) = /pagetitle.*?"(.*?)"/mi; ($pc) = /category.*?"(.*?)"/im;` This gets you the first of each, if there is one, with no loop at all. (Your code tests each var on each loop to make sure it's not the second entry for that item. Do you need to do that?) p
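A runnable sketch of the no-loop approach, assuming a made-up sample document and an in-memory filehandle in place of DATAFILE. It matches against a named buffer ($head, an assumed name) rather than $_, and relies on the fact that . does not match a newline without /s, so each capture stays within one line.

```perl
use strict;
use warnings;

# Assumed sample document: declarations first, then the page content.
my $doc = <<'END';
$pagetitle = "Cajun Recipes";
$category = "Food";
print <<HTML;
<p>body text</p>
HTML
END

open my $fh, '<', \$doc or die "cannot open in-memory file: $!";

# One read call pulls in up to 16k; shorter files simply return less.
my $got = read($fh, my $head, 16384);
defined $got or die "read failed: $!";
close $fh;

# List-context matches return the first capture, or nothing if no match.
my ($pt) = $head =~ /pagetitle.*?"(.*?)"/i;
my ($pc) = $head =~ /category.*?"(.*?)"/i;

print "title:    ", (defined $pt ? $pt : "(not found)"), "\n";
print "category: ", (defined $pc ? $pc : "(not found)"), "\n";
```

If a declaration could ever sit beyond the first 16k, the variable comes back undefined, so a fallback to the line-by-line loop would be needed in that case.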