I am pretty hopeless at regexps; my eyes glaze over at anything more complicated than (?:foo). That said, I have a performance problem. I'm looking at several gigabytes of log files and I want to extract "interesting" lines. Stripped down to the bare bones, the problem looks like this:
#! /usr/bin/perl -w

my $target = shift || 'word';
my $re = qr/a=<(.*?$target.*?)>/;

while( <DATA> ) {
    print "[$1]\n" if /$re/;
}

__DATA__
a=<wordy>
b=<rappinghood>
a=<thisword>
b=<thatword>
a=<foreword>
b=<junk>
a=<nothing>
b=<word>
b=<wordplay>
a=<swords>
b=<end>
In other words, I'm interested in a= lines that contain word (or whatever I specify) anywhere between the angle brackets. If I do see it, then I want everything between the angle brackets. (And in other variants I might be interested in a= and b=, but that's another story.) The above program emits:
[wordy]
[thisword]
[foreword]
[swords]
I'm using .*? to elide the remaining characters between my target and the delimiters. However, when I look at the behaviour of the match engine (with use re 'debug'), I see lots of backtracking.
My question is this: is it possible to write a regexp for this problem that does not involve backtracking? The other question is whether it improves performance to any significant degree...
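For the performance half of the question, measuring is probably the only honest answer. Here is a rough sketch using the core Benchmark module to compare the pattern above against a variant that uses [^>]* in place of .*?. The variant, the count_matches helper and the made-up sample lines are all assumptions for illustration, not real log data, and the numbers will depend on the Perl version and on what the real lines look like.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $target = 'word';

# The two candidate patterns: the original, and an assumed variant
# whose wildcards cannot run past the closing angle bracket.
my %re = (
    lazy_dot => qr/a=<(.*?$target.*?)>/,
    negated  => qr/a=<([^>]*?$target[^>]*)>/,
);

# Made-up sample lines: half contain the target, half do not.
my @lines;
for (1 .. 500) {
    push @lines, "a=<this${target}y> rest of a longer log line\n",
                 "a=<nothing> rest of a longer log line\n";
}

# Hypothetical helper: count how many sample lines a pattern matches.
sub count_matches {
    my ($re) = @_;
    my $n = 0;
    for (@lines) { $n++ if /$re/ }
    return $n;
}

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese(-1, {
    lazy_dot => sub { count_matches($re{lazy_dot}) },
    negated  => sub { count_matches($re{negated}) },
});
```

To get figures that mean anything, swap the made-up lines for a few thousand lines taken from the real logs.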
The corollary is that, hopefully, given I understand the problem domain, the solution might help me understand regexps better, and possibly be able to transpose it to other problems in the future.
Thanks.
Update: (meta-note: hooray for being able to update SoPWs!) Duh! Seconds after creating this node, it of course dawns on me that I can write
my $re = qr/a=<(.*?$target[^>]*)>/;

This is probably half the issue... but what about the leading characters between the opening delimiter and the target?
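One possible way to treat the leading characters the same way is to use a negated character class on both sides of the target, so that neither wildcard can run past the closing bracket. This is only a sketch: the interesting helper is a hypothetical name, and the \Q...\E quoting is an addition that assumes the target is literal text rather than a pattern. The lazy [^>]*? still tries the target at each position, so this bounds the retrying to the inside of the brackets rather than removing it entirely.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the bracketed contents of a= fields
# that contain $target somewhere between < and >.
sub interesting {
    my ($target, @lines) = @_;
    # [^>]* on both sides cannot run past the closing bracket;
    # \Q...\E quotes any regexp metacharacters in the target (an assumption:
    # drop it if the target is meant to be a pattern).
    my $re = qr/a=<([^>]*?\Q$target\E[^>]*)>/;
    return map { /$re/ ? $1 : () } @lines;
}

# The same sample data as in the original program, one field per line.
my @sample = map { "$_\n" } qw(
    a=<wordy> b=<rappinghood> a=<thisword> b=<thatword>
    a=<foreword> b=<junk> a=<nothing> b=<word>
    b=<wordplay> a=<swords> b=<end>
);

print "[$_]\n" for interesting('word', @sample);
```

On the sample data this prints the same four lines as the original program. Comparing the use re 'debug' output for the two patterns on a long line should show whether the engine does less work.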
In reply to Removing backtracking from a .*? regexp by grinder