grinder has asked for the wisdom of the Perl Monks concerning the following question:
I am pretty hopeless at regexps; my eyes glaze over at anything more complicated than (?:foo). That said, I have a performance problem. I'm looking at several gigabytes of log files and I want to extract "interesting" lines. Stripped down to the bare bones, the problem looks like this:
    #! /usr/bin/perl -w

    my $target = shift || 'word';
    my $re = qr/a=<(.*?$target.*?)>/;

    while( <DATA> ) {
        print "[$1]\n" if /$re/;
    }

    __DATA__
    a=<wordy>
    b=<rappinghood>
    a=<thisword>
    b=<thatword>
    a=<foreword>
    b=<junk>
    a=<nothing>
    b=<word>
    b=<wordplay>
    a=<swords>
    b=<end>
In other words, I'm interested in a= lines that contain word (or whatever I specify) anywhere between the angle brackets. If I do see it, then I want everything between the angle brackets. (And in other variants I might be interested in a= and b=, but that's another story.) The above program emits:
    [wordy]
    [thisword]
    [foreword]
    [swords]
I'm using .*? to elide the remaining characters between my target and the delimiters. However, when I look at the behaviour of the match engine (with use re 'debug'), I see lots of backtracking.
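A side effect of using . here is worth a small illustration: the dot also matches the > delimiter, so the lazy quantifier is free to keep expanding past the end of the a= field while it hunts for the target. The sketch below assumes a hypothetical log line carrying two fields on one line, which is not the case in the sample data above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $target = 'word';
my $re     = qr/a=<(.*?$target.*?)>/;    # pattern from the question

# Hypothetical line: the a= field does NOT contain the target,
# but a later b= field on the same line does.
my $line = 'a=<junk> b=<word>';

# In list context the match returns the captured group.
my ($captured) = $line =~ $re;

# The lazy .*? walks across "> b=<" until it finds the target.
print "[$captured]\n" if defined $captured;    # prints "[junk> b=<word]"
```

If every line really holds exactly one field, this never bites, but it is part of why the engine has so much room to try alternatives.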
My question is this: is it possible to write a regexp for this problem that does not involve backtracking? The other question is whether it improves performance to any significant degree...
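Whether a rewrite pays off is easiest to settle by measuring. Here is a minimal sketch using the core Benchmark module to compare the original pattern with a negated-character-class variant. The sample lines are synthetic and the names count_hits and %patterns are made up for illustration, so treat any numbers as indicative only and rerun against real log data.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $target = 'word';

# Synthetic sample lines (an assumption; real log lines will differ).
my @lines = map { ("a=<${_}word>", "b=<word$_>", "a=<nothing$_>") } 1 .. 1000;

my %patterns = (
    # Pattern from the question.
    dot_lazy => qr/a=<(.*?$target.*?)>/,
    # Negated-class variant; ${target} keeps the following [ from
    # being read as an array subscript.
    negclass => qr/a=<([^>]*?${target}[^>]*)>/,
);

# Count how many sample lines a given pattern matches.
sub count_hits {
    my ($re) = @_;
    my $n = 0;
    for (@lines) { $n++ if /$re/ }
    return $n;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    dot_lazy => sub { count_hits($patterns{dot_lazy}) },
    negclass => sub { count_hits($patterns{negclass}) },
});
```

Both variants should report the same number of hits on this data; only the timing table tells you whether the difference matters for your files.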
The corollary is that, hopefully, given that I understand the problem domain, the solution might help me understand regexps better, and I might be able to transpose it to other problems in the future.
Thanks.
Update: (meta-note: hooray for being able to update SoPWs!) Duh! Seconds after creating this node, it of course dawns on me that I can write
    my $re = qr/a=<(.*?$target[^>]*)>/;

This is probably half the issue... but what about the leading characters between the opening delimiter and the target?
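Two possible ways to handle the leading characters, offered as a sketch rather than a definitive answer. The first applies the same negated class to the leading side as well. The second swaps the technique entirely: capture the whole field with one greedy class, then look for the target with index(), which is a plain substring search and involves no regexp alternatives at all. The names extract_regex and extract_index are hypothetical, and \Q...\E is added on the assumption that the target is meant as a literal string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $target = 'word';

# Variant 1: negated character classes on both sides of the target.
my $re = qr/a=<([^>]*?\Q$target\E[^>]*)>/;

sub extract_regex {
    my ($line) = @_;
    return $line =~ $re ? $1 : undef;
}

# Variant 2: grab the field contents first, then search them with index().
sub extract_index {
    my ($line) = @_;
    return undef unless $line =~ /a=<([^>]*)>/;
    return index($1, $target) >= 0 ? $1 : undef;
}

for my $line ('a=<wordy>', 'b=<thatword>', 'a=<swords>', 'a=<nothing>') {
    my $hit = extract_index($line);
    print "[$hit]\n" if defined $hit;
}
```

Note that extract_index only inspects the first a= field on a line, which is fine for the one-field-per-line data shown here but would need a loop otherwise.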
Replies are listed 'Best First'.

Re: Removing backtracking from a .*? regexp
by Art_XIV (Hermit) on Nov 17, 2003 at 20:36 UTC
by diotalevi (Canon) on Nov 17, 2003 at 20:44 UTC
by Art_XIV (Hermit) on Nov 17, 2003 at 21:02 UTC
by BrowserUk (Patriarch) on Nov 17, 2003 at 21:30 UTC

Re: Removing backtracking from a .*? regexp
by ysth (Canon) on Nov 17, 2003 at 18:50 UTC

Re: Removing backtracking from a .*? regexp
by Anonymous Monk on Nov 17, 2003 at 18:44 UTC

Re: Removing backtracking from a .*? regexp
by Roger (Parson) on Nov 18, 2003 at 00:45 UTC
by Anonymous Monk on Nov 18, 2003 at 01:15 UTC

Re: Removing backtracking from a .*? regexp
by BrowserUk (Patriarch) on Nov 18, 2003 at 01:43 UTC

Re: Removing backtracking from a .*? regexp
by Anonymous Monk on Nov 18, 2003 at 20:47 UTC