parsing question

Washie101 has asked for the wisdom of the Perl Monks concerning the following question:

Re: parsing question by Abigail-II (Bishop) on Sep 12, 2003 at 11:15 UTC
`/_test(?>\s+)(?!<)/` Abigail
Re: Re: parsing question by allolex (Curate) on Sep 12, 2003 at 12:27 UTC
Abigail's extended regular expression is also an opportunity to show you a nifty module called YAPE::Regex::Explain.

    ladoix% cat 290992
    #!/usr/bin/perl
    use YAPE::Regex::Explain;
    print YAPE::Regex::Explain->new(qr/_test(?>\s+)(?!<)/)->explain;

    ladoix% perl 290992
    The regular expression:

    (?-imsx:_test(?>\s+)(?!<))

    matches as follows:

    NODE                     EXPLANATION
    ----------------------------------------------------------------------
    (?-imsx:                 group, but do not capture (case-sensitive)
                             (with ^ and $ matching normally) (with . not
                             matching \n) (matching whitespace and #
                             normally):
    ----------------------------------------------------------------------
      _test                    '_test'
    ----------------------------------------------------------------------
      (?>                      match (and do not backtrack afterwards):
    ----------------------------------------------------------------------
        \s+                      whitespace (\n, \r, \t, \f, and " ") (1
                                 or more times (matching the most amount
                                 possible))
    ----------------------------------------------------------------------
      )                        end of look-ahead
    ----------------------------------------------------------------------
      (?!                      look ahead to see if there is not:
    ----------------------------------------------------------------------
        <                        '<'
    ----------------------------------------------------------------------
      )                        end of look-ahead
    ----------------------------------------------------------------------
    )                        end of grouping
    ----------------------------------------------------------------------

-- Allolex
Re: parsing question by Abigail-II (Bishop) on Sep 12, 2003 at 12:50 UTC
Unfortunately, it only explains what the regex does; it doesn't explain why it does so. Perhaps the most subtle part of the regex is `(?>\s+)`. Can you explain why it uses "no backtracking"? ;-) Abigail
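To make the role of `(?>\s+)` concrete, here is a minimal sketch (not from the thread; the sample strings are made up for illustration) that runs the atomic pattern and a plain non-capturing variant against the same inputs. With several spaces before `<`, the plain `\s+` can give one space back, so the lookahead then sees a space instead of `<` and the match succeeds. The atomic group refuses to give anything back, so the match fails as intended.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample inputs, chosen only for illustration.
my @samples = (
    "_test foo",        # one space, no bracket
    "_test <foo>",      # one space, then '<'
    "_test   <foo>",    # several spaces, then '<'
);

my $plain  = qr/_test(?:\s+)(?!<)/;   # \s+ may give back whitespace when backtracking
my $atomic = qr/_test(?>\s+)(?!<)/;   # \s+ keeps everything it matched

for my $s (@samples) {
    printf "%-18s plain=%-8s atomic=%s\n",
        "'$s'",
        ($s =~ $plain  ? "match" : "no match"),
        ($s =~ $atomic ? "match" : "no match");
}
```

The two patterns should only disagree on the third string. On Perl 5.10 or later, the possessive quantifier `\s++` is another way to spell the same no-backtracking behaviour.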
2Re: parsing question by bart (Canon) on Sep 13, 2003 at 10:31 UTC
Re: Re: parsing question by allolex (Curate) on Sep 12, 2003 at 13:06 UTC
Re: parsing question by Abigail-II (Bishop) on Sep 12, 2003 at 13:13 UTC
Re: Re: parsing question by Roger (Parson) on Sep 15, 2003 at 02:42 UTC
Out of interest in the experimental `(?>...)` construct, I ran a benchmark with the following little test:

    use Benchmark;

    $str1 = "_test (followed by 1 or more spaces)";
    $str2 = "_test < xxx >";

    timethese(1000000, {
        'p1' => '&p1;',
        'p2' => '&p2;',
        'p3' => '&p3;',
        'p4' => '&p4;',
    });

    sub p1 () { $str1 =~ /_test(?>\s+)(?!<)/; }
    sub p2 () { $str1 =~ /_test(?:\s+)(?!<)/; }
    sub p3 () { $str2 =~ /_test(?>\s+)(?!<)/; }
    sub p4 () { $str2 =~ /_test(?:\s+)(?!<)/; }

I got the following results:

    Benchmark: timing 1000000 iterations of p1, p2, p3, p4...
    p1: 3 wallclock secs ( 3.00 usr + 0.00 sys = 3.00 CPU) @ 333333.33/s (n=1000000)
    p2: 3 wallclock secs ( 2.79 usr + 0.00 sys = 2.79 CPU) @ 358422.94/s (n=1000000)
    p3: 3 wallclock secs ( 3.09 usr + 0.00 sys = 3.09 CPU) @ 323624.60/s (n=1000000)
    p4: 3 wallclock secs ( 2.82 usr + 0.00 sys = 2.82 CPU) @ 354609.93/s (n=1000000)

It seems that `(?>...)` runs slower than `(?:...)` by as much as 10 percent. So am I correct to say that, optimization-wise, `(?>...)` might not be the first choice?
Re: parsing question by Abigail-II (Bishop) on Sep 15, 2003 at 07:01 UTC
Considering that `/_test(?:\s+)(?!<)/` is wrong, as demonstrated elsewhere in this thread, I fail to see your point. Abigail
Re: Re: parsing question by Roger (Parson) on Sep 15, 2003 at 09:07 UTC
Re: parsing question by Abigail-II (Bishop) on Sep 15, 2003 at 11:36 UTC
Re: Re: Re: parsing question by Anonymous Monk on Sep 15, 2003 at 03:06 UTC
Switching `timethese` with `cmpthese`, here's the math:

Win32 ActivePerl 5.6.1 (Build 633)

           Rate   p4   p3   p1   p2
    p4 865052/s   --  -1%  -1%  -4%
    p3 876424/s   1%   --  -0%  -3%
    p1 877193/s   1%   0%   --  -3%
    p2 901713/s   4%   3%   3%   --

Win32 ActivePerl 5.8.0 (build 804)

           Rate   p3   p1   p2   p4
    p3 831255/s   --  -3%  -5%  -8%
    p1 853971/s   3%   --  -3%  -5%
    p2 876424/s   5%   3%   --  -3%
    p4 900901/s   8%   5%   3%   --

`p1` and `p3` use the "cut" operator. The optimization depends on your perl version.
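The script behind these numbers isn't shown, so the following is only a guess at what the switch to `cmpthese` might look like. It assumes code references instead of the original string snippets, and the hash name `%tests` and the one-CPU-second run time are arbitrary choices for this sketch. One thing to keep in mind when reading any of these timings: `$str2` has a single space before `<`, so both patterns fail on it in the same way, and the benchmark never exercises the multi-space input where the two regexes actually give different answers.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Same test strings as the benchmark earlier in the thread.
my $str1 = "_test (followed by 1 or more spaces)";
my $str2 = "_test < xxx >";

# Hypothetical layout: one code ref per pattern/string combination.
my %tests = (
    p1 => sub { $str1 =~ /_test(?>\s+)(?!<)/ },   # atomic group, matching input
    p2 => sub { $str1 =~ /_test(?:\s+)(?!<)/ },   # plain group, matching input
    p3 => sub { $str2 =~ /_test(?>\s+)(?!<)/ },   # atomic group, failing input
    p4 => sub { $str2 =~ /_test(?:\s+)(?!<)/ },   # plain group, failing input
);

# A negative count asks Benchmark to run each test for at least
# that many CPU seconds instead of a fixed number of iterations.
cmpthese(-1, \%tests);
```

Differences of a few percent between runs are well within the noise for a benchmark this small, so the ranking it prints may change from run to run.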
Re: parsing question by flounder99 (Friar) on Sep 12, 2003 at 13:20 UTC
`/_test\s+(?!\s|<)/;` also works, but without using any so-called "experimental" extended regex patterns. -- flounder
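A quick way to see why this works, reading the lookahead as `(?!\s|<)`: if `\s+` gives back a space, the next character is whitespace and the lookahead rejects it, so the only way to match is for `\s+` to have consumed all the whitespace. The sketch below (sample strings invented for illustration) compares it with the atomic version and counts any disagreements.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $atomic    = qr/_test(?>\s+)(?!<)/;   # Abigail's pattern
my $lookahead = qr/_test\s+(?!\s|<)/;    # flounder's pattern, as read above

# Hypothetical inputs covering one space, many spaces, tabs,
# trailing whitespace, and no whitespace at all.
my @samples = (
    "_test foo",
    "_test <foo>",
    "_test   <foo>",
    "_test\t\tbar",
    "_test ",
    "_test",
);

my $disagreements = 0;
for my $s (@samples) {
    my $got_atomic    = $s =~ $atomic    ? 1 : 0;
    my $got_lookahead = $s =~ $lookahead ? 1 : 0;
    $disagreements++ if $got_atomic != $got_lookahead;
    printf "%-18s atomic=%d lookahead=%d\n", "'$s'", $got_atomic, $got_lookahead;
}

print "disagreements: $disagreements\n";
```

A handful of samples is not a proof, but the two patterns should agree on every input here.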
Re: parsing question by hmerrill (Friar) on Sep 12, 2003 at 12:32 UTC
Seems to me you need another regex to test if you've found a line containing the angle brackets:

    if ($line =~ /_test(\s+)</) {
        ### skip this one ###
    } elsif ($line =~ /_test(\s+)/) {
        print "found\n";
    }

HTH.
Re: Re: parsing question by Washie101 (Novice) on Sep 12, 2003 at 12:36 UTC
I was trying to avoid an if/else statement. Abigail's solution worked a treat. Thanks a million!