
That'll work fine, but it will have efficiency problems on some types of data. Here's how the regexp engine would try to match that pattern when successful:

  find the first occurrence of '<IPDR>'      # fast
  walk the string to the '</IPDR>'           # fastish
  skip back to the last occurrence of 'test' # fast
  walk forward to the '</IPDR>' again        # fastish
  signal success                             # done

However, if the tail were missing (e.g. if the string had '<IPDR>' at the end by mistake instead of '</IPDR>'), it would go like this:

  find the first occurrence of '<IPDR>'      # fast
  walk the string to the end                 # fastish
  skip back to the last occurrence of 'test' # fast
  walk forward to end of string again        # fastish
  repeat last two steps for each additional occurrence of 'test'
    # slow - quadratic in the number of occurrences of 'test'
  (no more 'test's to try) signal failure    # done

Now, this may not be a problem for the original poster: they may already have confirmed that their data is well-formed. However, it is an avoidable problem: either check for 'test' in a separate grep (as in your original solution), or use the 'cut' operator, (?>...), to avoid the useless quadratic backtracking.
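To make the two options concrete, here is a minimal sketch of each. The sub names (match_two_step, match_cut) and the sample strings are made up for illustration, and the two-step version is only my guess at the shape of a "separate check" approach, not the code from the earlier post.

```perl
use strict;
use warnings;

# Hypothetical sample data: one well-formed record, one with a broken tail.
my $good = "<IPDR>" . ("x" x 20) . ("test" x 3) . ("x" x 20) . "</IPDR>";
my $bad  = "<IPDR>" . ("x" x 20) . ("test" x 3) . ("x" x 20) . "</IPDR";

# Option 1: find each complete record first, then look for 'test'
# inside it with a second, independent match.
sub match_two_step {
    my ($str) = @_;
    while ($str =~ m{<IPDR>((?:(?!</IPDR>).)*)</IPDR>}sig) {
        return 1 if $1 =~ /test/i;
    }
    return 0;
}

# Option 2: a single regexp, with everything up to and including 'test'
# wrapped in an atomic group so the engine cannot go back and retry
# earlier occurrences of 'test' once the tail fails to match.
sub match_cut {
    my ($str) = @_;
    return $str =~ m{
        (?> <IPDR> (?:(?!</IPDR>).)* test )
        (?:(?!</IPDR>).)*
        </IPDR>
    }sxi ? 1 : 0;
}

printf "%-4s two-step=%d cut=%d\n", $_->[0],
    match_two_step($_->[1]), match_cut($_->[1])
    for [good => $good], [bad => $bad];
```

Both subs should agree on these inputs; the difference is only in how much work the engine does before giving up on the broken string.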

Here's some code to benchmark the difference, trying with and without the 'cut' operator, with strings that both do and don't match, and with 1, 10, 100 or 1000 copies of 'test' in the string:

use strict;
use Benchmark qw/ cmpthese timethese /;
my($head, $tailg, $tailb, $test, $fill)
        = ("<IPDR>", "</IPDR>", "</IPDR", "test", "x" x 100);
my $reuncut = qr{
    $head
    (?:(?!$tailg).)*
    $test
    (?:(?!$tailg).)*
    $tailg
}sxio;
my $recut = qr{
    (?>
        $head
        (?:(?!$tailg).)*
        $test
    )
    (?:(?!$tailg).)*
    $tailg
}sxio;

my $trial;
for my $bool (qw/ g b /) {
    my $tail = ($bool eq 'g') ? $tailg : $tailb;
    for my $cut (qw/ cut uncut /) {
        my $re = ($cut eq 'cut') ? $recut : $reuncut;
        for my $count (1, 10, 100, 1000) {
            my $str = join '', $head, $fill, $test x $count, $fill, $tail;
            $trial->{"$cut$count$bool"} = sub {
                die if ($bool eq 'g') ^ ($str =~ $re);
            };
        }
    }
}
timethese(-1, $trial);

And the (reformatted) results:

   uncut1g:   12799.05/s
     cut1g:   12799.05/s
   uncut1b:   11821.50/s
     cut1b:    8219.27/s

  uncut10g:   11486.54/s
    cut10g:   11486.54/s
  uncut10b:    2559.05/s
    cut10b:    7244.34/s

 uncut100g:    5743.27/s
   cut100g:    5743.27/s
 uncut100b:     125.00/s
   cut100b:    3381.13/s

uncut1000g:     956.48/s
  cut1000g:     956.48/s
uncut1000b:       1.85/s
  cut1000b:     532.38/s

From this it is clear that a) the cut operator costs nothing when it is not used (all the g(ood) matches are the same speed with and without the cut), b) when there is just one 'test' to backtrack over, the cost of the extra bookkeeping is quite high (cut1b is 36% slower than the good case, while uncut1b is only 8% slower), but c) the cost is linear rather than quadratic (as the number of 'test's increases, the 'cut<n>b' version gives away a fairly steady 36% to 44% to the 'cut<n>g' version, while the 'uncut<n>b' version gets slower much more dramatically, ending up over 99% slower at 1000 copies).
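For anyone who wants to check the percentages, here is a small sketch that derives them from the rates reported above. The hash and the slowdown_pct helper are just names I picked for illustration; the numbers are copied from the results table.

```perl
use strict;
use warnings;

# Rates (iterations per second) copied from the benchmark results.
my %rate = (
    uncut1g    => 12799.05, cut1g    => 12799.05,
    uncut1b    => 11821.50, cut1b    =>  8219.27,
    uncut10g   => 11486.54, cut10g   => 11486.54,
    uncut10b   =>  2559.05, cut10b   =>  7244.34,
    uncut100g  =>  5743.27, cut100g  =>  5743.27,
    uncut100b  =>   125.00, cut100b  =>  3381.13,
    uncut1000g =>   956.48, cut1000g =>   956.48,
    uncut1000b =>     1.85, cut1000b =>   532.38,
);

# Percentage by which the failing ("b") case is slower than the
# matching ("g") case for a given variant and count.
sub slowdown_pct {
    my ($cut, $count) = @_;
    my ($good, $bad) = @rate{ "$cut${count}g", "$cut${count}b" };
    return 100 * (1 - $bad / $good);
}

for my $count (1, 10, 100, 1000) {
    printf "%4d: cut %5.1f%%  uncut %5.1f%%\n",
        $count, slowdown_pct('cut', $count), slowdown_pct('uncut', $count);
}
```

The cut column moves only a few points across three orders of magnitude, while the uncut column climbs towards 100%.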

Hugo

In reply to Re^3: one line regexp problem by hv
in thread one line regexp problem by hoodlooms
