Here's another thought on the problem. This approach reads the input line by line into a buffer and allows for start and stop patterns appearing anywhere. While searching for a start pattern, the buffer is kept from growing without limit; while searching for a stop pattern, however, the rest of the file will be read into the buffer if no stop pattern is found.
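
To make the buffer limit concrete, here is a minimal sketch of just the "searching for start" step. The tiny MAX_START value, the two sample lines and the @log array are made up for illustration; the point is only that keeping the tail of the buffer is enough to catch a start pattern that straddles a line boundary.

```perl
use strict;
use warnings;

# hypothetical small limit: the length of the literal 'START'.
use constant MAX_START => 5;

my $start  = qr{ START }xms;
my $buffer = '';
my @log;

# two made-up input lines; the start pattern is split across them.
for my $line ("junk junk S\n", "TART payload\n") {
    chomp(my $copy = $line);
    $buffer .= $copy;

    if ($buffer !~ $start) {
        # no start pattern yet: keep only the tail of the buffer.
        $buffer = substr $buffer, -MAX_START();
        push @log, "trimmed to '$buffer'";
        next;
    }

    push @log, "found start in '$buffer'";
}

print "$_\n" for @log;
```

The first line leaves only a short tail in the buffer, and the second line completes the start pattern against that tail.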

A line break may appear within a start or stop pattern as long as it can be clearly specified within the pattern; a line break cannot appear at random within a pattern. If such random line breaks (really, record separators, which are usually newlines) appear, the only way I can think of to deal with them is to delete them, e.g., with a chomp, and treat the whole file as a single, unbroken line.
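
As a rough sketch of the "delete them and treat everything as one unbroken line" idea, assuming the record separator is a newline and the whole input fits in memory (the sample string is made up):

```perl
use strict;
use warnings;

my $start        = qr{ START }xms;
my $stop         = qr{ STOP }xms;
my $not_start    = qr{ (?! $start) . }xms;
my $full_pattern = qr{ $start $not_start*? $stop }xms;

# hypothetical input in which line breaks split both patterns at random.
my $text = "xx STA\nRT some\nthing ST\nOP yy\n";

# delete every record separator (assumed here to be a newline)...
(my $unbroken = $text) =~ tr{\n}{}d;

# ... then match against the single unbroken string.
my @found = $unbroken =~ m{ $full_pattern }xmsg;

print "'$_'\n" for @found;    # 'STARTsomething STOP'
```

Note that deleting the separators also glues together words that were only separated by a line break, so the extracted text is not identical to the original.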

The start and stop regexes in the code example are defined using literal character strings, but they may be any regex. Note the definition of the full-pattern regex has changed from
    qr{ $start $not_start* $stop }xms
to
    qr{ $start $not_start*? $stop }xms
(the added  ?  makes the quantifier lazy) so that a match ends at the first stop sequence instead of running on through later ones, should several be present.
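
A small side-by-side sketch of the two definitions, using a made-up string with one start and two stop sequences:

```perl
use strict;
use warnings;

my $start     = qr{ START }xms;
my $stop      = qr{ STOP }xms;
my $not_start = qr{ (?! $start) . }xms;

my $greedy = qr{ $start $not_start*  $stop }xms;
my $lazy   = qr{ $start $not_start*? $stop }xms;

# hypothetical sample: one start followed by two stops.
my $text = 'xx START a STOP b STOP yy';

my ($greedy_match) = $text =~ m{ ($greedy) }xms;
my ($lazy_match)   = $text =~ m{ ($lazy)   }xms;

print "greedy: '$greedy_match'\n";    # 'START a STOP b STOP'
print "lazy:   '$lazy_match'\n";      # 'START a STOP'
```

The greedy version runs to the last stop sequence it can reach without crossing another start; the lazy version stops at the first.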

    use warnings;
    use strict;

    # maximum possible length of start pattern (and then some).
    use constant MAX_START => 10_000;

    my $start        = qr{ START }xms;
    my $not_start    = qr{ (?! $start) . }xms;
    my $stop         = qr{ STOP }xms;
    my $full_pattern = qr{ $start $not_start*? $stop }xms;

    my $buffer = '';
    my $searching_for_start = 1;

    LINE:
    while (defined(my $line = <DATA>)) {

        # extracted string WON'T include record seps (usually newlines).
        # comment out if start/stop patterns will not be broken
        # at random across multiple lines.
        chomp $line;

        $buffer .= $line;

        # # looking for start pattern in buffer?
        # if ($searching_for_start) {  # works
        #     # turn flag off if start pattern found.
        #     if ($searching_for_start = $buffer !~ $start) {
        #         # still looking?  limit buffer, search next line.
        #         $buffer = substr $buffer, -MAX_START();
        #         next LINE;
        #     }
        # }

        # looking for start pattern in buffer and we don't find it?
        if ($searching_for_start &&= $buffer !~ $start) {
            # limit buffer, keep looking.
            $buffer = substr $buffer, -MAX_START();
            next LINE;
        }

        # found start pattern (maybe more than one) in buffer.
        # look for stop pattern (also maybe more than one).
        next LINE unless $buffer =~ $stop;

        # got at least 1 start and stop in buffer.
        # however, buffer may contain STOP then START (inverted order).
        # extract all full patterns, if any.
        my @full_patterns = $buffer =~ m{ $full_pattern }xmsg;

        # keep looking unless we get at least 1 full pattern.
        next LINE unless @full_patterns;

        # clip last full pattern and all before it from buffer.
        $buffer =~ s{ \A .* ($full_pattern) }{}xms;

        # then process extracted pattern(s)...
        process(@full_patterns);

        # ... and go back to searching for start pattern.
        $searching_for_start = 1;
    }

    sub process { print "'$_' \n" for @_ }

    __DATA__
    START
    blah blahblah
    START
    blah ah
    other random stuff
    START
    blah ha
    other random stuff
    STOP
    asdf
    START
    STOP
    foo bar
    START
    fdas
    STOP
    baz
    a START b START c d e START yes
    S
    T
    OP
    xxx S
    TART xx STAR
    T x xx xxx ST
    ART oh
    yes
    STOP
    S
    T
    A
    R
    T
    S
    T
    O
    P
    S
    T
    A
    R
    T
    yes yes yes S
    T
    O
    P
    STARTSTOP
    START yes STOP
    START yes1 STOP START yes2 STOP
    xxx START yes3 STOP STOP xxx STOP
    xxx START xxx START yes4
    yes5 STOP xxx STOP xxx STOP xxx

Output:

    'STARTblah haother random stuffSTOP'
    'STARTSTOP'
    'STARTfdasSTOP'
    'START yesSTOP'
    'START ohyesSTOP'
    'STARTSTOP'
    'STARTyes yes yes STOP'
    'STARTSTOP'
    'START yes STOP'
    'START yes1 STOP'
    'START yes2 STOP'
    'START yes3 STOP'
    'START yes4yes5 STOP'

In reply to Re: pattern matching (greedy, non-greedy,...) by AnomalousMonk
in thread pattern matching (greedy, non-greedy,...) by cacophony777
