Here's another thought on the problem. This code reads the input line by line into a buffer and allows for start and stop patterns appearing anywhere in a line. While searching for a start pattern, the buffer is trimmed so that it cannot grow without limit. While searching for a stop pattern it is not trimmed, so if no stop pattern is found, the remainder of the file ends up in the buffer.

A line break may appear within a start or stop pattern as long as its position can be specified in the pattern; it cannot appear at random positions. If such random line breaks (really, record separators, which are usually newlines) do appear, the only way I can think of to deal with them is to delete them, e.g., with chomp, and treat the whole file as a single, unbroken line.
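To illustrate the difference between a predictable and a random line break, here is a minimal sketch. The two-word marker BEGIN DATA and the sample strings are hypothetical and not taken from the original problem; the point is only that \s+ in the pattern tolerates a break at a known position, while breaks in the middle of a word have to be deleted before matching.

```perl
use strict;
use warnings;

# Hypothetical two-word start marker. The words may be separated by
# any whitespace, including a line break, because the pattern says so.
my $start = qr{ BEGIN \s+ DATA }xms;

my $predictable = "junk BEGIN\nDATA payload";    # break where \s+ allows it
my $random      = "junk BEG\nIN DA\nTA payload"; # breaks inside the words

my $hit_predictable = $predictable =~ $start ? 1 : 0;
my $hit_random      = $random      =~ $start ? 1 : 0;

# Fallback for random breaks: delete the newlines, then match.
(my $joined = $random) =~ tr/\n//d;
my $hit_joined = $joined =~ $start ? 1 : 0;

print "$hit_predictable $hit_random $hit_joined\n";   # prints "1 0 1"
```

Deleting the newlines is what the chomp in the code below does one line at a time.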

The start and stop regexes in the code example are defined using literal character strings, but they may be any regex. Note that the definition of the full-pattern regex has changed from
qr{ $start $not_start* $stop }xms
to
qr{ $start $not_start*? $stop }xms
(addition of the ? lazy quantifier modifier) so that a match ends at the first stop sequence after a start rather than running on through later stop sequences, should more than one be present.
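As a quick check of that change, the sketch below runs both versions of the full-pattern regex against a made-up string containing one start and two stops. The sample text and the variable names $greedy and $lazy are mine, for illustration only.

```perl
use strict;
use warnings;

my $start     = qr{ START         }xms;
my $not_start = qr{ (?! $start) . }xms;
my $stop      = qr{ STOP          }xms;

my $greedy = qr{ $start $not_start*  $stop }xms;   # old definition
my $lazy   = qr{ $start $not_start*? $stop }xms;   # new definition

# Hypothetical sample: one start followed by two stops.
my $text = 'START a STOP b STOP';

my ($greedy_match) = $text =~ m{ ($greedy) }xms;
my ($lazy_match)   = $text =~ m{ ($lazy)   }xms;

print "greedy: '$greedy_match'\n";   # greedy: 'START a STOP b STOP'
print "lazy:   '$lazy_match'\n";     # lazy:   'START a STOP'
```

The greedy form swallows the first STOP and only ends at the last one; the lazy form ends at the first STOP it can reach.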

use warnings;
use strict;

# maximum possible length of start pattern (and then some).
use constant MAX_START => 10_000;


my $start     = qr{ START         }xms;
my $not_start = qr{ (?! $start) . }xms;
my $stop      = qr{ STOP          }xms;

my $full_pattern = qr{ $start $not_start*? $stop }xms;

my $buffer = '';
my $searching_for_start = 1;

LINE:
while (defined(my $line = <DATA>)) {

    # extracted string WON'T include record seps (usually newlines).
    # comment out if start/stop patterns will not be broken
    # at random across multiple lines.
    chomp $line;

    $buffer .= $line;

    # # looking for start pattern in buffer?
    # if ($searching_for_start) {  # works
    #     # turn flag off if start pattern found.
    #     if ($searching_for_start = $buffer !~ $start) {
    #         # still looking?  limit buffer, search next line.
    #         $buffer = substr $buffer, -MAX_START();
    #         next LINE;
    #         }
    #     }

    # looking for start pattern in buffer and we don't find it?
    if ($searching_for_start &&= $buffer !~ $start) {
        # limit buffer, keep looking.
        $buffer = substr $buffer, -MAX_START();
        next LINE;
        }

    # found start pattern (maybe more than one) in buffer.
    # look for stop pattern (also maybe more than one).
    next LINE unless $buffer =~ $stop;

    # got at least 1 start and stop in buffer.
    # however, buffer may contain STOP then START (inverted order).
    # extract all full patterns, if any.
    my @full_patterns = $buffer =~ m{ $full_pattern }xmsg;

    # keep looking unless we get at least 1 full pattern.
    next LINE unless @full_patterns;

    # clip last full pattern and all before it from buffer.
    $buffer =~ s{ \A .* ($full_pattern) }{}xms;

    # then process extracted pattern(s)...
    process(@full_patterns);

    # ... and go back to searching for start pattern.
    $searching_for_start = 1;

    }

sub process { print "'$_' \n" for @_ }


__DATA__
START
blah
blahblah
START
blah ah
other random stuff
START
blah ha
other random stuff
STOP
asdf
START
STOP
foo
bar
START
fdas
STOP
baz
a START b START c
d e START yes
S
T
OP xxx S
TART xx STAR
T x xx xxx ST
ART oh
yes
STOP
S
T
A
R
T
S
T
O
P

S
T
A
R
T

yes yes yes S
T
O
P
STARTSTOP
START yes STOP
START yes1 STOP START yes2 STOP xxx START yes3 STOP
STOP xxx STOP xxx START xxx START yes4
yes5 STOP  xxx STOP xxx STOP xxx

Output:

'STARTblah haother random stuffSTOP'
'STARTSTOP'
'STARTfdasSTOP'
'START yesSTOP'
'START ohyesSTOP'
'STARTSTOP'
'STARTyes yes yes STOP'
'STARTSTOP'
'START yes STOP'
'START yes1 STOP'
'START yes2 STOP'
'START yes3 STOP'
'START yes4yes5 STOP'

In reply to Re: pattern matching (greedy, non-greedy,...) by AnomalousMonk
in thread pattern matching (greedy, non-greedy,...) by cacophony777
