Here's another thought on the problem. This processes on a buffered line-by-line basis and takes account of the fact that start and stop patterns may appear anywhere. The buffer is prevented from growing without limit while searching for a start pattern, but while searching for a stop pattern, the remainder of the file will be consumed if one is not found.
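The buffer-limiting idea can be seen in isolation with a minimal sketch. It assumes a tiny hypothetical limit of 5 characters (the code in this post uses 10_000) and a made-up two-line input in which the start pattern is split across a line boundary. Only the tail of the buffer is kept while no start pattern has been seen, so a partial start pattern at the end of one line can still be completed by the next.

```perl
use strict;
use warnings;

# hypothetical small limit so the trimming is visible.
use constant MAX_START => 5;

my $start  = qr{ START }xms;
my $buffer = '';
my $found_at_line;

# hypothetical input: 'START' is split across two lines.
my @lines = ('junk junk junk ST', 'ART payload');

for my $i (0 .. $#lines) {
    $buffer .= $lines[$i];

    if ($buffer =~ $start) {
        $found_at_line = $i;
        last;
    }

    # no start pattern yet: keep only the tail that could still
    # hold the beginning of a start pattern.
    $buffer = substr $buffer, -MAX_START() if length($buffer) > MAX_START;
}

print "found at line index $found_at_line, buffer: '$buffer'\n";
```

The limit only has to be at least as long as the longest possible start pattern; anything before that tail can never become part of a match.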
A line break may appear within a start or stop pattern as long as it can be clearly specified within the pattern; a line break cannot appear at random within a pattern. If such random line breaks (really, record separators, which are usually newlines) appear, the only way I can think of to deal with them is to delete them, e.g., with a chomp, and treat the whole file as a single, unbroken line.
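As a rough illustration of a line break that is "clearly specified within the pattern", the sketch below assumes a hypothetical two-word start marker, BEGIN BLOCK, whose words may be separated by any whitespace, including a newline. The marker and the sample strings are invented for the example; they are not part of the code in this post.

```perl
use strict;
use warnings;

# hypothetical start marker: the \s+ explicitly allows a line break
# (or any other whitespace) between the two words.
my $start = qr{ BEGIN \s+ BLOCK }xms;

my $same_line  = "xx BEGIN BLOCK yy";
my $split_line = "xx BEGIN\nBLOCK yy";   # record separator kept (no chomp)
my $random     = "xx BEG\nIN BLOCK yy";  # break at a place the pattern does not allow

my @results = map { $_ =~ $start ? 1 : 0 } $same_line, $split_line, $random;

print "same line: $results[0], split at allowed place: $results[1], ",
      "split at random place: $results[2]\n";
```

The third string is the "random" case: the only ways to match it are to remove the record separators first or to allow for a break between every pair of characters in the pattern.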
The start and stop regexes in the code example are defined using literal character strings, but they may be any regex.
Note the definition of the full-pattern regex has changed from
    qr{ $start $not_start* $stop }xms
to
    qr{ $start $not_start*? $stop }xms
(addition of ? lazy quantifier modifier) to prevent the regex including multiple stop sequences, should they be present.
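The effect of the lazy modifier can be checked with a small sketch. The sample string (one START followed by two STOPs) and the variable names $greedy and $lazy are made up for the illustration; the sub-patterns are the ones used in this post.

```perl
use strict;
use warnings;

my $start     = qr{ START }xms;
my $not_start = qr{ (?! $start) . }xms;
my $stop      = qr{ STOP }xms;

my $greedy = qr{ $start $not_start*  $stop }xms;   # old definition
my $lazy   = qr{ $start $not_start*? $stop }xms;   # new definition

# hypothetical sample: one START followed by two STOPs.
my $text = 'START yes STOP no STOP';

my ($greedy_match) = $text =~ m{ ($greedy) }xms;
my ($lazy_match)   = $text =~ m{ ($lazy)   }xms;

print "greedy: '$greedy_match'\n";   # runs on to the last STOP
print "lazy:   '$lazy_match'\n";     # ends at the first STOP
```

The greedy version swallows the first STOP because nothing in $not_start forbids it; the lazy version tries $stop as early as possible.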

    use warnings;
    use strict;

    # maximum possible length of start pattern (and then some).
    use constant MAX_START => 10_000;

    my $start     = qr{ START }xms;
    my $not_start = qr{ (?! $start) . }xms;
    my $stop      = qr{ STOP }xms;

    my $full_pattern = qr{ $start $not_start*? $stop }xms;

    my $buffer = '';
    my $searching_for_start = 1;

    LINE:
    while (defined(my $line = <DATA>)) {

        # extracted string WON'T include record seps (usually newlines).
        # comment out if start/stop patterns will not be broken
        # at random across multiple lines.
        chomp $line;

        $buffer .= $line;

        # # looking for start pattern in buffer?
        # if ($searching_for_start) {  # works
        #     # turn flag off if start pattern found.
        #     if ($searching_for_start = $buffer !~ $start) {
        #         # still looking?  limit buffer, search next line.
        #         $buffer = substr $buffer, -MAX_START();
        #         next LINE;
        #     }
        # }

        # looking for start pattern in buffer and we don't find it?
        if ($searching_for_start &&= $buffer !~ $start) {
            # limit buffer, keep looking.
            $buffer = substr $buffer, -MAX_START();
            next LINE;
        }

        # found start pattern (maybe more than one) in buffer.
        # look for stop pattern (also maybe more than one).
        next LINE unless $buffer =~ $stop;

        # got at least 1 start and stop in buffer.
        # however, buffer may contain STOP then START (inverted order).
        # extract all full patterns, if any.
        my @full_patterns = $buffer =~ m{ $full_pattern }xmsg;

        # keep looking unless we get at least 1 full pattern.
        next LINE unless @full_patterns;

        # clip last full pattern and all before it from buffer.
        $buffer =~ s{ \A .* ($full_pattern) }{}xms;

        # then process extracted pattern(s)...
        process(@full_patterns);
        # ... and go back to searching for start pattern.
        $searching_for_start = 1;
    }

    sub process { print "'$_' \n" for @_ }

    __DATA__
    START
    blah blahblah
    START
    blah ah
    other random stuff
    START
    blah ha
    other random stuff
    STOP
    asdf
    START
    STOP
    foo bar
    START
    fdas
    STOP
    baz
    a START b START c d e START yes
    S
    T
    OP
    xxx S
    TART xx STAR
    T x
    xx
    xxx ST
    ART oh
    yes
    STOP
    S
    T
    A
    R
    T
    S
    T
    O
    P
    S
    T
    A
    R
    T
    yes yes yes S
    T
    O
    P
    STARTSTOP
    START yes STOP
    START yes1 STOP START yes2 STOP xxx START yes3 STOP STOP xxx STOP
    xxx START xxx START yes4
    yes5 STOP xxx STOP xxx
    STOP xxx

Output:

    'STARTblah haother random stuffSTOP'
    'STARTSTOP'
    'STARTfdasSTOP'
    'START yesSTOP'
    'START ohyesSTOP'
    'STARTSTOP'
    'STARTyes yes yes STOP'
    'STARTSTOP'
    'START yes STOP'
    'START yes1 STOP'
    'START yes2 STOP'
    'START yes3 STOP'
    'START yes4yes5 STOP'

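The extract-and-clip step can also be tried on its own. The sketch below uses a made-up buffer holding a stray STOP, two full patterns, and an unfinished START; the regexes and the two statements are the same as in the loop, and the buffer contents are the only assumption.

```perl
use strict;
use warnings;

my $start        = qr{ START }xms;
my $not_start    = qr{ (?! $start) . }xms;
my $stop         = qr{ STOP }xms;
my $full_pattern = qr{ $start $not_start*? $stop }xms;

# hypothetical buffer: stray STOP, two full patterns, unfinished START.
my $buffer = 'STOP a START b STOP c START d STOP e START f';

# extract every full pattern...
my @full_patterns = $buffer =~ m{ $full_pattern }xmsg;

# ... then clip the last full pattern and everything before it.
$buffer =~ s{ \A .* ($full_pattern) }{}xms;

print "'$_'\n" for @full_patterns;
print "left in buffer: '$buffer'\n";
```

What remains in the buffer still contains the unfinished START, so the next pass through the loop finds it again once more lines have been appended.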
In reply to Re: pattern matching (greedy, non-greedy,...) by AnomalousMonk, in thread pattern matching (greedy, non-greedy,...) by cacophony777