Re: regex question: store multiple lines as a string

I wonder whether the /s and/or /m modifiers might be useful here.

If the file is “quite large,” as I assume it is, one strategy might be to take each line as it is read and, first, append it, with its newline, to a buffer string. Then, repeatedly match the regex against that string using the /m and /p modifiers. Each time the string matches, extract the matched portion using ${^MATCH} (“what matched”), then set the string to ${^POSTMATCH} (“what follows”). Repeat this until the pattern no longer matches. Something like this:

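  my $buffer = "";
  my $line;                               # declared out here so the loop condition can see it
  do {
    $line = <$fh>;
    $buffer .= $line if defined($line);   # i.e. not end-of-file; $line keeps its own newline
    while ($buffer =~ /$pattern/mp) {
      process(${^MATCH});                 # what matched
      $buffer = ${^POSTMATCH};            # what follows
    }
  } while (defined($line));               # i.e. keep going until end-of-file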
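For what it's worth, here is a self-contained sketch of the same buffering idea that can be run as-is. The sample data, the in-memory filehandle, the BEGIN/END record shape and the @records array are all my own assumptions for illustration, not anything from the original question. It uses a plain while loop instead of do/while, on the theory that nothing new can match once the last line has been appended and scanned.

```perl
use strict;
use warnings;

# Hypothetical sample input; an in-memory filehandle stands in for the real file.
my $data = "junk\nBEGIN\nfoo\nEND\nmore junk\nBEGIN\nbar\nEND\n";
open(my $fh, '<', \$data) or die "cannot open in-memory file: $!";

# Assumed record shape: everything from BEGIN through the nearest END.
# /s lets '.' cross newlines; .*? keeps each match as short as possible.
my $pattern = qr/BEGIN.*?END/s;

my @records;
my $buffer = "";
while (defined(my $line = <$fh>)) {
    $buffer .= $line;                   # $line already ends in "\n"
    while ($buffer =~ /$pattern/p) {
        push @records, ${^MATCH};       # one complete record
        $buffer = ${^POSTMATCH};        # keep only the text after the match
    }
}
close($fh);

print scalar(@records), " records found\n";
```

One thing to watch: text that never becomes part of a match just piles up in the buffer, so a file with long stretches of non-matching lines would probably want some trimming logic as well.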

You need to be sure that your pattern is set up so that it is not “greedy.” By default, a regex quantifier such as * or + will match as much of the string as it can ... “always taking the biggest possible piece of the pie,” if you will. But you don’t want that to happen. If, at any time, the buffer contains more than one complete occurrence of whatever it is that you are looking for, you want to grab each one in turn. Let me explain...

Let’s say that you want to find whatever is between BEGIN and END in some string. And let’s say that our test-string, just for fun, consists of:
“BEGIN FOO END BEGIN BAR END.”

A “greedy” pattern, such as (say...) /BEGIN(.*)END/, would grab the longest possible substring that still permits the entire pattern to match, viz:
FOO END BEGIN BAR.

Because the regex went for the longest string, it grabbed everything that it found between the first occurrence of BEGIN and the last occurrence of END. This is obviously not what we want. But, if we put a '?' directly after the quantifier, it now grabs the shortest possible match. (Placement matters: (.*)? merely makes the group optional and stays greedy, whereas (.*?) is the non-greedy form.) A pattern such as /BEGIN(.*?)END/, applied repeatedly or with /g, would now match:

FOO the first time.
BAR the second.
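A quick way to see the difference is to run both forms against that test string. This is only a small sketch; the variable names are mine. Note that the captures come back with the surrounding spaces, since the spaces sit between the keywords too.

```perl
use strict;
use warnings;

my $text = "BEGIN FOO END BEGIN BAR END.";

# Greedy: .* runs on to the last END in the string.
my ($greedy) = $text =~ /BEGIN(.*)END/;

# Non-greedy: .*? stops at the nearest END.
# With /g in list context, every capture is returned.
my @lazy = $text =~ /BEGIN(.*?)END/g;

print "greedy: [$greedy]\n";        # greedy: [ FOO END BEGIN BAR ]
print "lazy:   [$_]\n" for @lazy;   # lazy:   [ FOO ] then lazy:   [ BAR ]
```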

(Caution: extemporaneous coding. There might be syntax errors. Do not try this at home.)