liverpaul has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I was wondering if it's possible to use regexp to match a pattern over multiple lines. For example, the file I'm processing contains:
<homePrice>
1.91</homePrice>
To try to extract the number 1.91, I've tried the following regexp without success: <homePrice>\n(\d*\.\d*)<\/homePrice> Viewing the file in Padre IDE for Windows, I can see that each line ends with CRLF, maybe that's relevant? Any ideas? Thanks :-)

Re: regexp over multiple lines
by davido (Cardinal) on Aug 03, 2011 at 07:14 UTC

    Yes, it's possible. First you have to slurp in your input (i.e., don't read it line by line). Next, you need to set the /s modifier, and possibly /m, on your regexp. /s tells the RE engine that '.' should match any character, including a newline. /m tells the RE engine that ^ and $ should match at the beginning and end of each line rather than only at the beginning and end of the string (that's what \A and \Z are for).
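As a rough illustration of the slurp-then-match idea, here is a minimal sketch. The in-memory filehandle and the variable names are assumptions made so the example runs without a real file; they are not taken from the original poster's program.

```perl
use strict;
use warnings;

# Sample data with CRLF line endings, as described in the question.
my $content = "<homePrice>\r\n1.91</homePrice>\r\n";

# An in-memory filehandle stands in for the real file (an assumption for this sketch).
open my $fh, '<', \$content or die "Cannot open in-memory file: $!";

# Slurp: with $/ locally undefined, a single read returns the whole file.
my $text = do { local $/; <$fh> };
close $fh;

# /s lets '.' cross line breaks; the non-greedy .*? stops at the first closing tag.
my ($home_price) = $text =~ m{<homePrice>(.*?)</homePrice>}s;

# Trim the line-ending characters that were captured along with the number.
$home_price =~ s/^\s+|\s+$//g;

print "$home_price\n";   # prints: 1.91
```

Using "local $/" inside a do block keeps the change to the record separator confined to that one read, so line-by-line reads elsewhere in the program are unaffected.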

    Also keep in mind that quantifiers such as * and + are greedy, so .* probably won't do what you want it to do when you hand it the following:

    <tag>asdf</tag><tag>ghjkl</tag>

    Unless what you want is for it to be greedy.

    my $string = "<tag>asdf</tag><tag>ghjkl</tag>";
    if( $string =~ m{<tag>(.*)</tag>} ) {
        print $1, "\n";
    }
    __END__
    asdf</tag><tag>ghjkl

    Woops!
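For comparison, a small sketch of the two usual ways to rein in the greedy match: the non-greedy quantifier .*? and a negated character class. The variable names are only for illustration.

```perl
use strict;
use warnings;

my $string = "<tag>asdf</tag><tag>ghjkl</tag>";

# Greedy: .* grabs as much as it can, so the match runs to the LAST </tag>.
my ($greedy) = $string =~ m{<tag>(.*)</tag>};

# Non-greedy: .*? grabs as little as it can, so the match stops at the FIRST </tag>.
my ($lazy) = $string =~ m{<tag>(.*?)</tag>};

# A negated character class is another common way to stop at the next tag.
my ($negated) = $string =~ m{<tag>([^<]*)</tag>};

print "greedy:  $greedy\n";    # greedy:  asdf</tag><tag>ghjkl
print "lazy:    $lazy\n";      # lazy:    asdf
print "negated: $negated\n";   # negated: asdf
```

The negated class only works when the captured text cannot itself contain a "<", which holds for simple values like a price but not for nested markup.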

    Now introducing perlre! :)

    Expect some replies telling you to use a proper XML parser, such as XML::Twig, XML::Parser, XML::LibXML, XML::Simple, etc. And they're right. Better to let a well tested solution do the work for you.


    Dave

      Thanks for the reply :-) I'm a novice in Perl so I went about this in a very different way! Instead of using an XML parser (which I wasn't aware of), I processed each XML file by replacing each ">" with ">\n" so that I ended up with a file with multiple lines instead of everything on just one line. Since my program has to parse data from XML files and normal HTML files, I would like to avoid using an XML parser because my code is set up the wrong way. I'm going to have to read your advice a few more times because it doesn't immediately make sense to me. I'll try a few tests in my program to see if I can get things working and increase my understanding. I'll post back here for further help :-)

        The monks have been helping you solve the individual problem that you've defined, but missed the very important point you made here: that your files are XML. Using an XML parser, whether XML::Twig, XML::Simple, XML::LibXML, or something else, is THE way to process XML files (and to head off the argument: Yes, even you can use CPAN). Trying to do so via regular expressions is simply madness. You're recreating tools that have already been created and debugged, and replacing them with half-baked code that will no doubt miss many edge cases.

      I will fix the greedy/non-greedy issue. I process the file line by line in a for loop. This is necessary because I sometimes need to check a few lines ahead. If I'm forced to process the file line by line instead of "slurp"ing the file, does that mean I can't regexp over multiple lines?
        If I'm forced to process the file line by line instead of "slurp"ing the file, does that mean I can't regexp over multiple lines?
        If you are visiting your relatives one by one, are you having a family reunion?

        Unless you concatenate the lines yourself, it's not going to work. There's no magic in the regular expression engine that says, "hmmm, I'm not going to match this line, I'm just going to read one more line from the input to see whether it matches now".
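A minimal sketch of "concatenating the lines yourself" while still reading line by line. The sample input, the in-memory filehandle, and the $buffer variable are assumptions for illustration, and the sketch ignores cases such as two elements sharing one line.

```perl
use strict;
use warnings;

# Hypothetical sample input; an in-memory filehandle stands in for the real file.
my $content = "line1data\n<homePrice>\n1.91</homePrice>\nline4data\n";
open my $fh, '<', \$content or die "Cannot open in-memory file: $!";

my @prices;
my $buffer = '';    # holds an element that is still waiting for its closing tag

while ( my $line = <$fh> ) {
    # Start buffering at an opening tag, or keep buffering if already inside one.
    next unless length($buffer) or $line =~ /<homePrice>/;
    $buffer .= $line;

    # Once the closing tag has arrived, the regexp can see both lines at once.
    if ( $buffer =~ m{<homePrice>\s*(.+?)\s*</homePrice>}s ) {
        push @prices, $1;
        $buffer = '';
    }
}
close $fh;

print "@prices\n";   # prints: 1.91
```

The regexp never reads ahead on its own; the loop simply keeps appending lines to $buffer until the pattern has everything it needs.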

Re: regexp over multiple lines
by Marshall (Canon) on Aug 03, 2011 at 08:34 UTC
    Here is some code to do what you want:
    #!/usr/bin/perl -w
    use strict;

    $/=undef; # undefines the record separator
              # which is by default \n
              # this means that there is no "line"
              # separator

    my $bigString = <DATA>; # would normally read one "line"
                            # but since record separator is undefined
                            # it reads all the data as a single string
                            # this is what "slurp" the file means

    my @prices = $bigString =~ m|<homePrice>\s*(.+?)\s*</homePrice>|ig;
    print "@prices"; # prints: 1.91 295.3 KEuro

    __DATA__
    <homePrice>
    1.91</homePrice>
    <balh></balh><homePrice>295.3 KEuro</homePrice>
    The regex term \s* means zero or more whitespace characters. For ASCII text there are five of them: space, \n, \r, \t, \f (space, new line, carriage return, tab, form feed); Perl 5.18 and later also count the vertical tab. So this code just ignores any spaces or end-of-line characters it sees (they are optional: zero or more is OK).

    The (.+?) means one or more of any character (except a newline, since /s is not set), but "calm your greediness down!": don't keep going, but stop capturing as soon as the term after the (.+?) can match. A "greedy match" would keep going until it saw the last possible match of that next term.

    The /g switch means "match globally": keep going and, in list context, return all the matches to the list on the left of the assignment. The /i is not needed here, but it means ignore case.
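To make the /g and /i behaviour a little more concrete, here is a small sketch. The mixed-case input is hypothetical and was made up only to show what /i changes.

```perl
use strict;
use warnings;

# Hypothetical input with mixed-case tags, to show what /i changes.
my $text = "<homePrice>\n1.91</homePrice>\n<HOMEPRICE>2.05</HOMEPRICE>\n";

# List context with /g: every capture is returned at once.
my @all = $text =~ m|<homePrice>\s*(.+?)\s*</homePrice>|ig;

# Without /i the upper-case element no longer matches.
my @case_sensitive = $text =~ m|<homePrice>\s*(.+?)\s*</homePrice>|g;

# Scalar context with /g: each iteration resumes where the previous match ended.
my @one_by_one;
while ( $text =~ m|<homePrice>\s*(.+?)\s*</homePrice>|ig ) {
    push @one_by_one, $1;
}

print "@all\n";              # 1.91 2.05
print "@case_sensitive\n";   # 1.91
print "@one_by_one\n";       # 1.91 2.05
```

The while-loop form is handy when each match needs its own processing before moving on to the next one.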

    This \n stuff is more complicated to explain than it is to use. Basically, Perl will almost always do what you expect with files that are native to your platform. When reading a text file, it translates your OS's native line ending into the single "\n" character, and when you write, it writes your OS-specific "\n" sequence.

    Unix uses just <line feed> to mean End-of-Line. Windows programs (and many Internet protocols) use <carriage return>, <line feed>, and classic Mac OS used just <carriage return>. When reading a native text file on your platform, Perl translates what it reads into a single \n character. The translation does not carry across platforms, though: a Perl program on Unix reading my Windows file will see "\r\n" at the end of each line unless the :crlf layer is applied or the \r is stripped by hand. A pattern built on \s*, like the one above, copes with either ending because \s matches \r as well as \n.
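If the "\r" characters do end up in the string (for example when a CRLF file is read on Unix, or through a binmode filehandle), the pattern has to allow for them. The sketch below assumes that situation; the sample string is made up to mimic the file in the question.

```perl
use strict;
use warnings;

# A string holding Windows (CRLF) line endings, as it would look if the file
# were read without any CRLF translation.
my $text = "<homePrice>\r\n1.91</homePrice>\r\n";

# The pattern from the question expects "\n" straight after the tag, but "\r" is there.
my $original_ok = $text =~ m{<homePrice>\n(\d*\.\d*)</homePrice>} ? 1 : 0;

# Allowing an optional "\r" handles both Unix and Windows endings.
my ($with_optional_cr) = $text =~ m{<homePrice>\r?\n(\d*\.\d*)</homePrice>};

# \R (Perl 5.10 and later) matches any common line break, including "\r\n".
my ($with_R) = $text =~ m{<homePrice>\R(\d*\.\d*)</homePrice>};

print "original pattern matched: $original_ok\n";   # original pattern matched: 0
print "optional CR: $with_optional_cr\n";           # optional CR: 1.91
print "\\R: $with_R\n";                             # \R: 1.91
```

Either form works on files with plain "\n" endings too, so the same pattern can be used regardless of where the file came from.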

      Thanks for such a detailed reply, that helps a lot. I even understood some of it! :-)

      I'm beginning to see that "slurping" the file is the way to go. It would make my code less complicated and messy. From what I understand, it would also allow me to see ahead by using a regexp over several lines.

      My problem is that my current code works for all the files that need processing, except this one. So, although eventually I will find the time to change it, for now I think I'm going to try to concatenate a few of the lines to hopefully achieve my goal. It may be messy, but it's the method that's least likely to mess up other areas of my code, I think.

        Consider this scenario:

        You have a contract to build a 121 story office tower. You've had problems excavating deep enough to put in the foundation. It's been a messy job but you've gotten close.

        Now, you've started pouring footings and foundation... and in fact, have managed to get the steel up for the first few stories above ground.

        That's your code to date.

        But today, your consultant -- the engineer -- notices that the walls are off plumb -- are tilting, out of whack. They ascertain that your footings and foundation are NOT on bedrock.

        Do you charge onward, to see how many stories up you can go before the whole enterprise crashes?

        Unless this is a one-off project, it's going to cost less to tear down what you've done, and get the footings right before continuing.

Re: regexp over multiple lines
by blindluke (Hermit) on Aug 03, 2011 at 07:18 UTC

    Use the /m modifier. This will treat the string as multiple lines, allowing you to use \n characters as part of your regexp.

    See perlre for details and an example. Also, if you want to parse XML, you could look for a suitable module, it's usually a better way than to parse the file yourself using regexps.

    Luke Jefferson

      Use the /m modifier. This will treat the string as multiple lines, allowing you to use \n characters as part of your regexp.
      Bullshit.

      You don't need any modifier to use \n in your regexp. The /m modifier will change the meaning of ^ and $, irrelevant for the OP. He may need the /s modifier, which makes the dot match a newline.
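A short sketch that checks each of these claims separately; the sample string and variable names are invented for the demonstration.

```perl
use strict;
use warnings;

my $text = "<homePrice>\n1.91</homePrice>\nend\n";

# A literal \n in the pattern needs no modifier at all.
my $plain_newline = $text =~ /<homePrice>\n1\.91/ ? 1 : 0;

# Without /s the dot refuses to cross the newline; with /s it does.
my $dot_without_s = $text =~ /<homePrice>.1\.91/  ? 1 : 0;
my $dot_with_s    = $text =~ /<homePrice>.1\.91/s ? 1 : 0;

# Without /m, ^ only matches at the very start of the string; with /m it
# also matches after every newline.
my $caret_without_m = $text =~ /^end/  ? 1 : 0;
my $caret_with_m    = $text =~ /^end/m ? 1 : 0;

print "$plain_newline $dot_without_s $dot_with_s $caret_without_m $caret_with_m\n";
# prints: 1 0 1 0 1
```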

        Thanks for pointing that out, it seems I got it all wrong. Apologies to liverpaul, if my response was misleading.

        Thanks for the reply :-)

        So, if my file looked like this:

        line1data
        line2data
        line3data
        line4data
        <homePrice>
        1.91</homePrice>
        line7data
        line8data
        line9data

        ...and I was forced to process the file line by line in a for loop, what regexp would I use (if any are possible) to extract the value of 1.91 by matching both lines?
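One possible approach, sketched under the assumption that the lines are already held in an array (called @lines here purely for illustration) and that the element never spans more than two lines: peek at the next array entry and join it to the current one before matching.

```perl
use strict;
use warnings;

# The sample file from the post, one array entry per line.
my @lines = (
    "line1data\n", "line2data\n", "line3data\n", "line4data\n",
    "<homePrice>\n", "1.91</homePrice>\n",
    "line7data\n", "line8data\n", "line9data\n",
);

my @prices;
for my $i ( 0 .. $#lines ) {
    my $candidate = $lines[$i];

    # If the element opens here but does not close, glue on the next line.
    if (    $candidate =~ /<homePrice>/
        and $candidate !~ m{</homePrice>}
        and $i < $#lines )
    {
        $candidate .= $lines[ $i + 1 ];
    }

    # The regexp now sees both lines as a single string.
    push @prices, $1 if $candidate =~ m{<homePrice>\s*(.+?)\s*</homePrice>};
}

print "@prices\n";   # prints: 1.91
```

No single-line regexp can match both lines, so the joining step is what makes this work; the pattern itself is the same one used on the slurped file.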