regexp over multiple lines

liverpaul has asked for the wisdom of the Perl Monks concerning the following question:

Re: regexp over multiple lines by davido (Cardinal) on Aug 03, 2011 at 07:14 UTC
Yes, it's possible. First you have to slurp in your input (i.e., don't read line by line). Next, you need to set the /s modifier, and possibly /m, on your regexp. /s tells the RE engine that '.' should match any character, including a newline. /m tells the RE engine that ^ and $ should match at the beginning and end of each line rather than only at the beginning and end of the string (that's what \A and \Z are for).

Also keep in mind that quantifiers such as `*` and `+` are greedy, so `.*` probably won't do what you want it to do when you hand it the following: `<tag>asdf</tag><tag>ghjkl</tag>`. Unless what you want is for it to be greedy.

    my $string = "<tag>asdf</tag><tag>ghjkl</tag>";
    if( $string =~ m{<tag>(.*)</tag>} ) {
        print $1, "\n";
    }
    __END__
    asdf</tag><tag>ghjkl

Woops! Now introducing perlre! :)

Expect some replies telling you to use a proper XML parser, such as XML::Twig, XML::Parser, XML::LibXML, XML::Simple, etc. And they're right. Better to let a well tested solution do the work for you.

Dave
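A minimal sketch of the greedy versus non-greedy difference described above, assuming the same `<tag>` data but with a newline inside the first element so that /s matters. The variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical input: the two <tag> elements, with a newline inside the first.
my $string = "<tag>as\ndf</tag><tag>ghjkl</tag>";

# Greedy: .* runs on to the LAST </tag>. /s lets '.' cross the newline.
my ($greedy) = $string =~ m{<tag>(.*)</tag>}s;

# Non-greedy: .*? stops at the FIRST </tag>; /g collects every match.
my @lazy = $string =~ m{<tag>(.*?)</tag>}sg;

print "greedy: $greedy\n";
print "lazy:   $_\n" for @lazy;
```

The greedy capture swallows the markup between the two elements, while the non-greedy version returns each element's content separately.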
Re^2: regexp over multiple lines by liverpaul (Acolyte) on Aug 03, 2011 at 07:29 UTC
Thanks for the reply :-) I'm a novice in Perl so I went about this in a very different way! Instead of using an XML parser (which I wasn't aware of), I processed each XML file by replacing each ">" with ">\n" so that I ended up with a file with multiple lines instead of everything on just one line. Since my program has to parse data from XML files and normal HTML files, I would like to avoid using an XML parser because my code is set up the wrong way. I'm going to have to read your advice a few more times because it doesn't immediately make sense to me. I'll try a few tests in my program to see if I can get things working and increase my understanding. I'll post back here for further help :-)
Re^3: regexp over multiple lines by Sinistral (Monsignor) on Aug 03, 2011 at 13:33 UTC
The monks have been helping you solve the individual problem that you've defined, but missed the very important point you made here - that your files are XML. Using an XML parser, whether XML::Twig, XML::Simple, XML::LibXML, or something else, is THE way to process XML files (and to head off the argument: Yes, even you can use CPAN). Trying to do so via regular expressions is simply madness. You're recreating tools that have already been created and debugged and replacing them with half baked code that will no doubt miss many edge cases.
Re^2: regexp over multiple lines by liverpaul (Acolyte) on Aug 03, 2011 at 07:56 UTC
I will fix the greedy/non-greedy issue. I process the file line by line in a for loop. This is necessary because I sometimes need to check a few lines ahead. If I'm forced to process the file line by line instead of "slurp"ing the file, does that mean I can't regexp over multiple lines?
Re^3: regexp over multiple lines by JavaFan (Canon) on Aug 03, 2011 at 08:13 UTC
> If I'm forced to process the file line by line instead of "slurp"ing the file, does that mean I can't regexp over multiple lines?

If you are visiting your relatives one by one, are you having a family reunion? Unless you concatenate the lines yourself, it's not going to work. There's no magic in the regular expression engine that says, "hmmm, I'm not going to match this line, I'm just going to read one more line from the input to see whether it matches now".
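"Concatenate the lines yourself" could look something like the sketch below. It assumes a `<homePrice>` element that may span several lines, and it reads from an in-memory filehandle so that it is self-contained; the sample data and variable names are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data, read line by line from an in-memory filehandle.
my $data = join "\n", 'line4data', '<homePrice>', '1.91</homePrice>', 'line7data', '';
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @prices;
my $buffer = '';
while (my $line = <$fh>) {
    if ($buffer eq '') {
        # Not inside an element yet: skip lines without an opening tag.
        next unless $line =~ /<homePrice>/i;
    }
    $buffer .= $line;   # concatenate the lines ourselves

    # /s lets '.' cross any newline that ends up inside the buffer.
    if ($buffer =~ m{<homePrice>\s*(.+?)\s*</homePrice>}is) {
        push @prices, $1;
        $buffer = '';   # element complete: start over
    }
}
close $fh;

print "@prices\n";
```

The loop still reads one line at a time, but the regex is applied to the accumulated buffer rather than to the single line. This sketch only picks up one element per buffered span, so two elements on the same line would need a /g loop instead.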
Re^4: regexp over multiple lines by liverpaul (Acolyte) on Aug 03, 2011 at 08:27 UTC
Re: regexp over multiple lines by Marshall (Canon) on Aug 03, 2011 at 08:34 UTC
Here is some code to do what you want:

    #!/usr/bin/perl -w
    use strict;

    $/ = undef;  # undefines the record separator,
                 # which is by default \n;
                 # this means that there is no "line"
                 # separator

    my $bigString = <DATA>;  # would normally read one "line",
                             # but since record separator is undefined
                             # it reads all the data as a single string;
                             # this is what "slurp" the file means

    my @prices = $bigString =~ m|<homePrice>\s*(.+?)\s*</homePrice>|ig;

    print "@prices";  # prints: 1.91 295.3 KEuro

    __DATA__
    <homePrice>
    1.91</homePrice>
    <balh></balh><homePrice>295.3 KEuro</homePrice>

The regex term \s* means zero or more whitespace characters. There are 5 of them: 'space', \n, \r, \t, \f (space, new line, carriage return, tab, form feed). So this code just ignores any spaces or End-of-Line things that are seen (they are optional; zero or more is ok).

The (.+?) means one or more of any character (other than a newline, since /s is not set), but "calm your greedy-ness down!" - don't keep going, but stop capturing when the term after the (.+?) matches. A "greedy match" would keep going until it saw the last possible match of that next term.

The /g switch means "match global": keep going and return all matches to the list on the left. The /i is not needed here, but it means ignore case.

This \n stuff is more complicated to explain than it is to use. Basically, Perl will almost always do what you expect. When reading a text file on your platform, Perl translates that platform's native line ending into the single "\n" character, and when you write, it writes your OS-specific "\n" sequence. Unix uses just <line feed> to mean End-of-Line. Windows (and network-standard TCP/IP) programs use <carriage return><line feed>, and some older Apple systems use <carriage return> alone.

One caveat: a Perl program on Unix reading my Windows file will still see "\n" at the end of each line, but the \r that Windows put there is not removed; it stays in the string just before the "\n". In the code above the \s* terms absorb it, because \r is whitespace.
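One point worth checking on your own system: as far as I know, Perl on Unix does not strip the \r from a Windows file by itself; the \r stays in the string in front of the \n. A small sketch of normalizing line endings before matching, assuming CRLF input held in a made-up `$raw` string:

```perl
use strict;
use warnings;

# Hypothetical input: text with Windows (CRLF) line endings,
# as a Perl on Unix would see it after slurping the file.
my $raw = "<homePrice>\r\n1.91</homePrice>\r\n";

# Normalize CRLF (and a lone CR) to a single \n before matching.
(my $text = $raw) =~ s/\r\n?/\n/g;

my @prices = $text =~ m|<homePrice>\s*(.+?)\s*</homePrice>|ig;
print "@prices\n";
```

With this particular regex the \s* terms would absorb a stray \r anyway, since \r counts as whitespace. Normalizing first matters more when a pattern anchors on `$` or when captured text is compared with `eq`.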
Re^2: regexp over multiple lines by liverpaul (Acolyte) on Aug 03, 2011 at 09:03 UTC
Thanks for such a detailed reply, that helps a lot. I even understood some of it! :-) I'm beginning to see that "slurping" the file is the way to go. It would make my code less complicated and messy. From what I understand, it would also allow me to see ahead by using a regexp over several lines. My problem is that my current code works for all the files that need processing, except this one. So, although eventually I will find the time to change it, for now I think I'm going to try to concatenate a few of the lines to hopefully achieve my goal. It may be messy, but it's the method that's least likely to mess up other areas of my code, I think.
Re^3: regexp over multiple lines by ww (Archbishop) on Aug 03, 2011 at 12:24 UTC
Consider this scenario: You have a contract to build a 121 story office tower. You've had problems excavating deep enough to put in the foundation. It's been a messy job but you've gotten close. Now, you've started pouring footings and foundation... and in fact, have managed to get the steel up for the first few stories above ground. That's your code to date. But today, your consultant -- the engineer -- notices that the walls are off plumb -- are tilting, out of whack. They ascertain that your footings and foundation are NOT on bedrock. Do you charge onward, to see how many stories up you can go before the whole enterprise crashes? Unless this is a one-off project, it's going to cost less to tear down what you've done, and get the footings right before continuing.
Re^4: regexp over multiple lines by liverpaul (Acolyte) on Aug 03, 2011 at 15:51 UTC
Re^5: regexp over multiple lines by ww (Archbishop) on Aug 03, 2011 at 22:57 UTC
Some notes below your chosen depth have not been shown here
Re: regexp over multiple lines by blindluke (Hermit) on Aug 03, 2011 at 07:18 UTC
Use the /m modifier. This will treat the string as multiple lines, allowing you to use \n characters as part of your regexp. See perlre for details and an example. Also, if you want to parse XML, you could look for a suitable module, it's usually a better way than to parse the file yourself using regexps. Luke Jefferson
Re^2: regexp over multiple lines by JavaFan (Canon) on Aug 03, 2011 at 07:43 UTC
> Use the /m modifier. This will treat the string as multiple lines, allowing you to use \n characters as part of your regexp.

Bullshit. You don't need any modifier to use `\n` in your regexp. The `/m` modifier will change the meaning of `^` and `$`, irrelevant for the OP. He may need the `/s` modifier, which makes the dot match a newline.
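The three claims above can be checked with a short script. This is only a sketch with a made-up two-line string; each variable holds 1 if the match succeeds and 0 if it does not.

```perl
use strict;
use warnings;

my $str = "first\nsecond";

# \n matches a newline with no modifier at all.
my $plain_newline = $str =~ /first\nsecond/ ? 1 : 0;

# Without /m, ^ anchors only at the start of the string.
my $caret_no_m = $str =~ /^second/ ? 1 : 0;
# With /m, ^ also matches right after each newline.
my $caret_m = $str =~ /^second/m ? 1 : 0;

# Without /s, '.' refuses to match the newline.
my $dot_no_s = $str =~ /first.second/ ? 1 : 0;
# With /s, '.' matches any character, newline included.
my $dot_s = $str =~ /first.second/s ? 1 : 0;

print "plain \\n: $plain_newline\n";
print "^ without /m: $caret_no_m, ^ with /m: $caret_m\n";
print ". without /s: $dot_no_s, . with /s: $dot_s\n";
```

So /m and /s are independent: /m only affects the anchors, /s only affects the dot, and a literal `\n` in the pattern needs neither.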
Re^3: regexp over multiple lines by blindluke (Hermit) on Aug 03, 2011 at 08:17 UTC
Thanks for pointing that out, it seems I got it all wrong. Apologies to liverpaul, if my response was misleading.
Re^3: regexp over multiple lines by liverpaul (Acolyte) on Aug 03, 2011 at 08:08 UTC
Thanks for the reply :-) So, if my file looked like this:

    line1data
    line2data
    line3data
    line4data
    <homePrice>
    1.91</homePrice>
    line7data
    line8data
    line9data

...and I was forced to process the file line by line in a for loop, what regexp would I use (if any are possible) to extract the value of 1.91 by matching both lines?
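One way to keep a for loop and still match across two lines is to join the current line with the next one before applying the regex. The sketch below assumes the lines are already in an array and that an element never spans more than two lines; `@lines`, `$lookahead` and `$window` are made-up names.

```perl
use strict;
use warnings;

# The sample lines from the question, as if already read into an array.
my @lines = (
    'line1data', 'line2data', 'line3data', 'line4data',
    '<homePrice>', '1.91</homePrice>',
    'line7data', 'line8data', 'line9data',
);

my $lookahead = 1;   # assumption: an element spans at most two lines
my @prices;
for my $i (0 .. $#lines) {
    # Only start a window at a line that contains the opening tag.
    next unless $lines[$i] =~ /<homePrice>/i;

    # Join the current line with the next $lookahead lines into one string.
    my $last   = $i + $lookahead > $#lines ? $#lines : $i + $lookahead;
    my $window = join "\n", @lines[$i .. $last];

    push @prices, $1 if $window =~ m{<homePrice>\s*(.+?)\s*</homePrice>}is;
}

print "@prices\n";
```

The regex never sees more than the joined window, so raising `$lookahead` is the only change needed if an element can spread over more lines.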
Re^4: regexp over multiple lines by JavaFan (Canon) on Aug 03, 2011 at 08:31 UTC
Re^5: regexp over multiple lines by liverpaul (Acolyte) on Aug 03, 2011 at 09:12 UTC