in reply to regexp over multiple lines

Here is some code to do what you want:
#!/usr/bin/perl -w
use strict;

$/ = undef;   # undefines the record separator
              # which is by default \n
              # this means that there is no "line"
              # separator

my $bigString = <DATA>;   # would normally read one "line"
                          # but since record separator is undefined
                          # it reads all the data as a single string
                          # this is what "slurp" the file means

my @prices = $bigString =~ m|<homePrice>\s*(.+?)\s*</homePrice>|ig;

print "@prices";   # prints: 1.91 295.3 KEuro

__DATA__
<homePrice>
1.91</homePrice>
<balh></balh><homePrice>295.3 KEuro</homePrice>
The regex term \s* means zero or more whitespace characters. The five common ones are 'space', \n, \r, \t and \f: space, new line, carriage return, tab and form feed. So this code just ignores any spaces or End-of-Line characters that appear around the price (they are optional: zero or more is OK).
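As a small illustration of the \s* behaviour described above, the sketch below runs the same pattern over three sample strings. The sample strings are made up for this example; only the pattern comes from the code above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical inputs: the same price with different padding around it.
my @samples = (
    "<homePrice>1.91</homePrice>",           # no whitespace at all
    "<homePrice> 1.91 </homePrice>",         # single spaces
    "<homePrice>\r\n\t1.91\n</homePrice>",   # CR, LF and tab
);

my @found;
for my $sample (@samples) {
    # \s* soaks up zero or more whitespace characters on each side.
    my ($price) = $sample =~ m|<homePrice>\s*(.+?)\s*</homePrice>|;
    push @found, $price;
}

print join(',', @found), "\n";   # prints: 1.91,1.91,1.91
```

All three samples give the same capture, because the whitespace on either side is consumed by \s* rather than by the capture group.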

The (.+?) means one or more of any character (except a newline, unless /s is used), but "calm your greediness down!": don't keep going, but stop capturing as soon as the term after the (.+?) matches. A "greedy match" would keep going until it saw the last possible match of that next term.
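To make the greedy versus non-greedy difference concrete, here is a minimal sketch. The one-line sample string is invented for the example, and the tags are the ones used in the code above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample: two homePrice elements on one line.
my $text = '<homePrice>1.91</homePrice><homePrice>295.3 KEuro</homePrice>';

# Non-greedy: (.+?) stops at the first closing tag it can reach.
my ($lazy) = $text =~ m|<homePrice>(.+?)</homePrice>|;

# Greedy: (.+) runs on to the last closing tag in the string.
my ($greedy) = $text =~ m|<homePrice>(.+)</homePrice>|;

print "lazy:   $lazy\n";     # lazy:   1.91
print "greedy: $greedy\n";   # greedy: 1.91</homePrice><homePrice>295.3 KEuro
```

The greedy version swallows the first closing tag and the second opening tag, which is why the non-greedy form is the one to use here.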

The /g switch means "match globally": keep going after the first match and, in list context, return all of the captures to the array on the left of the assignment. The /i is not needed here, but it means ignore case.
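The effect of the two switches can be seen by leaving each one out in turn. This is only a sketch: the sample string, with one upper-case element, is invented to make /i matter.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample: the second element is written in upper case.
my $text = '<homePrice>1.91</homePrice> <HOMEPRICE>295.3</HOMEPRICE>';

# /g in list context: every capture is returned at once.
my @all = $text =~ m|<homePrice>(.+?)</homePrice>|ig;

# Without /i the upper-case element is not matched.
my @case_sensitive = $text =~ m|<homePrice>(.+?)</homePrice>|g;

# Without /g only the first match is returned.
my @first_only = $text =~ m|<homePrice>(.+?)</homePrice>|i;

print "all:            @all\n";              # all:            1.91 295.3
print "case-sensitive: @case_sensitive\n";   # case-sensitive: 1.91
print "first only:     @first_only\n";       # first only:     1.91
```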

This \n stuff is more complicated to explain than it is to use. Basically, Perl will almost always do what you expect. In text mode it translates your platform's native line termination into the single "\n" character when reading, and when you do a write, it will write your OS-specific "\n" thing.

Unix uses just <line feed> to mean End-of-Line. Windows programs (and many Internet protocols) use <carriage return>, <line feed> to mean End-of-Line, and classic Mac OS (before OS X) used just <carriage return>. When reading a file created on your own platform, Perl will translate what it reads into a single \n character. One caveat: a Perl program on Unix that reads my Windows file does no translation, so the \r that Windows put there is still in the data, just before the "\n". In the code above that does not matter, because \s* treats \r as whitespace and skips over it.
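When a stray \r does reach your program (for example, Windows-style data read with no CRLF translation), chomp alone will not remove it. The sketch below uses a made-up in-memory line rather than a real file, so it behaves the same on any platform. The substitution is one common way to handle both endings, not the only one.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical line as stored in a Windows-style file: ends in CR LF.
my $line = "295.3 KEuro\r\n";

my $chomped = $line;
chomp $chomped;              # removes the trailing "\n" only (default $/)

my $stripped = $line;
$stripped =~ s/\r?\n\z//;    # removes "\n" and an optional "\r" before it

printf "after chomp: %d chars\n", length $chomped;    # 12 (the \r is still there)
printf "after s///:  %d chars\n", length $stripped;   # 11
```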

Re^2: regexp over multiple lines
by liverpaul (Acolyte) on Aug 03, 2011 at 09:03 UTC

    Thanks for such a detailed reply, that helps a lot. I even understood some of it! :-)

    I'm beginning to see that "slurping" the file is the way to go. It would make my code less complicated and messy. From what I understand, it would also allow me to see ahead by using a regexp over several lines.
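A rough sketch of why slurping helps with a regexp over several lines, comparing it with line-by-line reading. The sample content and variable names are assumptions for illustration, and an in-memory filehandle stands in for a real file. It also uses "local $/" inside a do block, a common idiom that keeps the change to the record separator from leaking into the rest of the program.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input in which the first element is split across two lines.
my $content = "<homePrice>\n1.91</homePrice>\n<homePrice>295.3 KEuro</homePrice>\n";
my $pattern = qr|<homePrice>\s*(.+?)\s*</homePrice>|i;

# Line by line: the first element is never seen whole, so it is missed.
open my $by_line, '<', \$content or die "Cannot open in-memory file: $!";
my @line_hits;
while (my $line = <$by_line>) {
    push @line_hits, $line =~ /$pattern/g;
}
close $by_line;

# Slurped: "local $/" undefines the record separator for this block only.
open my $whole, '<', \$content or die "Cannot open in-memory file: $!";
my $big_string = do { local $/; <$whole> };
close $whole;
my @slurp_hits = $big_string =~ /$pattern/g;

print "line by line: ", join(', ', @line_hits), "\n";    # 295.3 KEuro
print "slurped:      ", join(', ', @slurp_hits), "\n";   # 1.91, 295.3 KEuro
```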

    My problem is that my current code works for all the files that need processing, except this one. So, although eventually I will find the time to change it, for now I think I'm going to try to concatenate a few of the lines to hopefully achieve my goal. It may be messy, but it's the method that's least likely to mess up other areas of my code, I think.

      Consider this scenario:

      You have a contract to build a 121-story office tower. You've had problems excavating deep enough to put in the foundation. It's been a messy job but you've gotten close.

      Now, you've started pouring footings and foundation... and in fact, have managed to get the steel up for the first few stories above ground.

      That's your code to date.

      But today, your consultant -- the engineer -- notices that the walls are off plumb -- are tilting, out of whack. They ascertain that your footings and foundation are NOT on bedrock.

      Do you charge onward, to see how many stories up you can go before the whole enterprise crashes?

      Unless this is a one-off project, it's going to cost less to tear down what you've done, and get the footings right before continuing.

        The trouble is, I'm not a builder. I'm just teaching myself the building trade as I go along :-)

        This is just a once off project for my website, so the only thing it costs me is time and effort.

        I've decided that I'm going to try to do it the way everyone is recommending; I'm just not sure I have the ability to do it that way...yet. You see, the data I need to extract will be from HTML files and XML files. I will be trying to design a program that will process both types of input. I'll give it a day or two and see how I get on.