sch has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, hope someone can help me with this - I've been playing with regexes and extracting matching text, with some success.

However, I'm having a problem with multi-line regexes. I've got the following text:

<P>
THE GENERAL SYNOPSIS AT 0100<BR>
LOW SOUTH FITZROY 1000 MOVING SLOWLY NORTH AND FILLING 1006 BY 0100<BR>
TOMORROW. NEW LOW EXPECTED 50 MILES WEST OF TRAFALGAR 1007 BY SAME<BR>
TIME. HIGH 100 MILES WEST OF ROCKALL 1023 SLOW MOVING AND DECLINING<BR>
1021 BY THAT TIME<BR>
<P>
THE AREA FORECASTS FOR THE NEXT 24 HOURS<BR>
(which some of you may recognise as part of the UK shipping forecast)

and I'm trying to extract it so that $1 contains the 0100 and $2 contains:

LOW SOUTH FITZROY 1000 MOVING SLOWLY NORTH AND FILLING 1006 BY 0100<BR>
TOMORROW. NEW LOW EXPECTED 50 MILES WEST OF TRAFALGAR 1007 BY SAME<BR>
TIME. HIGH 100 MILES WEST OF ROCKALL 1023 SLOW MOVING AND DECLINING<BR>
1021 BY THAT TIME<BR>

I've put together this:

$_ = $response->content;
m/GENERAL SYNOPSIS AT ([0-9]{4})<BR>\n(^.*<BR>\n)+<P>/mi;
print "===> ".$1." ".$2."\n";

which gives me
===> 0100 1021 BY THAT TIME<BR>
so I'm capturing $1 ok, but I can't work out how to get the $2 to capture multi-line text.

Can anyone help me out here?
(and please feel free to tell me if I can optimise the regex at all)

Re: Multi-Line Regex's
by davorg (Chancellor) on Sep 18, 2002 at 12:19 UTC

    /m changes the effect of ^ and $ to match at the start and end of lines (rather than the whole string). You need /s which changes the meaning of . so it matches \n.

    I remember it as /s changes the meaning of a single metacharacter and /m changes the meaning of multiple metacharacters.

    #!/usr/bin/perl -w
    use strict;

    $_ = '<P>
    THE GENERAL SYNOPSIS AT 0100<BR>
    LOW SOUTH FITZROY 1000 MOVING SLOWLY NORTH AND FILLING 1006 BY 0100<BR>
    TOMORROW. NEW LOW EXPECTED 50 MILES WEST OF TRAFALGAR 1007 BY SAME<BR>
    TIME. HIGH 100 MILES WEST OF ROCKALL 1023 SLOW MOVING AND DECLINING<BR>
    1021 BY THAT TIME<BR>
    <P>
    THE AREA FORECASTS FOR THE NEXT 24 HOURS<BR>';

    /GENERAL SYNOPSIS AT (\d{4})<BR>\s+(.*)\s<P>/s;

    print "1 -> $1\n2 -> $2\n";
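To see the two modifiers in isolation, here is a minimal sketch. The two-line sample string and the variable names are made up for illustration and are not taken from the forecast page.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical two-line sample, not the real forecast text.
my $text = "first line\nsecond line\n";

# No modifiers: ^ anchors only at the very start of the string.
my @plain = $text =~ /^(\w+)/g;

# /m: ^ also matches just after each embedded newline.
my @with_m = $text =~ /^(\w+)/mg;

# Without /s, . cannot match \n, so .* cannot reach the second line.
my $no_s = $text =~ /first.*second/ ? 1 : 0;

# With /s, . matches \n as well, so .* can span both lines.
my $with_s = $text =~ /first.*second/s ? 1 : 0;

print "plain:  @plain\n";
print "with m: @with_m\n";
print "no s:   $no_s\n";
print "with s: $with_s\n";
```

The two modifiers are independent, so they can be combined (/ms) when a pattern needs both per-line anchors and a dot that crosses newlines.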

    Of course the usual caveats about not parsing HTML with regexes still apply :)

    --
    <http://www.dave.org.uk>

    "The first rule of Perl club is you do not talk about Perl club."
    -- Chip Salzenberg

      Of course the usual caveats about not parsing HTML with regexes still apply :)

      While I can see that, in general, handling big chunks of HTML is better done with things like HTML::Parser, in this simple case, where I'm trying to grab one easily delimited paragraph from a specific webpage, is there any real advantage to those tools?

        Well, only the fact that HTML parsers will actually parse the HTML for you - whereas any regex-based solution will only handle a subset of the possible HTML and will prove extremely fragile if the HTML ever changes.

        You might like to take a look at the section "How not to parse HTML" in chapter 8 of Data Munging with Perl.

        --
        <http://www.dave.org.uk>

        "The first rule of Perl club is you do not talk about Perl club."
        -- Chip Salzenberg

Re: Multi-Line Regex's
by Nemp (Pilgrim) on Sep 18, 2002 at 12:17 UTC
    Hi,

    I speak from personal experience when I say if I were you I'd avoid the regex's for this problem entirely and use some kind of HTML extraction module instead - makes it a lot easier and you don't need to worry about multi-line elements.

    Try looking up HTML::Parser, HTML::TokeParser, HTML::TagFilter, HTML::TokeParser::Simple or something similar

    HTH,
    Neil
Re: Multi-Line Regex's
by sch (Pilgrim) on Sep 18, 2002 at 12:14 UTC

    phew, played around a bit more and I've managed to get to:

    $_ = $response->content;
    m/GENERAL SYNOPSIS AT ([0-9]{4})<BR>\n((^.+<BR>\n)+)<P>/mi;
    print "===> ".$1." ".$2."\n";

    which gives me the output of
    ===> 0100  LOW SOUTH FITZROY 1000 MOVING SLOWLY NORTH AND FILLING 1006 BY 0100<BR>
    TOMORROW. NEW LOW EXPECTED 50 MILES WEST OF TRAFALGAR 1007 BY SAME<BR>
    TIME. HIGH 100 MILES WEST OF ROCKALL 1023 SLOW MOVING AND DECLINING<BR>
    1021 BY THAT TIME<BR>
    
    which is what I wanted.

    But is that the best regex to get that result?
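The extra pair of parentheses matters because a quantified capture group only keeps the text from its last repetition, while a group wrapped around the whole repetition keeps everything. A small sketch of that behaviour follows, using a shortened, made-up stand-in for the forecast text. The inner group is written as non-capturing (?:...) here, which is an assumption about what is wanted: the regex in the post above uses a capturing inner group, which works too but also fills $3.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Shortened, hypothetical stand-in for the forecast text.
my $text = "AT 0100<BR>\nLINE ONE<BR>\nLINE TWO<BR>\n<P>\n";

# Quantified capture group: each repetition overwrites the capture,
# so only the last matched line is left in the second capture.
my ($time_a, $last) = $text =~ /AT ([0-9]{4})<BR>\n(^.*<BR>\n)+<P>/m;

# Capture group around the whole repetition: the second capture now
# holds every line matched by the (non-capturing) inner group.
my ($time_b, $all) = $text =~ /AT ([0-9]{4})<BR>\n((?:^.+<BR>\n)+)<P>/m;

print "last only:\n$last\n";
print "all lines:\n$all\n";
```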

Re: Multi-Line Regex's
by zaimoni (Beadle) on Sep 19, 2002 at 02:04 UTC

    You omitted the s modifier on the regex: it should end with mis rather than mi. The s modifier lets . match newline characters, so a pattern such as .* can span several lines. I normally use that modifier on general principles.
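One thing to watch once /s is on: a greedy .* will run to the last possible end point, not the first. The sketch below assumes the real page has more than one <P> after the synopsis. The page fragment, including the VIKING line, is invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical page fragment with a second <P> further down.
my $page = join "\n",
    '<P>',
    'THE GENERAL SYNOPSIS AT 0100<BR>',
    'LOW SOUTH FITZROY 1000<BR>',
    '1021 BY THAT TIME<BR>',
    '<P>',
    'THE AREA FORECASTS FOR THE NEXT 24 HOURS<BR>',
    '<P>',
    'VIKING<BR>',
    '';

# Greedy .* under /s backtracks from the end, so it stops at the LAST "\n<P>".
my ($t1, $greedy) = $page =~ /SYNOPSIS AT (\d{4})<BR>\n(.*)\n<P>/s;

# Non-greedy .*? expands from the start, so it stops at the FIRST "\n<P>".
my ($t2, $lazy) = $page =~ /SYNOPSIS AT (\d{4})<BR>\n(.*?)\n<P>/s;

print "greedy:\n$greedy\n\n";
print "lazy:\n$lazy\n";
```

With only one <P> after the synopsis, as in the sample quoted in the thread, both forms capture the same text. The difference only shows up on longer input.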