Multi-Line Regex's

sch has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, hope someone can help me with this. I've been playing with regexes and extracting matching text, with some success.

However, I'm having a problem with multi-line regexes. I've got the following text:

<P>
THE GENERAL SYNOPSIS AT 0100<BR>
LOW SOUTH FITZROY 1000 MOVING SLOWLY NORTH AND FILLING 1006 BY 0100<BR>
TOMORROW. NEW LOW EXPECTED 50 MILES WEST OF TRAFALGAR 1007 BY SAME<BR>
TIME. HIGH 100 MILES WEST OF ROCKALL 1023 SLOW MOVING AND DECLINING<BR>
1021 BY THAT TIME<BR>
<P>
THE AREA FORECASTS FOR THE NEXT 24 HOURS<BR>

(which some of you may recognise as part of the UK shipping forecast)

and I'm trying to extract it so that $1 contains the 0100 and $2 contains:

LOW SOUTH FITZROY 1000 MOVING SLOWLY NORTH AND FILLING 1006 BY 0100<BR>
TOMORROW. NEW LOW EXPECTED 50 MILES WEST OF TRAFALGAR 1007 BY SAME<BR>
TIME. HIGH 100 MILES WEST OF ROCKALL 1023 SLOW MOVING AND DECLINING<BR>
1021 BY THAT TIME<BR>

I've put together this:

  $_ = $response->content;
  m/GENERAL SYNOPSIS AT ([0-9]{4})<BR>\n(^.*<BR>\n)+<P>/mi;
  print "===> ".$1."  ".$2."\n";

which gives me


===> 0100  1021 BY THAT TIME<BR>

so I'm capturing $1 ok, but I can't work out how to get the $2 to capture multi-line text.

Can anyone help me out here?
(and please feel free to tell me if I can optimise the regex at all)

Re: Multi-Line Regex's by davorg (Chancellor) on Sep 18, 2002 at 12:19 UTC
`/m` changes the effect of `^` and `$` so that they match at the start and end of each line (rather than of the whole string). You need `/s`, which changes the meaning of `.` so that it matches `\n`. I remember it as `/s` changes the meaning of a single metacharacter and `/m` changes the meaning of multiple metacharacters.

  #!/usr/bin/perl -w
  use strict;

  $_ = '<P>
  THE GENERAL SYNOPSIS AT 0100<BR>
  LOW SOUTH FITZROY 1000 MOVING SLOWLY NORTH AND FILLING 1006 BY 0100<BR>
  TOMORROW. NEW LOW EXPECTED 50 MILES WEST OF TRAFALGAR 1007 BY SAME<BR>
  TIME. HIGH 100 MILES WEST OF ROCKALL 1023 SLOW MOVING AND DECLINING<BR>
  1021 BY THAT TIME<BR>
  <P>
  THE AREA FORECASTS FOR THE NEXT 24 HOURS<BR>';

  /GENERAL SYNOPSIS AT (\d{4})<BR>\s+(.+?)\s<P>/s;
  print "1 -> $1\n2 -> $2\n";

Of course the usual caveats about not parsing HTML with regexes still apply :)

-- <http://www.dave.org.uk>
"The first rule of Perl club is you do not talk about Perl club." -- Chip Salzenberg
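To make the `/m` versus `/s` distinction concrete, here is a minimal sketch on a made-up two-line string (the variable names and the sample text are illustrative assumptions, not part of the original question):

```perl
use strict;
use warnings;

# Made-up sample text: two lines, each ending in a newline.
my $text = "first line\nsecond line\n";

# Without /m, ^ anchors only at the very start of the string.
my @no_m   = $text =~ /^(\w+)/g;
# With /m, ^ also anchors just after each embedded newline.
my @with_m = $text =~ /^(\w+)/mg;

# Without /s, . cannot cross a newline, so this cannot reach "second".
my ($no_s)   = $text =~ /first(.*)second/;
# With /s, . matches the newline as well, so the match succeeds.
my ($with_s) = $text =~ /first(.*)second/s;

print "no /m:   @no_m\n";
print "with /m: @with_m\n";
print "no /s:   ", (defined $no_s ? "matched" : "no match"), "\n";
print "with /s: ", (defined $with_s ? "matched" : "no match"), "\n";
```

The two modifiers are independent, so a pattern that needs both line-anchored `^` and a newline-crossing `.` can carry `/ms` together.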
Re: Re: Multi-Line Regex's by sch (Pilgrim) on Sep 18, 2002 at 13:54 UTC
"Of course the usual caveats about not parsing HTML with regexes still apply :)"

While I can see that, in general, handling big chunks of HTML is preferably done with things like HTML::Parser, in this simple case, where I'm trying to grab one easily delimited paragraph from a specific webpage, is there any real advantage to those tools?
Re: Re: Re: Multi-Line Regex's by davorg (Chancellor) on Sep 18, 2002 at 14:01 UTC
Well, only the fact that HTML parsers will actually parse the HTML for you - whereas any regex-based solution will only handle a subset of the possible HTML and will prove extremely fragile if the HTML ever changes. You might like to take a look at the section "How not to parse HTML" in chapter 8 of Data Munging with Perl.

-- <http://www.dave.org.uk>
"The first rule of Perl club is you do not talk about Perl club." -- Chip Salzenberg
Re: Multi-Line Regex's by Nemp (Pilgrim) on Sep 18, 2002 at 12:17 UTC
Hi,

I speak from personal experience when I say that if I were you I'd avoid regexes for this problem entirely and use some kind of HTML extraction module instead - it makes things a lot easier and you don't need to worry about multi-line elements.

Try looking up HTML::Parser, HTML::TokeParser, HTML::TagFilter, HTML::TokeParser::Simple or something similar.

HTH,
Neil
Re: Multi-Line Regex's by sch (Pilgrim) on Sep 18, 2002 at 12:14 UTC
Phew, played around a bit more and I've managed to get to:

  $_ = $response->content;
  m/GENERAL SYNOPSIS AT ([0-9]{4})<BR>\n((^.+<BR>\n)+)<P>/mi;
  print "===> ".$1." ".$2."\n";

which gives me the output of

===> 0100 LOW SOUTH FITZROY 1000 MOVING SLOWLY NORTH AND FILLING 1006 BY 0100<BR>
TOMORROW. NEW LOW EXPECTED 50 MILES WEST OF TRAFALGAR 1007 BY SAME<BR>
TIME. HIGH 100 MILES WEST OF ROCKALL 1023 SLOW MOVING AND DECLINING<BR>
1021 BY THAT TIME<BR>

which is what I wanted. But is that the best regex to get that result?
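For anyone wondering why the extra pair of parentheses makes the difference: when a capturing group is itself repeated, Perl keeps only the text from the last repetition, so the group has to be wrapped in an outer capture to get the whole run. Here is a minimal sketch of that on a shortened, made-up sample (the `HEAD` marker and line contents are placeholders, and the non-capturing `(?:...)` inner group is just one possible tidy-up, not something from the thread):

```perl
use strict;
use warnings;

# Shortened, made-up stand-in for the forecast text.
my $text = "HEAD<BR>\nline one<BR>\nline two<BR>\nline three<BR>\n<P>\n";

# Repeated capturing group: the capture is overwritten on every
# repetition, so only the final line survives.
my ($last) = $text =~ /HEAD<BR>\n(^.+<BR>\n)+<P>/m;

# Outer capture around the repetition: it spans every repetition.
# The inner group is non-capturing since its own value is not needed.
my ($all) = $text =~ /HEAD<BR>\n((?:^.+<BR>\n)+)<P>/m;

print "repeated group only:\n$last";
print "outer capture:\n$all";
```

With the inner group non-capturing, the wanted text stays in `$2` of the full pattern and no unused `$3` is created.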
Re: Multi-Line Regex's by zaimoni (Beadle) on Sep 19, 2002 at 02:04 UTC
You omitted the `s` modifier on the regex: it should end with `mis` rather than `mi`. With `/s`, `.` also matches newlines, so a single `.*` can span several lines. I normally use that modifier on general principles.
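One thing to watch once `/s` is switched on: `.` can now run across the whole page, so a greedy `.+` or `.*` will stretch to the last place the rest of the pattern can match, not the first. The sketch below assumes a shortened, made-up page with a second `<P>` further down (the sample text and variable names are illustrative only) and compares a greedy and a non-greedy quantifier:

```perl
use strict;
use warnings;

# Shortened, made-up sample with a second <P> further down the page.
my $page = join "\n",
    '<P>',
    'THE GENERAL SYNOPSIS AT 0100<BR>',
    'LOW SOUTH FITZROY 1000<BR>',
    'HIGH ROCKALL 1023<BR>',
    '<P>',
    'THE AREA FORECASTS FOR THE NEXT 24 HOURS<BR>',
    '<P>',
    'VIKING<BR>';

# Greedy: .+ runs to the end of the string and backs off to the LAST "\s<P>".
my ($greedy) = $page =~ /GENERAL SYNOPSIS AT \d{4}<BR>\s+(.+)\s<P>/s;

# Non-greedy: .+? stops at the FIRST "\s<P>" after the synopsis.
my ($lazy) = $page =~ /GENERAL SYNOPSIS AT \d{4}<BR>\s+(.+?)\s<P>/s;

print "greedy:\n$greedy\n\n";
print "non-greedy:\n$lazy\n";
```

On this sample the greedy version also swallows the area forecast heading, while the non-greedy version stops at the end of the synopsis paragraph.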