Multiple Multiline Regexps?

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Well, I guess this is kinda trivial, but I've been running around in circles, probably overlooking some significant hints. I have an HTML document with an arbitrary number of (tag)-constructs that follow a certain schema (which is known). Imagine the following structure:

[...]
<body>
[...random stuff...]
<li>headline one</li>
<br>
<p>the story</p>
[...random stuff...]
<li>headline two</li>
<br>
<p>the next story</p>
[...random stuff...]
<body>

Now I'd like to traverse that string, multiline-matching each headline and story into $1/$2 or similar, and do this for each story. Any hints on how to accomplish this? Thanks in advance.


Replies are listed 'Best First'.
Re: Multiple Multiline Regexps? by DamnDirtyApe (Curate) on Jul 25, 2002 at 18:37 UTC
This might get you started:

    #! /usr/bin/perl
    use strict ;
    use warnings ;
    $|++ ;

    my $data = qq{
      [...]
      <body>
      [...random stuff...]
      <li>headline one</li>
      <br>
      <p>the story</p>
      [...random stuff...]
      <li>headline two</li>
      <br>
      <p>the next story</p>
      [...random stuff...]
      <body>
    } ;

    # The /s modifier lets "." match newlines; each pass deletes the pair it matched.
    while ( $data =~ s{<li>(.*?)</li>.*?<p>(.*?)</p>}{}s ) {
        print "Headline: $1\nStory: $2\n\n" ;
    }
    __END__

That is, of course, assuming that `<li>` and `<p>` are used only for headlines and stories. IMO, the more restrictive you can make this regexp, the better.

Update: This is probably better done with a proper parser. I've never used it, but HTML::Parser might be a good option.

_______________
D a m n D i r t y A p e
Re: Re: Multiple Multiline Regexps? by Bird (Pilgrim) on Jul 25, 2002 at 18:57 UTC
Agreed, I'd be as restrictive as possible. I'd even add the `\n<br>\n` portion to the regex. Something like this (which also just matches, instead of substituting)...

    while ( $data =~ m{<li>(.*?)</li>\n\s*<br>\n\s*<p>(.*?)</p>}g )

-Bird

p.s. The `\s*` assertions are in there to deal with leading spaces. I don't know if there are any in your data, but DamnDirtyApe had some in his code.
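
For anyone who wants to run the match-only idea end to end, here is a minimal sketch. The helper name `extract_stories` is made up for illustration, and it assumes headlines contain no nested tags and that each `<li>` is followed by `<br>` and then `<p>`, as in the question's sample. It is one way to do it, not the only one.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns a list of [headline, story] array refs.
sub extract_stories {
    my ($html) = @_;
    my @stories;

    # /g resumes after the previous match, /s lets "." cross newlines,
    # /x permits whitespace and comments inside the pattern.
    while ( $html =~ m{
            <li> \s* ([^<]*?) \s* </li>   # headline, assumed free of nested tags
            \s* <br> \s*                  # separator with optional whitespace
            <p>  \s* (.*?)    \s* </p>    # story, may span several lines
        }gsx ) {
        push @stories, [ $1, $2 ];
    }
    return @stories;
}

# Sample data modelled on the structure in the question.
my $data = <<'END_HTML';
<body>
[...random stuff...]
<li>headline one</li>
<br>
<p>the story</p>
[...random stuff...]
<li>headline two</li>
<br>
<p>the next
story</p>
[...random stuff...]
<body>
END_HTML

for my $pair ( extract_stories($data) ) {
    my ( $headline, $story ) = @$pair;
    print "Headline: $headline\nStory: $story\n\n";
}
```

Because the loop only matches, `$data` is left untouched, and using `[^<]*?` for the headline keeps a failed match from stretching across an unrelated `</li>` further down the page.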
Re: Multiple Multiline Regexps? by cfreak (Chaplain) on Jul 25, 2002 at 18:39 UTC
For something like this I'd suggest you use HTML::TokeParser, which has an excellent tutorial. I think you'll find it to be far easier and probably somewhat more efficient.

Hope that helps
Chris

Lobster Aliens Are attacking the world!
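
A rough sketch of what the HTML::TokeParser route might look like, assuming the module is installed from CPAN (it ships with the HTML-Parser distribution). The sub name `stories_from_html` is hypothetical, and the code assumes every headline `<li>` is followed by a `<p>` holding its story, as in the question's sample.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;    # part of the HTML-Parser distribution on CPAN

# Hypothetical helper: returns a list of [headline, story] array refs.
sub stories_from_html {
    my ($html) = @_;

    # Passing a scalar reference makes the parser read from the string itself.
    my $parser = HTML::TokeParser->new( \$html )
        or die "Could not create parser\n";

    my @stories;

    # get_tag('li') skips forward to the next <li> start tag (undef at end of input).
    while ( $parser->get_tag('li') ) {
        # get_trimmed_text collapses runs of whitespace and stops at the named end tag.
        my $headline = $parser->get_trimmed_text('/li');

        # Assumption: the story is the next <p> after the headline.
        $parser->get_tag('p') or last;
        my $story = $parser->get_trimmed_text('/p');

        push @stories, [ $headline, $story ];
    }
    return @stories;
}

my $data = "<li>headline one</li>\n<br>\n<p>the story</p>\n"
         . "<li>headline two</li>\n<br>\n<p>the next story</p>\n";

for my $pair ( stories_from_html($data) ) {
    print "Headline: $pair->[0]\nStory: $pair->[1]\n\n";
}
```

The tokenizer does not care how the tags are split across lines, so there is no need for `/s` or `\s*` bookkeeping. Note that `get_trimmed_text` folds line breaks inside a story into single spaces.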
Yum, Tag Soup... by BorgCopyeditor (Friar) on Jul 25, 2002 at 19:39 UTC
...just like Grandma used to code. :)

FWIW, that's only HTML by analogy. All `<li>` elements are supposed to be children of `<ul>` or `<ol>` elements, which parent elements are not supposed to contain anything but `<li>` elements (though the latter can contain anything legal for other block-level elements). I only mention this because if you start using more sophisticated HTML parsing modules (especially one that tries to build the document tree, if there is such a thing), the example markup could give them indigestion. For more info, you might run your stuff through the validator.

BCE
--Your punctuation skills are insufficient!