Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Well, I guess this is kinda trivial, but I've been running around in circles, probably overlooking some significant hints. I have an HTML document with an arbitrary number of (tag) constructs that follow a certain (known) schema. Imagine the following structure:
    [...]
    <body>
      [...random stuff...]
      <li>headline one</li>
      <br>
      <p>the story</p>
      [...random stuff...]
      <li>headline two</li>
      <br>
      <p>the next story</p>
      [...random stuff...]
    <body>
Now I'd like to traverse that string, matching each headline and story across multiple lines into $1/$2 or similar, and do this for each story. Any hints on how to accomplish this? Thanks in advance.

Re: Multiple Multiline Regexps?
by DamnDirtyApe (Curate) on Jul 25, 2002 at 18:37 UTC

    This might get you started:

    #! /usr/bin/perl

    use strict ;
    use warnings ;
    $|++ ;

    my $data = qq{
    [...]
    <body>
      [...random stuff...]
      <li>headline one</li>
      <br>
      <p>the story</p>
      [...random stuff...]
      <li>headline two</li>
      <br>
      <p>the next story</p>
      [...random stuff...]
    <body>
    } ;

    while ( $data =~ s{<li>(.*?)</li>.*?<p>(.*?)</p>}{}s ) {
        print "Headline: $1\nStory: $2\n\n" ;
    }

    __END__

    That is, of course, assuming that <li> and <p> are used only for headlines and stories. IMO, the more restrictive you can make this regexp, the better.

    Update: This is probably better done with a proper parser. I've never used it, but HTML::Parser might be a good option.
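    For anyone curious what the HTML::Parser route might look like, here is a minimal sketch, not a tested solution. It assumes the module is installed from CPAN, that every <li> is a headline, and that the next <p> after it is the matching story. The helper name stories_from_html is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();    # from CPAN, not core Perl

# Hypothetical helper (not from the thread): returns [headline, story]
# pairs, assuming each <li> is a headline and the next <p> is its story.
sub stories_from_html {
    my ($html) = @_;
    my ( @stories, $in, $buf, $headline );

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Start collecting text when a <li> or <p> opens.
        start_h => [ sub {
            my ($tag) = @_;
            return unless $tag eq 'li' || $tag eq 'p';
            ( $in, $buf ) = ( $tag, '' );
        }, 'tagname' ],
        # Text may arrive in several chunks, so append.
        text_h => [ sub { $buf .= $_[0] if defined $in }, 'dtext' ],
        # On the matching close tag, store the headline or emit a pair.
        end_h => [ sub {
            my ($tag) = @_;
            return unless defined $in && $tag eq $in;
            if ( $tag eq 'li' ) {
                $headline = $buf;
            }
            elsif ( defined $headline ) {
                push @stories, [ $headline, $buf ];
                undef $headline;
            }
            undef $in;
        }, 'tagname' ],
    );
    $parser->parse($html);
    $parser->eof;
    return @stories;
}

my $data = '<body> junk <li>headline one</li> <br> <p>the story</p> junk </body>';
printf "Headline: %s\nStory: %s\n\n", @$_ for stories_from_html($data);
```

    Because the handlers only react to <li> and <p>, the random stuff in between is simply ignored, and entities in the text are decoded via the 'dtext' argspec.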


    _______________
    D a m n D i r t y A p e
      Agreed, I'd be as restrictive as possible. I'd even add the \n<br>\n portion to the regex. Something like this (which also just matches, instead of substituting)...

      while ( $data =~ m{<li>(.*?)</li>\n\s*<br>\n\s*<p>(.*?)</p>}g )

      -Bird

      p.s. The \s* parts are in there to deal with leading whitespace. I don't know if there is any in your data, but DamnDirtyApe had some in his code.
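The two ideas above can be combined: a non-destructive match with /g, plus /s so that . can cross newlines when a story spans several lines. A minimal sketch, assuming the <li> ... <br> ... <p> layout from the question; the name extract_stories and the closing </body> in the sample data are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns a list of [headline, story] pairs.
sub extract_stories {
    my ($html) = @_;
    my @stories;
    # /g resumes after the previous match without modifying the string;
    # /s lets . match newlines, so a story may span several lines.
    while ( $html =~ m{<li>(.*?)</li>\s*<br>\s*<p>(.*?)</p>}gs ) {
        push @stories, [ $1, $2 ];
    }
    return @stories;
}

my $data = <<'HTML';
<body>
  [...random stuff...]
  <li>headline one</li>
  <br>
  <p>the story</p>
  [...random stuff...]
  <li>headline two</li>
  <br>
  <p>the next
  story</p>
  [...random stuff...]
</body>
HTML

for my $pair ( extract_stories($data) ) {
    print "Headline: $pair->[0]\nStory: $pair->[1]\n\n";
}
```

One caveat: with /s, a stray <li> in the random stuff that is not followed by <br> and <p> can make the first capture run on to a later </li>. If headlines never contain markup, swapping the first (.*?) for ([^<]*) should avoid that.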

Re: Multiple Multiline Regexps?
by cfreak (Chaplain) on Jul 25, 2002 at 18:39 UTC
Yum, Tag Soup...
by BorgCopyeditor (Friar) on Jul 25, 2002 at 19:39 UTC

    ...just like Grandma used to code. :)

    FWIW, that's only HTML by analogy. All <li> elements are supposed to be children of <ul> or <ol> elements, which parent elements are not supposed to contain anything but <li> elements (though the latter can contain anything legal for other block-level elements).

    I only mention this because if you start using more sophisticated HTML parsing modules (especially one that tries to build the document tree, if there is such a thing), the example markup could give them indigestion.

    For more info, you might run your stuff through the validator.

    BCE
    --Your punctuation skills are insufficient!