nmerriweather has asked for the wisdom of the Perl Monks concerning the following question:

i'm trying to scrape everything within 2 tags on a web page, and failing miserably.
<li><span class="title">Title</span> MATCH HERE </li>
usually i just do a simple match like
<li><span class="[^"]+">[^<]+</span>([^<]+)</li>
but i'm running into 2 problems:

a_ i can have html in the matched area, which breaks my usual trick of simply stopping the match at the next < . i've been failing with lookahead/lookbehind. i read the chapters in mastering regex several dozen times, and every time i think i understand these 2 beasts, i realize i don't.

b_ my plan-of-attack is screwed up when i encounter a nested <li>.*</li> tag. i'd like to not use an html tree module to handle this -- and keep it all in regex. is this possible?

Re: simple regex help
by wfsp (Abbot) on Apr 18, 2007 at 16:57 UTC
    hi nmerriweather,

    "i'd like to not use an html tree module to handle this -- and keep it all in regex. is this possible?"
    In a word yes. But, imo, very tricky.

    You didn't say what exactly you're looking for. Perhaps some examples, including those nested <li>s?

    It's very easy to "get at" all the html elements. I'd wager a solution could be found using something like the following.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TokeParser::Simple;

    my $html = do{local $/; <DATA>};
    my $p = HTML::TokeParser::Simple->new(\$html)
        or die "can't parse string: $!\n";

    while (my $t = $p->get_token){
        printf "*%s*\n", $t->as_is;
    }
    __DATA__
    <li><span class="title">Title</span> MATCH HERE </li>
    output:
    *<li>*
    *<span class="title">*
    *Title*
    *</span>*
    * MATCH HERE *
    *</li>*
    *
    *

      <quote>You didn't say what exactly you're looking for. Perhaps some examples, including those nested <li>s?</quote>

      Well, anything in the 'match here' -- the content changes. The only given I know is that the outermost match is this:

      <li><span class="title">Title</span> MATCH HERE </li>

      MATCH HERE could be a single letter, or it could be an html structure that itself matches the regex.

      i really need to keep this in regex if possible -- using the tree objects is a last resort

        I've used a stack to keep track of opening/closing li tags.
        #!/usr/bin/perl
        use strict;
        use warnings;
        use HTML::TokeParser::Simple;

        my $html = do{local $/; <DATA>};
        my $p = HTML::TokeParser::Simple->new(\$html)
            or die "can't parse string: $!\n";

        while (my $t = $p->get_token){
            last if $t->is_end_tag('span');
        }

        my ($match, @li_stack);
        while (my $t = $p->get_token){
            if ($t->is_start_tag('li')){
                push @li_stack, 'li';
            }
            if ($t->is_end_tag('li')){
                if (@li_stack){
                    pop @li_stack;
                }
                else{
                    last;
                }
            }
            $match .= $t->as_is;
        }
        print "$match\n";
        __DATA__
        <li><span class="title">Title</span><ul><li>one</li><li>two</li></ul> MATCH HERE </li>
        output:
        <ul><li>one</li><li>two</li></ul> MATCH HERE
        update:
        Added output.

        update 2:
        See ikegami's reply below.

Re: simple regex help
by ikegami (Patriarch) on Apr 18, 2007 at 17:14 UTC
    • Complication: The closing LI tag is optional.
    • Simplification: An LI element can only be closed by a few tags, they will always close an LI element, and one of them must be present. They are: </LI>, <LI>, </OL> and </UL>.
    • Simplification: OL, UL and LI elements cannot be placed inside of a SPAN element.
    • Complication: OL, UL and (indirectly) LI elements can be placed inside of an LI element.
    our $ul_or_ol;
    local $ul_or_ol = qr{
        (?:
            <ul \b
            (?: (??{ $ul_or_ol }) | (?! </ul \b ). )*
            </ul \b [^>]* >
        |
            <ol \b
            (?: (??{ $ul_or_ol }) | (?! </ol \b ). )*
            </ol \b [^>]* >
        )
    }xi;

    # Under /x a literal space in the pattern is ignored, so the gap
    # between "span" and "class" is written as \s+.
    my $re = qr{
        <li \b [^>]* >
        <span \s+ class="[^"]+"> [^<]+ </span>
        (
            (?:
                $ul_or_ol
            |
                (?! < (?:li|/li|/ol|/ul) \b ).
            )*
        )
        (?:
            </li \b [^>]* >
        |
            (?= < (?:li|/ol|/ul) \b )
        )
    }xi;
    • Bug/Assumption: The above can fail if there's a < or a > inside an attribute value, a comment, a script or a style.
    • Bug/Assumption: The above can fail if some valid SGML constructs are used. Fortunately, no one uses them.
    • Bug: Unmaintainable.
    • Note: It can be simplified if more assumptions are made. It can be optimized as well.

    Untested.

    Update: Fixed to handle nested lists in the "MATCH HERE" portion.
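
    For problem (a) on its own, i.e. inline markup in the content but no nested list, the "(?! ... ) ." idiom from the regex above can be tried in isolation. The following is only a minimal sketch: the sample HTML and the variable names are made up for illustration, and it assumes the closing </li> is always present.

```perl
use strict;
use warnings;

# Hypothetical sample input: inline markup, but no nested list.
my $html = '<li><span class="title">Title</span> see <a href="/x">this <b>link</b></a> </li>';

# The group (?: (?! </?li \b ) . )* consumes one character at a time,
# but only while the text ahead is not the start of an <li> or </li> tag.
my $re = qr{
    <li \b [^>]* >
    <span \s+ class="[^"]+" > [^<]+ </span>
    ( (?: (?! </?li \b ) . )* )
    </li \s* >
}xis;

my ($match) = $html =~ $re;
print defined $match ? "[$match]\n" : "no match\n";
```

    The negative lookahead is what replaces the [^<]+ "stop at the next <" trick: other tags are allowed through and only an LI boundary stops the scan. A nested <li> in the content makes the outer item fail to match, which is why the full regex above needs the recursive list pattern.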

Re: simple regex help
by ikegami (Patriarch) on Apr 18, 2007 at 19:43 UTC
    It's super easy to find the desired LI using XPath.
    use HTML::TreeBuilder qw( );
    use HTML::TreeBuilder::XPath qw( );

    my $html = do{local $/; <DATA>};

    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @results = $tree->findnodes(
        '//li/*[1][name()="span"]/@class/parent::*/parent::*'
    );

    foreach my $li (@results) {
        # $li holds the HTML::Element object for the LI.
        ...
    }

    $tree->delete();

    It's a little bit tricky to serialize the LI without the LI start tag, the SPAN element and the LI end tag.

    use HTML::Entities qw( encode_entities );

    sub node_as_html {
        my ($node) = @_;
        if (ref($node)) {
            my $html = $node->as_HTML(undef, undef, {});
            chomp($html);
            return $html;
        } else {
            return encode_entities($node);
        }
    }

    ...

    my @children = $li->content_list();
    shift(@children);  # Skip SPAN
    print(node_as_html($_)) foreach @children;
    print("\n");

    ...

Re: simple regex help
by Moron (Curate) on Apr 18, 2007 at 17:19 UTC
    You should expect to process tags recursively. You could use XML::Parser, or one of its wrappers, to do that for you. If you roll your own, the logic you are stumbling on should first check for the closing tag of the current element. Any other '<' should trigger a recursive call that reads the nested tag. Matching '</current-tag-name>' should be the last step before returning from the recursive sub that reads tags.
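
    The roll-your-own approach described above might be sketched as follows. This is a simplified variant, not a general parser: the sub name element_content and the sample HTML are invented for illustration, it recurses only for nested tags of the same name (so void tags such as <br> need no special handling), and it assumes every <li> is explicitly closed.

```perl
use strict;
use warnings;

# Read the content of an element whose start tag has just been consumed.
# Scanning resumes at pos($$html_ref) and stops after the matching end tag.
# Returns the content and the text of the end tag that closed it.
sub element_content {
    my ($html_ref, $tag) = @_;
    my $content = '';
    while (1) {
        if ($$html_ref =~ m{\G(</\Q$tag\E\s*>)}gci) {
            return ($content, $1);               # our own closing tag
        }
        elsif ($$html_ref =~ m{\G(<\Q$tag\E\b[^>]*>)}gci) {
            my $open = $1;                       # save before recursing
            my ($inner, $close) = element_content($html_ref, $tag);
            $content .= $open . $inner . $close; # keep nested markup intact
        }
        elsif ($$html_ref =~ m{\G([^<]+|<)}gc) {
            $content .= $1;                      # text, or part of another tag
        }
        else {
            die "unclosed <$tag>\n";             # ran off the end of the input
        }
    }
}

my $html = '<li><span class="title">Title</span>'
         . '<ul><li>one</li><li>two</li></ul> MATCH HERE </li>';

# Position the scan just past the opening <li> and its title span.
my $match;
if ($html =~ m{<li\b[^>]*><span\s+class="[^"]+">[^<]+</span>}gci) {
    ($match) = element_content(\$html, 'li');
}
print defined $match ? "$match\n" : "no match\n";
```

    Each recursive call owns exactly one <li> ... </li> pair, so the outermost call returns only when it meets the closing tag that belongs to the outer item. The \G anchor with /gc keeps every match attempt pinned to the current scan position.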
    __________________________________________________________________________________

    Free your mind!