in reply to simple regex help

hi nmerriweather,

i'd like to not use an html tree module to handle this -- and keep it all in regex. is this possible?
In a word yes. But, imo, very tricky.

You didn't say what exactly you're looking for. Perhaps some examples, including those nested <li>s?

It's very easy to "get at" all the html elements. I'd wager a solution could be found using something like the following.

#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

my $html = do{local $/; <DATA>};

my $p = HTML::TokeParser::Simple->new(\$html)
  or die "can't parse string: $!\n";

while (my $t = $p->get_token){
  printf "*%s*\n", $t->as_is;
}

__DATA__
<li><span class="title">Title</span> MATCH HERE </li>
output:
*<li>*
*<span class="title">*
*Title*
*</span>*
* MATCH HERE *
*</li>*
*
*

Re^2: simple regex help
by nmerriweather (Friar) on Apr 18, 2007 at 17:08 UTC
    <quote>You didn't say what exactly you're looking for. Perhaps some examples, including those nested <li>s?</quote>

    Well, anything in the 'match here' -- the content changes. The only given I know is that the outermost match is this:

    <li><span class="title">Title</span> MATCH HERE </li>

    MATCH HERE could be a single letter, or it could be an html structure that could itself match the regex.

    i really need to keep this in regex if possible -- using the tree objects is a last resort
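
    For what it's worth, a regex-only version is possible if recursive patterns are available. What follows is only a minimal sketch, assuming Perl 5.10 or later and markup in which every <li> has its closing </li>. The group names `content` and `li` are made up for this illustration, and it makes no attempt to cope with comments, scripts, or omitted end tags.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;    # named recursive subpatterns need Perl 5.10 or later

my $html = '<li><span class="title">Title</span>'
         . '<ul><li>one</li><li>two</li></ul> MATCH HERE </li>';

my $re = qr{
    <li> \s* <span \s+ class="title"> [^<]* </span>
    (?<content>
        (?:
            (?&li)                 # a complete, balanced nested li element
          | (?! </?li\b ) .        # or one character that does not begin an li tag
        )*
    )
    </li>                          # the closing tag of the outermost li

    (?(DEFINE)
        (?<li> <li\b [^>]* > (?: (?&li) | (?! </?li\b ) . )* </li> )
    )
}xsi;

my $match = '';
if ($html =~ $re) {
    $match = $+{content};
}
print "$match\n";
```

    With the sample above this prints `<ul><li>one</li><li>two</li></ul> MATCH HERE`. The lookahead stops the "any character" branch from swallowing an li tag, so nested items can only be consumed as balanced pairs.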

      I've used a stack to keep track of opening/closing li tags.
      #!/usr/bin/perl
      use strict;
      use warnings;
      use HTML::TokeParser::Simple;

      my $html = do{local $/; <DATA>};

      my $p = HTML::TokeParser::Simple->new(\$html)
        or die "can't parse string: $!\n";

      while (my $t = $p->get_token){
        last if $t->is_end_tag('span');
      }

      my ($match, @li_stack);
      while (my $t = $p->get_token){
        if ($t->is_start_tag('li')){
          push @li_stack, 'li';
        }
        if ($t->is_end_tag('li')){
          if (@li_stack){
            pop @li_stack;
          }
          else{
            last;
          }
        }
        $match .= $t->as_is;
      }
      print "$match\n";

      __DATA__
      <li><span class="title">Title</span><ul><li>one</li><li>two</li></ul> MATCH HERE </li>
      output:
      <ul><li>one</li><li>two</li></ul> MATCH HERE
      update:
      Added output.

      update 2
      see ikegami's reply below.

        __DATA__
        <li><span class="title">Title</span><ul><li>one</ul> MATCH HERE </li> this shouldn't match

        outputs

        <ul><li>one</ul> MATCH HERE </li> this shouldn't match

        instead of the expected

        <ul><li>one</ul> MATCH HERE
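
        One way to cope with input like this is to put the list containers on the stack as well, so that a closing </ul> or </ol> also discards any <li> left open inside it. The following is only a rough sketch of that idea. To keep it runnable with core Perl it swaps HTML::TokeParser::Simple for a crude `split`-based tokenizer, which assumes no '>' inside attribute values and no comments or scripts. The name `extract_match` is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the content of the outer <li> that follows
# the title span, tolerating <li> elements whose </li> was omitted.
sub extract_match {
    my ($html) = @_;

    # Crude tokenizer: alternate runs of text and tags, keeping the tags.
    my @tokens = grep { length } split /(<[^>]*>)/, $html;

    # Skip everything up to and including the closing </span>.
    while (@tokens) {
        last if shift(@tokens) =~ m{^</span\s*>$}i;
    }

    my $match = '';
    my @stack;    # open 'li' and 'list' (ul/ol) elements inside the outer <li>
    for my $tok (@tokens) {
        if ($tok =~ m{^<(?:ul|ol)\b}i) {
            push @stack, 'list';
        }
        elsif ($tok =~ m{^</(?:ul|ol)\b}i) {
            # Closing a list also closes any <li> still open inside it.
            while (@stack) {
                last if pop(@stack) eq 'list';
            }
        }
        elsif ($tok =~ m{^<li\b}i) {
            # A new <li> implicitly closes an open sibling <li>.
            pop @stack if @stack && $stack[-1] eq 'li';
            push @stack, 'li';
        }
        elsif ($tok =~ m{^</li\b}i) {
            last unless @stack;    # nothing open: this is the outer </li>
            pop @stack if $stack[-1] eq 'li';
        }
        $match .= $tok;
    }
    return $match;
}

print extract_match(
    q{<li><span class="title">Title</span><ul><li>one</ul> MATCH HERE </li> this shouldn't match}
), "\n";
```

        For the data above this prints `<ul><li>one</ul> MATCH HERE`, and it still returns the full `<ul>...</ul> MATCH HERE` content when every </li> is present. The same stack logic should carry over to the token objects, using is_start_tag and is_end_tag in place of the tag regexes.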