
I've used a stack to keep track of opening/closing li tags.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Slurp the sample HTML from the __DATA__ section.
my $html = do { local $/; <DATA> };

my $p = HTML::TokeParser::Simple->new(\$html)
    or die "can't parse string: $!\n";

# Skip everything up to and including the closing </span>.
while (my $t = $p->get_token) {
    last if $t->is_end_tag('span');
}

# Collect tokens until the </li> that closes the enclosing item.
# @li_stack tracks the <li> tags opened since the span.
my ($match, @li_stack);
while (my $t = $p->get_token) {
    if ($t->is_start_tag('li')) {
        push @li_stack, 'li';
    }
    if ($t->is_end_tag('li')) {
        if (@li_stack) {
            pop @li_stack;
        }
        else {
            last;
        }
    }
    $match .= $t->as_is;
}
print "$match\n";

__DATA__
<li><span class="title">Title</span><ul><li>one</li><li>two</li></ul> MATCH HERE </li>

output:
<ul><li>one</li><li>two</li></ul> MATCH HERE
update:
Added output.

update 2:
See ikegami's reply below.

Re^4: simple regex help
by ikegami (Patriarch) on Apr 18, 2007 at 17:52 UTC
    __DATA__
    <li><span class="title">Title</span><ul><li>one</ul> MATCH HERE </li> this shouldn't match

    outputs

    <ul><li>one</ul> MATCH HERE </li> this shouldn't match

    instead of the expected

    <ul><li>one</ul> MATCH HERE
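
The failure above comes from the omitted </li>: the stack still holds an 'li' when the outer </li> arrives, so the loop pops instead of stopping. One possible repair, offered only as a rough sketch and not as anything posted in this thread, is to track ul and ol along with li, and to let a closing tag also discard whatever was left open inside it. The sub name extract_after_span and the @open array are made up for illustration, and only li/ul/ol are tracked, so other kinds of broken markup will still trip it up.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical helper (not from the thread): returns the markup between
# the first </span> and the </li> that closes the enclosing list item.
sub extract_after_span {
    my ($html) = @_;
    my $p = HTML::TokeParser::Simple->new(\$html)
        or die "can't parse string: $!\n";

    # Skip everything up to and including the first </span>.
    while (my $t = $p->get_token) {
        last if $t->is_end_tag('span');
    }

    my $match = '';
    my @open;    # names of the li/ul/ol elements opened since the span

    TOKEN:
    while (my $t = $p->get_token) {
        # A new <li> implicitly closes a still-open sibling <li>.
        pop @open if $t->is_start_tag('li') && @open && $open[-1] eq 'li';

        for my $name (qw(li ul ol)) {
            if ($t->is_start_tag($name)) {
                push @open, $name;
            }
            elsif ($t->is_end_tag($name)) {
                # Find the innermost open element with this name.
                my ($idx) = grep { $open[$_] eq $name } reverse 0 .. $#open;

                # None open: this tag closes something outside our region.
                last TOKEN unless defined $idx;

                # Closing it also closes anything left open inside it,
                # e.g. </ul> ends an <li> that never got its </li>.
                splice @open, $idx;
            }
        }
        $match .= $t->as_is;
    }
    return $match;
}

print extract_after_span(
    q{<li><span class="title">Title</span><ul><li>one</ul> MATCH HERE </li> this shouldn't match}
), "\n";
```

For the input above this should stop at the outer </li> and print `<ul><li>one</ul> MATCH HERE`, and the original well-formed sample still gives the same result as before. It is still a hand-rolled loop, so it only covers the optional end tags it was told about.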

      And that, class, is why all sane people use a properly tested HTML parser and don't try to roll their own with regexen . . .

      Update: Oh he is. Never mind me . . . %) Perhaps this is why sane people avoid having to parse HTML if they can avoid it. :)

        According to Wikipedia,

        In computer science and linguistics, parsing (more formally syntax analysis) is the process of analyzing a sequence of tokens to determine its grammatical structure with respect to a given formal grammar.

        While using a tokenizer is a step in the right direction, he did roll his own parser (the while loop).