Unfortunately that example does not work, unless the OP just wants the plain text between the div tags. Divs generally contain all sorts of HTML, not just text, and that example will lose all of it. You can do it like this with HTML::Parser:

    {
        package MyParser;
        use base 'HTML::Parser';

        # Start tag: keep it if we are inside a div, then open a new
        # entry when the tag itself is a div.
        sub start {
            my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
            $self->{divs}->[-1] .= $origtext if $self->{dc};
            if ( $tagname eq 'div' ) {
                push @{$self->{divs}}, '';
                $self->{dc}++;
            }
        }

        # End tag: a closing div lowers the counter; keep the tag only
        # if we are still inside a div afterwards.
        sub end {
            my ($self, $tagname, $origtext) = @_;
            $self->{dc}-- if $tagname eq 'div';
            $self->{divs}->[-1] .= $origtext if $self->{dc};
        }

        # Plain text inside a div.
        sub text {
            my ($self, $origtext, $is_cdata) = @_;
            $self->{divs}->[-1] .= $origtext if $self->{dc};
        }

        # Comments arrive without their delimiters, so put them back.
        sub comment {
            my ($self, $origtext) = @_;
            $self->{divs}->[-1] .= "<!--$origtext-->" if $self->{dc};
        }
    }

    my $p = MyParser->new;
    $p->parse($content);
    $p->eof;    # signal end of document so buffered text is flushed

    # WARNING this array deref will die if we have not put anything
    # in (i.e. no divs) as we will try to deref an undefined value
    if ( exists $p->{divs} ) {
        print "($_)\n" for @{$p->{divs}};
        undef $p->{divs};    # prevent leaks, and accumulating in $p object
    }

Try your example on this HTML:

    my $content = '
    <html>
    <div>foo <!-- comment here --> </div>
    <div id="foo">bar <a href="hello">somestuff</a> </div>
    </html>
    ';

cheers

tachyon

Re^3: Many matches to an array
by wfsp (Abbot) on Jul 15, 2004 at 12:28 UTC
    Yes, absolutely. And nested divs. And shed loads of whitespace. Which makes the regex route, IMHO, even more scary.
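    To make the nested div problem concrete, here is a minimal sketch using only core Perl. The sample string and the pattern are illustrative assumptions, not the code from the earlier post; the pattern is just a typical non-greedy attempt.

```perl
use strict;
use warnings;

# Hypothetical sample input: one div nested inside another.
my $html = '<div id="outer">a <div id="inner">b</div> c</div>';

# A typical non-greedy pattern: the capture stops at the FIRST
# closing </div> it meets, which belongs to the inner div.
my @matches = $html =~ m{<div[^>]*>(.*?)</div>}gs;

# The outer div is cut short, the inner div never gets its own
# entry, and the trailing " c" is lost altogether.
print "($_)\n" for @matches;
```

    The single match ends at the inner closing tag, so neither div comes back intact.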
    I was attempting to make a point with a simple case. But, as you point out, it was probably more misleading than helpful.
    I must have a look at HTML::Parser. It appears to be the parser of choice 'round these parts.
    Thanks, wfsp

      HTML::TokeParser is just a wrapper on top of HTML::Parser. The example I gave uses the version 2 API, where you subclass and override the callback methods. You can also pass handlers to the version 3 style new constructor, but I find that is not really as clear. That bit of logic (increment a counter on an open tag, decrement on a close tag, keep the text while the counter is on, skip it when it is not, and put the comment delimiters back) gets you a lot of mileage. HTML::TokeParser is far easier to use and much more pleasing to the eye, but you lose some of the full power of HTML::Parser.
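
      For comparison, here is a rough sketch of the same counter logic written with version 3 handlers. The names @divs, $depth and $append are made up for illustration, and the guard against a stray closing div is an addition that the subclass version does not have.

```perl
use strict;
use warnings;
use HTML::Parser ();

my @divs;         # one entry per <div> seen (illustrative name)
my $depth = 0;    # how many <div> tags are currently open

# Append the original source text to the most recent div entry,
# but only while we are inside at least one div.
my $append = sub {
    my ($text) = @_;
    $divs[-1] .= $text if $depth;
};

my $p = HTML::Parser->new(
    api_version => 3,
    start_h => [
        sub {
            my ($tagname, $text) = @_;
            $append->($text);
            if ( $tagname eq 'div' ) {
                push @divs, '';
                $depth++;
            }
        },
        'tagname, text',
    ],
    end_h => [
        sub {
            my ($tagname, $text) = @_;
            # Assumption: ignore a stray </div> rather than let the
            # counter go negative.
            $depth-- if $tagname eq 'div' && $depth > 0;
            $append->($text);
        },
        'tagname, text',
    ],
    text_h    => [ $append, 'text' ],
    comment_h => [ $append, 'text' ],
);

my $content = '<html>
<div>foo <!-- comment here --> </div>
<div id="foo">bar <a href="hello">somestuff</a> </div>
</html>
';

$p->parse($content);
$p->eof;    # signal end of document

print "($_)\n" for @divs;
```

      With the 'text' argspec each handler receives the original source text, delimiters included, so the comment handler does not need to rebuild the comment markers by hand.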

      cheers

      tachyon