in reply to Using TokeParser with embedded tags
Using the HTML::TokeParser that came with ActivePerl 5.8.4 (2.28) and this code:

    my $html = '<html><font class="copy">Name<br />Address<br />Country</font></html>';
    use HTML::TokeParser;
    my $p2 = new HTML::TokeParser(\$html);
    $p2->{textify}{br} = sub { "\n-\n" };   # substitution text for the "br" tag
    while (my $token = $p2->get_tag("font")) {
        my $text = $p2->get_text("/font");
        print $text;
    }

I get

    Name
    -
    Address
    -
    Country

which seems usable to me. Using an older version (2.05, though that's actually irrelevant), I get

    Name<br />Address<br />Country

but the cause of that appears to be that its HTML::Parser doesn't grok the XHTML style tags: "<br />". With "<br>" both versions act the same.

Anyway, with the newest version and using get_trimmed_text(), I get

    Name - Address - Country

so even my own substitution text gets the same treatment as any other plain text in the HTML: my own inserted newlines are replaced with a space. That behaviour doesn't look too useful to me.
How must one distinguish between the "just whitespace" newlines in the HTML and the meaningful ones representing the linebreak tag? You can't. Bah.
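One possible workaround, sketched here and not taken from the original post: have the textify callback return a marker that contains no whitespace, let get_trimmed_text() collapse the ordinary whitespace, and only then turn the marker into a newline. The marker string `[[BR]]` and the helper name `font_text_with_breaks` are made up for this illustration, and the sketch assumes the marker never occurs in the real text. It uses "<br>", which behaves the same in both versions mentioned above.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical marker: any string without whitespace that cannot occur in the real text.
my $BR_MARK = '[[BR]]';

# Return the trimmed text of every <font> element, with each <br> turned into "\n".
sub font_text_with_breaks {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html);

    # The marker is not whitespace, so get_trimmed_text() leaves it alone.
    $p->{textify}{br} = sub { $BR_MARK };

    my @chunks;
    while ($p->get_tag("font")) {
        my $text = $p->get_trimmed_text("/font");
        # Only now convert the markers (and any spaces around them) to newlines.
        $text =~ s/\s*\Q$BR_MARK\E\s*/\n/g;
        push @chunks, $text;
    }
    return @chunks;
}

my $html = '<html><font class="copy">Name<br>Address<br>Country</font></html>';
print "$_\n" for font_text_with_breaks($html);
```

With the sample HTML this should print the three fields on separate lines, while stray newlines in the source HTML still get collapsed to single spaces.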