in reply to Using TokeParser with embedded tags

When I try this:
    my $html = '<html><font class="copy">Name<br />Address<br />Country</font></html>';

    use HTML::TokeParser;
    my $p2 = HTML::TokeParser->new(\$html);
    $p2->{textify}{br} = sub { "\n-\n" };   # substitution text for the "br" tag

    while (my $token = $p2->get_tag("font")) {
        my $text = $p2->get_text("/font");
        print $text;
    }
using the HTML::TokeParser that came with ActivePerl 5.8.4 (version 2.28), I get
    Name
    -
    Address
    -
    Country
which seems usable to me. Using an older version (2.05, though that's actually irrelevant), I get
Name<br />Address<br />Country
but the cause of that appears to be that its HTML::Parser doesn't grok the XHTML style tags: "<br />". With "<br>" both versions act the same.

Anyway, with the newest version and using get_trimmed_text(), I get

Name - Address - Country
so even my own substitution text gets the same treatment as any other plain text in the HTML, replacing my own inserted newlines with a space. That behaviour doesn't look too useful to me.

How is one supposed to distinguish between the "just whitespace" newlines in the HTML and the meaningful ones representing the linebreak tag? With get_trimmed_text() alone, you can't. Bah.
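
One possible way around it, as a rough sketch only: have the textify sub return a marker that contains no whitespace, let get_trimmed_text() collapse the real whitespace, and swap the marker for a newline afterwards. The marker string and the helper name text_with_breaks are made up for this example, and it assumes a TokeParser version that accepts a code ref in textify (as the 2.28 above does).

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical marker: any string without whitespace that cannot
# occur in the real text would do.
my $MARK = "\x{1}BR\x{1}";

# Hypothetical helper: returns the trimmed text of every <$tag> element,
# with each <br> turned into a real newline.
sub text_with_breaks {
    my ($html, $tag) = @_;
    my $p = HTML::TokeParser->new(\$html) or die "cannot set up parser";
    $p->{textify}{br} = sub { $MARK };    # survives whitespace collapsing

    my @out;
    while ($p->get_tag($tag)) {
        my $text = $p->get_trimmed_text("/$tag");
        # Only now turn the markers into meaningful newlines,
        # dropping any plain whitespace left around them.
        $text =~ s/\s*\Q$MARK\E\s*/\n/g;
        push @out, $text;
    }
    return @out;
}

my $html = '<html><font class="copy">Name<br />Address<br />Country</font></html>';
print "$_\n" for text_with_breaks($html, 'font');
```

The catch is that the marker has to be something that can never show up in the document text itself, so this is a workaround rather than a fix for get_trimmed_text().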