Re: Using TokeParser with embedded tags

When I try this:

my $html = '<html><font class="copy">Name<br />Address<br />Country</font></html>';
use HTML::TokeParser;
my $p2 = HTML::TokeParser->new(\$html);
$p2->{textify}{br} = sub { "\n-\n" };  # substitution text for the "br" tag
while (my $token = $p2->get_tag("font"))
{
    my $text = $p2->get_text("/font");
    print $text;

}

using the HTML::TokeParser that came with ActivePerl 5.8.4 (version 2.28), I get

Name
-
Address
-
Country

which seems usable to me. Using an older version (2.05, though that's actually irrelevant), I get

Name<br />Address<br />Country

but the cause appears to be that the HTML::Parser underneath it doesn't grok XHTML-style empty tags such as "<br />". With "<br>", both versions act the same.

Anyway, with the newest version and using get_trimmed_text(), I get

Name - Address - Country

so even my own substitution text gets the same treatment as any other plain text in the HTML: the newlines I inserted are each replaced with a space. That behaviour doesn't look too useful to me.
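
To make the effect concrete, here is a minimal sketch of what I assume get_trimmed_text() boils down to: take the result of get_text(), trim both ends, and squash every run of whitespace into one space. The helper name collapse_whitespace is made up for illustration and is not part of HTML::TokeParser; the real method may differ in detail.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::TokeParser): mimics the
# whitespace handling assumed above for get_trimmed_text().
sub collapse_whitespace {
    my ($text) = @_;
    $text =~ s/^\s+//;     # drop leading whitespace
    $text =~ s/\s+$//;     # drop trailing whitespace
    $text =~ s/\s+/ /g;    # every whitespace run, newlines included, becomes one space
    return $text;
}

# The text as get_text() returned it with the "br" substitution in place.
my $raw = "Name\n-\nAddress\n-\nCountry";
print collapse_whitespace($raw), "\n";   # prints: Name - Address - Country
```

By that point the substituted "\n" is just another whitespace character, so nothing marks it as coming from a tag rather than from the source formatting.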

How is one supposed to distinguish between the "just whitespace" newlines in the HTML and the meaningful ones representing the linebreak tag? You can't. Bah.
