Using TokeParser with embedded tags

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm trying to process the following html -

html code
...
<font class="copy">Name<br />Address<br />Country</font>
...
html code

using this code -

my $p2 = new HTML::TokeParser(\$res->{_content} );
while (my $token = $p2->get_tag("font")) 
{
    my $text = $p2->get_trimmed_text("/font");


}

but the last line above returns -
NameAddressCountry
instead of -
Name<br />Address<br />Country
I want to put the name, address, and country each into its own variable, but I can't split the returned string because the delimiting br tags are missing.
Any suggestions?


Re: Using TokeParser with embedded tags by Ovid (Cardinal) on Aug 22, 2004 at 15:54 UTC
What version of HTML::TokeParser and HTML::Parser do you have? This works:

my $html = '<font class="copy">Name<br />Address<br />Country</font>';
my $p2 = new HTML::TokeParser(\$html);
while (my $token = $p2->get_tag("font"))
{
    my $text = $p2->get_trimmed_text("/font");
    print $text;
}

That prints `Name Address Country`.

Also note that HTML::Parser parses text in 512K chunks. Is there a possibility that your actual program is hitting this barrier? See the `$parser->unbroken_text` method in the `HTML::Parser` docs.

Cheers,
Ovid
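For anyone unsure how to answer the version question, a minimal sketch follows. It assumes both modules are installed and relies only on the standard `VERSION` method that every loaded Perl module inherits.

```perl
use strict;
use warnings;
use HTML::Parser ();
use HTML::TokeParser ();

# Print the installed version of each module.
# VERSION is inherited from UNIVERSAL and returns the module's $VERSION.
for my $module (qw(HTML::Parser HTML::TokeParser)) {
    printf "%-18s %s\n", $module, $module->VERSION;
}
```

Posting these two numbers alongside the failing code makes it easier to tell a version difference from a bug in the script.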
Re^2: Using TokeParser with embedded tags by Baz (Friar) on Aug 22, 2004 at 17:18 UTC
my $token = $p2->get_tag("font");
my $name = $p2->get_text();
$token = $p2->get_tag("br");
my $address = $p2->get_text();
$token = $p2->get_tag("br");
my $country = $p2->get_text();
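The same idea can be generalised so that the number of br tags does not have to be known in advance. The following is only a sketch: `font_fields` is a hypothetical helper name, and it assumes a reasonably recent HTML::Parser that reports `<br />` as a start tag named `br`. It walks the raw tokens with `get_token` and starts a new field at every br. Note that `get_token` returns text without decoding entities.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: return the text inside the next <font> element,
# split into one field per <br>-separated segment.
sub font_fields {
    my ($parser) = @_;
    $parser->get_tag('font') or return;

    my @fields = ('');
    while (my $token = $parser->get_token) {
        my ($type, $value) = @$token;
        if ($type eq 'T') {
            $fields[-1] .= $value;           # text token: append to current field
        }
        elsif ($type eq 'S' && $value eq 'br') {
            push @fields, '';                # <br>: start a new field
        }
        elsif ($type eq 'E' && $value eq 'font') {
            last;                            # </font>: done
        }
    }
    for (@fields) { s/^\s+//; s/\s+$//; }    # trim each field
    return @fields;
}

my $html = '<font class="copy">Name<br />Address<br />Country</font>';
my $p2   = HTML::TokeParser->new(\$html) or die "Cannot create parser";
my ($name, $address, $country) = font_fields($p2);
print join('|', $name, $address, $country), "\n";
```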
Re: Using TokeParser with embedded tags by bart (Canon) on Aug 23, 2004 at 01:52 UTC
When I try this:

my $html = '<html><font class="copy">Name<br />Address<br />Country</font></html>';
use HTML::TokeParser;
my $p2 = new HTML::TokeParser(\$html);
$p2->{textify}{br} = sub { "\n-\n" };   # substitution text for the "br" tag
while (my $token = $p2->get_tag("font")) {
    my $text = $p2->get_text("/font");
    print $text;
}

using the HTML::TokeParser that came with ActivePerl 5.8.4 (2.28), I get

Name
-
Address
-
Country

which seems usable to me. Using an older version (2.05, though that's actually irrelevant), I get

Name<br />Address<br />Country

but the cause of that appears to be that its HTML::Parser doesn't grok the XHTML style tags: "`<br />`". With "`<br>`" both versions act the same.

Anyway, with the newest version and using get_trimmed_text(), I get

Name - Address - Country

so even my own substitution text gets the same treatment as any other plain text in the HTML: my inserted newlines are replaced with a space. That behaviour doesn't look too useful to me. How is one supposed to distinguish between the "just whitespace" newlines in the HTML and the meaningful ones representing the linebreak tag? You can't. Bah.
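One possible way around the whitespace problem, offered as a sketch rather than a definitive fix: have the br substitution return a non-whitespace sentinel, so that `get_trimmed_text()` cannot collapse it, and split on that sentinel afterwards. This relies on the same code-reference form of `$p2->{textify}` used above; the choice of `"\x1F"` (the ASCII unit separator) as the sentinel is an assumption and only works if that character never occurs in the real text.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Assumption: this control character never appears in the document text.
my $SEP = "\x1F";

my $html = qq{<font class="copy">Name<br />\nAddress  line<br />Country</font>};
my $p2   = HTML::TokeParser->new(\$html) or die "Cannot create parser";

# Replace every <br> with the sentinel instead of whitespace.
$p2->{textify}{br} = sub { $SEP };

my @fields;
if ($p2->get_tag("font")) {
    # Whitespace is collapsed as usual, but the sentinel survives.
    my $text = $p2->get_trimmed_text("/font");
    @fields = split /\Q$SEP\E/, $text;
    for (@fields) { s/^\s+//; s/\s+$//; }   # trim each field
}

print "$_\n" for @fields;
```

Ordinary newlines in the HTML are still folded into single spaces, while each br becomes a field boundary, so the two kinds of line break stay distinguishable.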