Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm trying to process the following html -
html code ... <font class="copy">Name<br />Address<br />Country</font> ... html code

using this code -
my $p2 = new HTML::TokeParser(\$res->{_content});
while (my $token = $p2->get_tag("font")) {
    my $text = $p2->get_trimmed_text("/font");
}

but the last line above returns -
NameAddressCountry
instead of -
Name<br />Address<br />Country
I want to place the Name, Address and Country each in their own variable, but I can't split the returned string because it is missing the delimiting br tags.
Any suggestions?

Re: Using TokeParser with embedded tags
by Ovid (Cardinal) on Aug 22, 2004 at 15:54 UTC

    What version of HTML::TokeParser and HTML::Parser do you have? This works:

    my $html = '<font class="copy">Name<br />Address<br />Country</font>';
    my $p2 = new HTML::TokeParser(\$html);
    while (my $token = $p2->get_tag("font")) {
        my $text = $p2->get_trimmed_text("/font");
        print $text;
    }

    That prints Name Address Country.

    Also note that HTML::Parser parses text in 512K chunks. Is there a possibility that your actual program is hitting this limit? See the $parser->unbroken_text method in the HTML::Parser docs.

    Cheers,
    Ovid


      my $token = $p2->get_tag("font");
      my $name = $p2->get_text();
      $token = $p2->get_tag("br");
      my $address = $p2->get_text();
      $token = $p2->get_tag("br");
      my $country = $p2->get_text();
Re: Using TokeParser with embedded tags
by bart (Canon) on Aug 23, 2004 at 01:52 UTC
    When I try this:
    my $html = '<html><font class="copy">Name<br />Address<br />Country</font></html>';
    use HTML::TokeParser;
    my $p2 = new HTML::TokeParser(\$html);
    $p2->{textify}{br} = sub { "\n-\n" };   # substitution text for the "br" tag
    while (my $token = $p2->get_tag("font")) {
        my $text = $p2->get_text("/font");
        print $text;
    }
    using the HTML::TokeParser that came with ActivePerl 5.8.4 (2.28), I get
    Name
    -
    Address
    -
    Country
    which seems usable to me. Using an older version (2.05, though that's actually irrelevant), I get
    Name<br />Address<br />Country
    but the cause of that appears to be that its HTML::Parser doesn't grok the XHTML style tags: "<br />". With "<br>" both versions act the same.

    Anyway, with the newest version and using get_trimmed_text(), I get

    Name - Address - Country
    so even my own substitution text gets the same treatment as any other plain text in the HTML, replacing my own inserted newlines with a space. That behaviour doesn't look too useful to me.

    How must one distinguish between the "just whitespace" newlines in the HTML and the meaningful ones representing the linebreak tag? You can't. Bah.
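
One possible way around this, sketched under assumptions rather than offered as the definitive fix: make the substitution text a non-whitespace marker instead of a newline, so that get_trimmed_text() cannot collapse it, and then split on that marker. The "|" separator and the variable names below are illustrative choices, and the sketch assumes the marker never occurs in the real text and that your HTML::TokeParser accepts a code reference in textify, as in the snippet above.

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = '<font class="copy">Name<br />Address<br />Country</font>';

# Assumption: this marker never appears in the text being extracted.
my $sep = '|';

my $p = HTML::TokeParser->new(\$html) or die "Cannot create parser";

# Turn each <br> start tag into the marker instead of dropping it.
$p->{textify}{br} = sub { $sep };

my @fields;
while (my $token = $p->get_tag("font")) {
    my $text = $p->get_trimmed_text("/font");

    # Split on the literal marker, then trim stray whitespace per field.
    @fields = split /\Q$sep\E/, $text;
    for my $field (@fields) {
        $field =~ s/^\s+//;
        $field =~ s/\s+$//;
    }
}

my ($name, $address, $country) = @fields;
print "$_\n" for $name, $address, $country;
```

If "|" can occur in the data, a longer and more unusual marker string would be a safer pick.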