When I try this:
my $html = '<html><font class="copy">Name<br />Address<br />Country</font></html>';
use HTML::TokeParser;
my $p2 = HTML::TokeParser->new(\$html);
$p2->{textify}{br} = sub { "\n-\n" };   # substitution text for the "br" tag
while (my $token = $p2->get_tag("font")) {
    my $text = $p2->get_text("/font");
    print $text;
}
using the HTML::TokeParser that came with ActivePerl 5.8.4 (version 2.28), I get
Name
-
Address
-
Country
which seems usable to me. Using an older version (2.05, though that's actually irrelevant), I get
Name<br />Address<br />Country
but that appears to be because its HTML::Parser doesn't grok XHTML-style tags such as "<br />". With a plain "<br>", both versions behave the same.

Anyway, with the newest version and using get_trimmed_text(), I get

Name - Address - Country
so even my own substitution text gets the same treatment as any other plain text in the HTML, replacing my own inserted newlines with a space. That behaviour doesn't look too useful to me.
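
The collapsing is easy to reproduce without the parser. The sketch below assumes that get_trimmed_text() does little more than trim both ends and squeeze each run of whitespace into a single space, which is consistent with the output above. The name `trimmed` is made up for illustration and is not part of the module.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the whitespace handling in get_trimmed_text();
# an assumption based on the observed output, not copied from the module.
sub trimmed {
    my ($text) = @_;
    $text =~ s/^\s+//;     # drop leading whitespace
    $text =~ s/\s+$//;     # drop trailing whitespace
    $text =~ s/\s+/ /g;    # squeeze every whitespace run, newlines included
    return $text;
}

my $text = "Name\n-\nAddress\n-\nCountry";   # what get_text() returned
print trimmed($text), "\n";                  # prints: Name - Address - Country
```

Since "\n" matches \s just like a space does, the newlines coming from the substitution sub cannot survive that last step.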

How is one supposed to distinguish between the "just whitespace" newlines in the HTML and the meaningful ones representing the linebreak tag? You can't. Bah.
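
One possible way around it, as a sketch only: have the substitution sub return a marker that is not whitespace, let get_trimmed_text() do its collapsing, and turn the marker into a newline afterwards. The marker character "\x01" and the names `$MARK` and `@blocks` are my own assumptions; the marker must not occur anywhere in the real text.

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = '<html><font class="copy">Name<br />Address<br />Country</font></html>';
my $MARK = "\x01";   # hypothetical sentinel; assumed absent from the document

my $p = HTML::TokeParser->new(\$html);
$p->{textify}{br} = sub { $MARK };   # emit the sentinel instead of a newline

my @blocks;
while ($p->get_tag("font")) {
    my $text = $p->get_trimmed_text("/font");   # only real whitespace is collapsed
    $text =~ s/ ?\Q$MARK\E ?/\n/g;              # sentinels become line breaks
    push @blocks, $text;
    print "$text\n";
}
```

With a parser that understands "<br />", I would expect this to print the three fields on separate lines. The optional spaces in the substitution swallow any collapsed whitespace left next to a marker.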


In reply to Re: Using TokeParser with embedded tags by bart
in thread Using TokeParser with embedded tags by Anonymous Monk
