trashtalker has asked for the wisdom of the Perl Monks concerning the following question:

I am using HTML::TreeBuilder to parse some HTML.
However, when I use "as_text()" on this html:
<p>The Cow Jumped<br>Over The Moon</p>

it turns into:
The Cow JumpedOver The Moon

How can I get it to return this instead:
The Cow Jumped Over The Moon

Replies are listed 'Best First'.
Re: HTML::Element and "<br>" tags
by ww (Archbishop) on Oct 10, 2010 at 19:03 UTC
    The deletion of <br> is no different than the deletion of the paragraph tags. Even the doc for the module (HTML::TreeBuilder) mentioned in your description of your problem (See ' ...and NOT just BTW') tells you quite plainly where to find the answer to this question:
    ...The methods inherited from HTML::Parser are used for building the HTML tree, and the methods inherited from HTML::Element are what you use to scrutinize the tree. Besides this (HTML::TreeBuilder) documentation, you must also carefully read the HTML::Element documentation, and also skim the HTML::Parser documentation -- probably only its parse and parse_file methods are of interest.

    Loosely put (in the context of your question and of HTML::Element and HTML::Parser), since <br> is NOT "as_text" - it's html - neither it nor any substitute appears in the output:

    $h->as_text(skip_dels => 1)
    Returns a string consisting of only the text parts of the element's descendants. (Emphasis supplied)

    ...and NOT just BTW,
    you've offered no code and your narrative cites a different module (HTML::TreeBuilder) than your title (HTML::Element). Those shortcomings make offering help more difficult than necessary.

    Edited: removed irrelevant detail from the second quote from the docs; emphasized the relevant part.

Re: HTML::Element and "<br>" tags
by Khen1950fx (Canon) on Oct 10, 2010 at 16:35 UTC
    Since you have a line break after Jumped, I did this with a line break:
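    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TreeBuilder;
    use HTML::FormatText;

    # Slurp the HTML from the DATA section. parse_file() expects a file
    # name, so the markup itself is handed to new_from_content() instead.
    my $html = do { local $/; <DATA> };
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # HTML::FormatText renders <br> as a line break in the text output.
    my $formatter = HTML::FormatText->new(leftmargin => 3, rightmargin => 72);
    print $formatter->format($tree);

    __DATA__
    <p>The Cow Jumped<br>Over The Moon</p>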
Re: HTML::Element and "<br>" tags
by Anonymous Monk on Oct 10, 2010 at 14:26 UTC
    I haven't looked at how as_text strips or converts elements, but a simple solution is to: before calling that method, replace every br element with a text node containing just a space (or whatever is visually suitable). That might require walking the entire tree, unless TreeBuilder lets you hook into it with a callback or something.
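A minimal sketch of that idea, assuming HTML::TreeBuilder is installed. The helper name text_with_br is made up for illustration and is not part of HTML::Tree; look_down, replace_with and as_text are the HTML::Element methods doing the work. No manual tree walk seems to be needed, since look_down in list context returns every matching element.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper (not part of HTML::Tree): returns the text of an HTML
# fragment with every <br> replaced by $sep before as_text() runs.
sub text_with_br {
    my ($html, $sep) = @_;
    $sep = ' ' unless defined $sep;

    my $tree = HTML::TreeBuilder->new_from_content($html);

    # In list context look_down() returns all matching elements.
    for my $br ($tree->look_down(_tag => 'br')) {
        $br->replace_with($sep);    # a plain string becomes a text node
    }

    my $text = $tree->as_text;
    $tree->delete;                  # release the tree explicitly
    return $text;
}

print text_with_br('<p>The Cow Jumped<br>Over The Moon</p>'), "\n";
```

Passing "\n" as the second argument would keep the line break instead of collapsing it to a space.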
Re: HTML::Element and "<br>" tags
by ambrus (Abbot) on Oct 11, 2010 at 13:19 UTC

    I found some code that does just that in a script I use. It uses the XML::Twig interface, for I prefer that to HTML::Tree. Here's the subroutine, plus some code to show how it should be invoked. Of course, you may need to also teach this about some elements other than br.

    use warnings;
    use strict;
    use 5.010;
    use XML::Twig;

    sub html_text {
        my($el) = @_;
        my $r = "";
        for my $n ($el->descendants_or_self) {
            if ($n->is_text) {
                $r .= $n->trimmed_text;
            } elsif ("br" eq $n->gi) {
                $r .= "\n";
            }
        }
        $r;
    }

    my $tw = XML::Twig->new;
    $tw->parse_html(q(<p>The Cow Jumped<br>Over the Moon</p>));
    say html_text($tw->root);
Re: HTML::Element and "<br>" tags
by Anonymous Monk on Oct 11, 2010 at 13:08 UTC

    It’s true: “tags” are removed, and <br> is a tag.

    When faced with this problem... I cheated.   I used this regex: s/<br>/ /g ... which turns all of those tags into one blank space.

    “Well, on the one hand, it’s a butt-ugly hack that you would never want to tell your mother about.   But, on the other hand ... it works.”

      Oh, fudge.   Perlmonks logged me out.   You probably have to backslash-escape the left and right angle brackets.   (But you already knew that, didn’t you?)
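For what it's worth, a sketch of that substitution applied to the raw HTML before it is parsed. In Perl, < and > are not regex metacharacters, so they can be written without backslashes; the s{...}{...} delimiters are used here only to avoid escaping the optional slash. The helper name br_to_space is made up for illustration, and the pattern deliberately ignores <br> tags that carry attributes, so treat it as a quick hack rather than a parser.

```perl
use strict;
use warnings;

# Hypothetical helper: turn <br>, <BR>, <br/> and <br /> into one space.
# Does not match <br> tags with attributes.
sub br_to_space {
    my ($html) = @_;
    $html =~ s{<br\s*/?>}{ }gi;
    return $html;
}

print br_to_space('<p>The Cow Jumped<br>Over The Moon</p>'), "\n";
```

The result is still HTML, so it would be handed to HTML::TreeBuilder afterwards and as_text called as before.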