SpacemanSpiff has asked for the wisdom of the Perl Monks concerning the following question:

How do you specify the tags to textify? Here's the 411 from the manpages:

The text might span tags that should be textified. This is controlled by the $p->{textify} attribute, which is a hash that defines how certain tags can be treated as text. If the name of a start tag matches a key in this hash then this tag is converted to text. The hash value is used to specify which tag attribute to obtain the text from. If this tag attribute is missing, then the upper case name of the tag enclosed in brackets is returned, e.g. "[IMG]". The hash value can also be a subroutine reference. In this case the routine is called with the start tag token content as its argument and the return value is treated as the text.

The default $p->{textify} value is: {img => "alt", applet => "alt"}. This means that <IMG> and <APPLET> tags are treated as text, and that the text to substitute can be found in the ALT attribute.

Ok, so I'm using the following command to grab the text between the previously fetched tag and the next </table> tag:

my $text = $stream->get_text ("/table");

I want the script to ignore all <br> tags within the retrieved text, but wipe out the rest of the HTML. After reading the above, the best option in my case seems to be textify (HTML is naturally wiped out with TokeParser). The question is, how do I specify the tags I want ignored?  $text->{textify}("br");? Can someone more familiar with this command set help me out?

Thankyas!

Re: Tokeparser Textify Command
by Aristotle (Chancellor) on Nov 10, 2005 at 04:15 UTC

    Seems like you need

    $text->{textify} = { br => '' };

    This is nothing specific to the textify feature; it’s simply an anonymous hash. See perlreftut.

    Makeshifts last the longest.

      Wouldn't $text->{textify} = { br => '' }; swap br tags with a blank space? I've tried that as well as a few variations (like putting br in the single quotes or just having br alone in the curly brackets), all to no avail.

        The documentation you just quoted clearly states that the value of the key specifies which attribute the replacement text should be taken from (so by default, it replaces <img> tags with the text in their alt attribute). Since you want no replacement text, an empty string should be the appropriate choice. (Not that it matters, since <br> has no attributes to pick replacement text out of.)

        You could be more specific about “no avail” – what is happening and how does it contradict your expectations?

        Makeshifts last the longest.
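
A minimal sketch of one way this could be wired up, assuming the goal is to keep literal <br> tags in the result while every other tag is dropped. Two things in it differ from the snippets above. First, the quoted documentation describes textify as an attribute of the parser ($p->{textify}), so the sketch sets it on $stream before calling get_text rather than on $text, which is only the returned string. Second, since the documentation says a missing attribute yields the bracketed upper-case tag name, a plain string value for br would probably not produce the tag itself, so a subroutine reference is used instead. The sample HTML and the function name table_text_keeping_br are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample input; the original post does not show its HTML.
my $sample = '<table><tr><td>line one<br>line <b>two</b></td></tr></table>';

# Hypothetical helper: returns the text inside the first <table>,
# keeping <br> tags and dropping all other markup.
sub table_text_keeping_br {
    my ($html) = @_;

    my $stream = HTML::TokeParser->new(\$html)
        or die "could not create parser";

    # Add a br entry to the parser's textify hash (the img/applet
    # defaults stay in place). The sub's return value is used as
    # the replacement text for each <br> start tag.
    $stream->{textify}{br} = sub { '<br>' };

    # Skip ahead to the opening <table>, then collect text up to </table>.
    return '' unless $stream->get_tag('table');
    return $stream->get_text('/table');
}

print table_text_keeping_br($sample), "\n";
```

If the aim is instead to drop <br> without leaving any trace, returning an empty string from the sub should work the same way.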