in reply to How to extract untouched content of html tag with HTML::Parser

Lana:

I've not used it in a while, but as I read the documentation, I'd suggest passing "text" rather than "dtext" to the handler specification so it can print the original text rather than the decoded text.

...roboticus

Re^2: How to extract untouched content of html tag with HTML::Parser
by Lana (Beadle) on Nov 28, 2010 at 16:11 UTC
    I wish it was that simple :) But it isn't :(
      It is that easy. You have a logic error. Your start handler, which you call start_handler, does no printing. Your text handler does print, but as documented, the text handler handles text, not start tags. Also, your end handler does no printing.
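      For what it's worth, here is a minimal sketch of that idea, not your actual script: all three handlers (start, text, end) are given the "text" argspec and each one appends the original source text while the parser is inside the wanted element. The helper name extract_raw, the $depth counter and the sample HTML are made up for illustration.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: return the untouched source of every <$tag> element.
sub extract_raw {
    my ($html, $tag) = @_;
    my $depth = 0;     # greater than 0 while we are inside the wanted element
    my $out   = '';

    my $p = HTML::Parser->new(
        api_version => 3,
        # Start tags: "text" is the tag exactly as it appeared in the source.
        start_h => [ sub {
            my ($tagname, $text) = @_;
            $depth++ if $tagname eq $tag;
            $out .= $text if $depth;
        }, 'tagname, text' ],
        # End tags: append first, then leave the element.
        end_h => [ sub {
            my ($tagname, $text) = @_;
            $out .= $text if $depth;
            $depth-- if $depth && $tagname eq $tag;
        }, 'tagname, text' ],
        # Text: "text" (not "dtext") keeps entities such as &amp; undecoded.
        text_h => [ sub { $out .= $_[0] if $depth }, 'text' ],
    );

    $p->parse($html);
    $p->eof;
    return $out;
}

print extract_raw('<p>skip</p><div id="a">Tom &amp; <b>Jerry</b></div>', 'div'), "\n";
```

      Note that this sketch only registers start, text and end handlers, so comments or declarations inside the element would be dropped; a default handler with the same "text" argspec would be one way to catch those too.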
        OMG!!! I can't believe I was that blind! Thank you very much! :))

      OK, then, did you look at the hstrip example in the distribution? The documentation (at the end of the EXAMPLES section) indicates that you can modify it to do what you want:

      More examples are found in the eg/ directory of the HTML-Parser distribution: the program hrefsub shows how you can edit all links found in a document; the program htextsub shows how to edit the text only; the program hstrip shows how you can strip out certain tags/elements and/or attributes; and the program htext show how to obtain the plain text, but not any script/style content.

      ...roboticus

        Yes, I examined all the examples and played with them a lot, but I still can't get what I need. I can't understand why using 'text' instead of 'dtext' produces the same result: plain text, rather than the untouched content of that HTML tag...