dstar has asked for the wisdom of the Perl Monks concerning the following question:

I'd like to convert HTML into text (as HTML::FormatText does), but retain _some_ of the formatting. Namely, I'd like to keep the color changes (i.e., emit curses color codes to change the color at that point, and switch back to whatever it was before afterwards).

Anyone know of a module that can do this? Or am I going to have to parse the HTML myself?

Shalon Wood

Replies are listed 'Best First'.
Re: HTML --> text formatting with curses color codes?
by Popcorn Dave (Abbot) on Apr 07, 2005 at 19:40 UTC
    If I understand correctly what you're after, you might want to look at HTML::TokeParser. Parsing the HTML into tokens will give you all the bold, italics and color information, and you should be able to take it from there.

    Don't try to parse the HTML yourself; you'll drive yourself nuts.

    HTH!

    Useless trivia: In the 2004 Las Vegas phone book there are approximately 28 pages of ads for massage, but almost 200 for lawyers.
Re: HTML --> text formatting with curses color codes?
by tlm (Prior) on Apr 07, 2005 at 19:50 UTC

    Don't parse the HTML yourself! HTML::Parser lets you pass to the parser your own handlers for various parsing events; it would not be too hard to write handlers to do what you need.
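
    A minimal sketch of the handler approach, assuming the colour arrives as a color attribute on the start tag (as in <font color="red">). The function name colour_events and the event strings it returns are made up for illustration; a real client would emit colour codes instead of collecting strings.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: returns a flat list of the parsing events seen,
# noting any color attribute found on a start tag.
sub colour_events {
    my ($html) = @_;
    my @events;

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [
            sub {
                my ($tag, $attr) = @_;
                my $note = defined $attr->{color} ? " color=$attr->{color}" : '';
                push @events, "start $tag$note";
            },
            'tagname, attr',
        ],
        end_h  => [ sub { push @events, "end $_[0]" },  'tagname' ],
        text_h => [ sub { push @events, "text $_[0]" }, 'dtext' ],
    );

    # Deliver each run of text as one event rather than in arbitrary chunks.
    $parser->unbroken_text(1);

    $parser->parse($html);
    $parser->eof;
    return @events;
}
```

    Calling colour_events('<p>plain <font color="red">hot</font></p>') shows where a colour change would have to start and stop.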

    the lowliest monk

Re: HTML --> text formatting with curses color codes?
by Fletch (Bishop) on Apr 07, 2005 at 23:53 UTC

    You might consider that (depending on the input) you're not going to get colour information from just the HTML. You'd also have to take CSS into consideration, both from any inline style attributes and from any linked stylesheets. There is a CSS module on CPAN, but I've never used it, so I don't know how applicable it'd be here.

    But having said that, what you'd want to do is use something like HTML::TokeParser (or HTML::TokeParser::Simple, depending on personal taste) to walk through the document. When you encounter a new element, look at the style, class, and id attributes and figure out (using whatever CSS gives you) what the colour should be. You'll probably need to somehow convert that colour down to something you can display using Term::ANSIColor or Curses. Keep the current colour in a variable, and a stack of previous colours in an array. When you change the colour, push the old current colour onto the stack; when you come to the end of that element, pop the old one off and switch back to it.

    Update: Oops, you'd probably want to push both the end tag and the colour to restore onto the stack. When you get an end tag from the parser, look at the end tag on the top of the stack to determine whether you want to pop or not.
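
    A rough sketch of that stack, under some loud assumptions: colours come only from <font color="..."> with a named colour, CSS is ignored, and the %ansi_for table and the html_to_ansi name are invented for the example. It uses Term::ANSIColor for output; a Curses client would switch attributes at the same points instead of appending escape codes.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use Term::ANSIColor qw(color);

# Hypothetical table: HTML colour names we recognise => Term::ANSIColor names.
my %ansi_for = (
    red   => 'red',
    green => 'green',
    blue  => 'blue',
    yellow => 'yellow',
);

sub html_to_ansi {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "could not set up parser";

    my $current = 'reset';   # colour in effect right now
    my @stack;               # entries: [ end tag to wait for, colour to restore ]
    my $out = '';

    while (my $token = $parser->get_token) {
        my $type = $token->[0];
        if ($type eq 'T') {
            # Raw text; entities are left undecoded in this sketch.
            $out .= $token->[1];
        }
        elsif ($type eq 'S' && $token->[1] eq 'font') {
            # Push for every <font>, even one without a colour, so that
            # each </font> pops the entry of its own start tag.
            push @stack, ['font', $current];
            my $name = lc($token->[2]{color} // '');
            if (exists $ansi_for{$name}) {
                $current = $ansi_for{$name};
                $out .= color($current);
            }
        }
        elsif ($type eq 'E' && @stack && $stack[-1][0] eq $token->[1]) {
            my $restore = (pop @stack)->[1];
            if ($restore ne $current) {
                $current = $restore;
                $out .= color($current);
            }
        }
    }

    # Leave the terminal in its default colour if a tag was never closed.
    $out .= color('reset') if $current ne 'reset';
    return $out;
}
```

    For example, print html_to_ansi('a<font color="red">b<font color="blue">c</font>d</font>e'), "\n"; switches to red, then blue, back to red, and finally back to the default.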

      I'm fairly sure there's no CSS involved in this part; it's the display for a web-based chat. I had to write my own client for it since links can't deal with server-push.

      I'm using a curses module (a hacked-up version of Term::Visual to add a third pane down the right side to display currently-logged-in users), so I'll need to convert it down to curses colors. I'm not sure how to do that yet; I guess I'll just break the colors down into ranges, each range mapping to a curses color.
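
      One possible way to do the range split, as a sketch only: treat each of the red, green and blue channels as "on" when it is 128 or above, which gives eight buckets. The function name nearest_curses_colour and the 128 threshold are assumptions for illustration. In ncurses the resulting index happens to match the values of COLOR_BLACK through COLOR_WHITE, but check that against the Curses module in use.

```perl
use strict;
use warnings;

# Names of the eight basic curses colours, indexed by a 3-bit value:
# bit 0 = red, bit 1 = green, bit 2 = blue.
my @CURSES_NAME = qw(black red green yellow blue magenta cyan white);

# Hypothetical helper: map "#rrggbb" (or "rrggbb") to one of the eight names.
# Returns undef (empty list in list context) if the input is not a hex colour.
sub nearest_curses_colour {
    my ($hex) = @_;
    my ($red, $green, $blue) =
        $hex =~ /\A#?([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{2})\z/i
        or return;

    my $index = (hex($red)   >= 128 ? 1 : 0)
              | (hex($green) >= 128 ? 2 : 0)
              | (hex($blue)  >= 128 ? 4 : 0);

    return $CURSES_NAME[$index];
}
```

      With this split, nearest_curses_colour('#808000') gives 'yellow', and dark shades such as '#123456' all collapse to 'black', so a finer split may be worth trying if the chat uses many dark colours.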

      Shalon Wood