Intaglio has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I'm writing a script that will make pages displayable on Palm Pilots. Basically, I want to remove all HTML tags except hyperlinks, keeping the text. I know I'll have to open the specified .html documents and strip them of everything except the links and text.

I really don't have any idea on how to start this; if anyone could lend me a hand I'd appreciate it.

--Intaglio

Re: Removing HTML Tags?
by davorg (Chancellor) on Dec 19, 2000 at 21:29 UTC

    Your best bet would be HTML::Parser or one of its subclasses. I suspect HTML::TreeBuilder would be the best fit in this case.
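
    As a rough illustration of the HTML::Parser approach, here is a minimal sketch that keeps text and <a href> links and drops every other tag. The helper name strip_to_links_and_text and the choice to skip <script>/<style> contents are assumptions made for the example, not something from the original post. It is a starting point, not a complete Palm converter.

```perl
use strict;
use warnings;
use HTML::Parser;                        # CPAN module, not part of core Perl
use HTML::Entities qw(encode_entities);  # ships with HTML::Parser

# Hypothetical helper (name is illustrative): keep text and <a href> links,
# drop every other tag.
sub strip_to_links_and_text {
    my ($html) = @_;
    my $out     = '';
    my $skip    = 0;   # nesting depth inside <script>/<style>
    my $in_link = 0;   # true while inside an <a> tag that we emitted

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $attr) = @_;
            if ($tag eq 'script' or $tag eq 'style') { $skip++; return; }
            if ($tag eq 'a' and defined $attr->{href} and !$in_link) {
                # Attribute values arrive entity-decoded, so re-encode them.
                $out .= '<a href="' . encode_entities($attr->{href}, '<>&"') . '">';
                $in_link = 1;
            }
        }, 'tagname, attr' ],
        end_h => [ sub {
            my ($tag) = @_;
            if ($tag eq 'script' or $tag eq 'style') { $skip-- if $skip; return; }
            if ($tag eq 'a' and $in_link) { $out .= '</a>'; $in_link = 0; }
        }, 'tagname' ],
        # The 'text' argspec passes source text through with entities intact.
        text_h => [ sub { $out .= $_[0] unless $skip }, 'text' ],
    );
    $parser->parse($html);
    $parser->eof;
    return $out;
}

print strip_to_links_and_text(
    '<p>Hi <b>there</b>, see <a href="x.html" class="c">this</a>.</p>'
), "\n";
```

    Named anchors without an href are dropped along with their closing tags, so the output never contains an unmatched </a>.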

    --
    <http://www.dave.org.uk>

    "Perl makes the fun jobs fun
    and the boring jobs bearable" - me

Re: Removing HTML Tags?
by Blue (Hermit) on Dec 19, 2000 at 22:04 UTC
    You've already gotten some good advice on your question. Hopefully I can give you a few points to think about on questions you didn't ask.

    Remember HTML tags that 'serve a purpose' but aren't links or text. If you have a link that contains an IMG instead of text, you're going to need to put something in its place so there is still a link for your viewers. Perhaps for all images use the ALT attribute if it exists (or the name if it doesn't) so that the context of the picture can be kept. Look at various HTML-for-the-blind sites on the web for other ideas about translating multimedia pages into simpler forms. Most sites are no longer Lynx-compatible.
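
    The ALT fallback described above could be sketched along these lines with HTML::Parser. The helper names image_placeholder and text_with_image_placeholders, the bracketed placeholder format, and the reading of "the name" as the file name from SRC are all assumptions for illustration.

```perl
use strict;
use warnings;
use HTML::Parser;   # CPAN module, not part of core Perl

# Hypothetical helper: choose a text stand-in for an <img> tag.
# Prefers ALT text, then the file name from SRC, then a generic label.
sub image_placeholder {
    my ($attr) = @_;
    return "[$attr->{alt}]" if defined $attr->{alt} and length $attr->{alt};
    my $src = defined $attr->{src} ? $attr->{src} : '';
    my ($file) = $src =~ m{([^/]+)$};   # last path component, if any
    return '[' . (defined $file ? $file : 'image') . ']';
}

# Hypothetical helper: drop all tags, but leave a placeholder for each image.
sub text_with_image_placeholders {
    my ($html) = @_;
    my $out = '';
    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $attr) = @_;
            $out .= image_placeholder($attr) if $tag eq 'img';
        }, 'tagname, attr' ],
        text_h => [ sub { $out .= $_[0] }, 'text' ],
    );
    $parser->parse($html);
    $parser->eof;
    return $out;
}

print text_with_image_placeholders(
    '<a href="home.html"><img src="pics/home.gif" alt="Home"></a> <img src="pics/logo.gif">'
), "\n";
```

    In a real converter the placeholder would be emitted inside the surrounding link, so an image-only link stays usable on the Palm.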

    Also, much information can be encoded in the tags themselves. One example is a table, especially one with some blank cells. Just displaying the text can lose much of the context.

    You also need to figure out what to do with frames. Perhaps add a top-of-page marker saying that there are frames, along with the ability to flip through them.

    Since you're talking about stripping HTML, I'm assuming that this means that you are not going to sites written just for your Palm program. Look hard at what tags can be dropped safely and which need to have some analog.

    Good luck.

    =Blue
    ...you might be eaten by a grue...

Re: Removing HTML Tags?
by I0 (Priest) on Dec 19, 2000 at 21:28 UTC
    See
    perldoc -q HTML
Re: Removing HTML Tags? (code in Snippets Section)
by hotyopa (Scribe) on Dec 20, 2000 at 02:53 UTC