stefan k has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

This has probably been asked before, but I could not find a good solution. Sorry!

The problem is: I need to filter all the links in an HTML document through a routine that decides what to do with each one. The routine takes a) the kind of tag, b) the link in it (e.g. href in a or background in td), and c) some kind of replacement for a part of the old link.

I've been fiddling with HTML::Filter and HTML::Parser, but I have to admit that I don't fully understand the mechanism of overriding some of their subs. The "filtered_html" example from the HTML::Filter manpage seemed interesting to me, but I didn't manage to set it up.

Thanks for any help and/or clue and/or tip, and RESPECT
Stefan K

$dom = "skamphausen.de"; ## May The Open Source Be With You!
$Mail = "mail\@$dom";
$Url = "http://www.$dom";

Replies are listed 'Best First'.
Re: Filter Links in HTML-documents
by davorg (Chancellor) on Dec 19, 2000 at 21:08 UTC

    You may find that the interfaces to HTML::LinkExtor or HTML::TreeBuilder are easier to understand for the kind of task that you're doing.
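As a rough illustration of the HTML::LinkExtor route, here is a minimal sketch that lists the tag, attribute and URL of every link it finds. The sample HTML and the @found array are made up for this example. Note that HTML::LinkExtor only reports links; it does not write a modified document back out.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Sample document, invented for this illustration.
my $html = <<'HTML';
<html><body>
<a href="http://old.example.com/a.html">A</a>
<img src="pic.gif">
</body></html>
HTML

my @found;

# The callback receives the lower-cased tag name followed by
# attribute/URL pairs for the link attributes of that tag.
my $extor = HTML::LinkExtor->new(sub {
    my ($tag, %links) = @_;
    push @found, [$tag, $_, $links{$_}] for sort keys %links;
});

$extor->parse($html);
$extor->eof;

# One line per link: tag, attribute, URL.
print join("\t", @$_), "\n" for @found;
```

Because no base URL is passed to new(), relative links such as pic.gif are reported as they appear in the document.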

    P.S. You mention having to override the methods of HTML::Parser. In the new version (3.x) this is no longer necessary, so it might be worth upgrading.
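With the version 3 interface, handlers are passed to the constructor instead of being written as subclass methods. The following is only a sketch of how a link filter along the lines of the question might look. The %link_attr table, rewrite_link, escape_attr and filter_links are hypothetical names introduced here, and the old.example.com to new.example.com substitution is a stand-in for whatever decision the real routine makes.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical table of link-carrying attributes per tag; extend as needed.
my %link_attr = (
    a   => ['href'],
    img => ['src'],
    td  => ['background'],
);

# Hypothetical decision routine: gets the tag, the attribute name and the
# old link, and returns the link to write out.
sub rewrite_link {
    my ($tag, $attr, $url) = @_;
    $url =~ s{^http://old\.example\.com/}{http://new.example.com/};
    return $url;
}

# Minimal escaping for attribute values that are written back out
# (the parser hands them over with entities already decoded).
sub escape_attr {
    my ($value) = @_;
    $value =~ s/&/&amp;/g;
    $value =~ s/"/&quot;/g;
    $value =~ s/</&lt;/g;
    $value =~ s/>/&gt;/g;
    return $value;
}

sub filter_links {
    my ($html) = @_;
    my $out = '';

    my $start = sub {
        my ($tag, $attr, $attrseq, $text) = @_;
        my $names = $link_attr{$tag};
        # Pass start tags without link attributes through untouched.
        unless ($names && grep { defined $attr->{$_} } @$names) {
            $out .= $text;
            return;
        }
        $attr->{$_} = rewrite_link($tag, $_, $attr->{$_})
            for grep { defined $attr->{$_} } @$names;
        # Rebuild the start tag, keeping the original attribute order.
        # A trailing "/" (as in <img ... />) shows up as a pseudo-attribute.
        my @keep = grep { $_ ne '/' } @$attrseq;
        $out .= "<$tag";
        $out .= sprintf ' %s="%s"', $_, escape_attr($attr->{$_}) for @keep;
        $out .= (@keep < @$attrseq) ? ' />' : '>';
    };

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [$start, 'tagname, attr, attrseq, text'],
        # Everything else (text, end tags, comments, ...) is copied verbatim.
        default_h   => [sub { $out .= shift }, 'text'],
    );
    $parser->parse($html);
    $parser->eof;
    return $out;
}

print filter_links('<a href="http://old.example.com/x.html">x</a>'), "\n";
```

Only start tags that actually carry a link are rebuilt, so the rest of the document keeps its original formatting. Rebuilt tags are normalised to lower-case names and double-quoted attribute values.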

    --
    <http://www.dave.org.uk>

    "Perl makes the fun jobs fun
    and the boring jobs bearable" - me

Re: Filter Links in HTML-documents
by ichimunki (Priest) on Dec 19, 2000 at 22:25 UTC
    While I generally agree that reading something merlyn wrote is a great idea, I found the following example in the perldoc for HTML::TokeParser:

    This example extracts all links from a document. It will print one line for each link, containing the URL and the textual description between the <A>...</A> tags:

        use HTML::TokeParser;
        $p = HTML::TokeParser->new(shift||"index.html");

        while (my $token = $p->get_tag("a")) {
            my $url = $token->[1]{href} || "-";
            my $text = $p->get_trimmed_text("/a");
            print "$url\t$text\n";
        }

Re: Filter Links in HTML-documents
by merlyn (Sage) on Dec 19, 2000 at 21:05 UTC