Filter Links in HTML-documents

stefan k has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

This has probably been asked before, but I could not find a good solution for it. Sorry!

The problem is: I need to pass all the links in an HTML document through a routine that decides what to do with each one. To do that, the routine takes a) the kind of tag, b) the link inside it (e.g. href in a or background in td), and c) some kind of replacement for a part of the old link.

I've been fiddling with HTML::Filter and HTML::Parser, but I have to admit that I don't fully understand the mechanism of overriding some of their subs. The "filtered_html" example from the HTML::Filter manpage looked interesting, but I didn't manage to set it up.

Thanks for any help and/or clue and/or tip, and RESPECT.
Stefan K

$dom = "skamphausen.de";   ##   May The Open Source Be With You!
$Mail = "mail\@$dom";  $Url = "http://www.$dom";


Replies are listed 'Best First'.
Re: Filter Links in HTML-documents by davorg (Chancellor) on Dec 19, 2000 at 21:08 UTC
You may find that the interfaces to HTML::LinkExtor or HTML::TreeBuilder are easier to understand for the kind of task that you're doing.

p.s. You mention having to overload the methods of HTML::Parser. In the new version (3.x) this is no longer necessary, so it might be worth upgrading.

-- <http://www.dave.org.uk>
"Perl makes the fun jobs fun and the boring jobs bearable" - me
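For what it's worth, here is a minimal sketch of how the handler-based HTML::Parser 3.x interface could be used for this kind of rewriting, with no subclassing. The `%link_attr` table, the `rewrite_link` routine, the `filter_links` wrapper and the example.com host names are all made up for illustration; only the `HTML::Parser` and `HTML::Entities` calls are real. Tags whose links change are rebuilt from the parsed attributes, so their quoting and case may differ from the source.

```perl
use strict;
use warnings;
use HTML::Parser 3.00;
use HTML::Entities qw(encode_entities);

# Assumed table of link-carrying attributes per tag; extend as needed.
my %link_attr = (
    a    => ['href'],
    img  => ['src'],
    body => ['background'],
    td   => ['background'],
);

# Hypothetical decision routine: gets the tag, the attribute name and the
# old link, and returns the link to write out (possibly unchanged).
sub rewrite_link {
    my ($tag, $attr_name, $url) = @_;
    $url =~ s{^http://old\.example\.com/}{http://new.example.com/};
    return $url;
}

sub filter_links {
    my ($html) = @_;
    my $out = '';

    my $on_start = sub {
        my ($tag, $attr, $attrseq, $text) = @_;
        my $changed = 0;
        for my $name (@{ $link_attr{$tag} || [] }) {
            next unless defined $attr->{$name};
            my $new = rewrite_link($tag, $name, $attr->{$name});
            next if $new eq $attr->{$name};
            $attr->{$name} = $new;
            $changed = 1;
        }
        # Untouched tags are copied through exactly as they appeared.
        unless ($changed) { $out .= $text; return; }
        # Changed tags are rebuilt from the parsed attributes, in source order.
        $out .= "<$tag";
        for my $name (@$attrseq) {
            # As far as I know, the slash of <img ... /> is reported as an
            # attribute named "/"; write it back as a bare slash.
            if ($name eq '/') { $out .= ' /'; next; }
            $out .= sprintf ' %s="%s"', $name,
                encode_entities($attr->{$name}, '<>&"');
        }
        $out .= '>';
    };

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [ $on_start, 'tagname, attr, attrseq, text' ],
        # Everything that is not a start tag is passed through verbatim.
        default_h   => [ sub { $out .= $_[0] if defined $_[0] }, 'text' ],
    );
    $p->parse($html);
    $p->eof;
    return $out;
}

print filter_links('<a href="http://old.example.com/a.html">A</a>'), "\n";
```

Keeping the decision in one routine that sees the tag, the attribute name and the old link means the parsing code does not need to change when the rewriting rules do.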
Re: Filter Links in HTML-documents by ichimunki (Priest) on Dec 19, 2000 at 22:25 UTC
While I generally agree that reading something merlyn wrote is a great idea, I found the following example in the perldoc for HTML::TokeParser:

"This example extracts all links from a document. It will print one line for each link, containing the URL and the textual description between the <A>...</A> tags:"

use HTML::TokeParser;
$p = HTML::TokeParser->new(shift||"index.html");

while (my $token = $p->get_tag("a")) {
    my $url = $token->[1]{href} || "-";
    my $text = $p->get_trimmed_text("/a");
    print "$url\t$text\n";
}
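The perldoc example only looks at href in a tags. A rough sketch of how the same module could report the kind of tag and the link attribute for other elements too (e.g. background in td) follows. It assumes HTML::Tagset is installed and relies on its `%linkElements` table; the `list_links` name and the sample markup are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use HTML::Tagset;

# Returns a list of [tag, attribute name, link] triples for a document
# given as a string.
sub list_links {
    my ($html) = @_;
    my @found;
    my $p = HTML::TokeParser->new(\$html)
        or die "Cannot set up the parser";

    # With no arguments, get_tag returns every start and end tag in turn.
    while (my $token = $p->get_tag) {
        my ($tag, $attr) = @$token;
        # End tags come back as "/a" etc. and are not in the table.
        my $names = $HTML::Tagset::linkElements{$tag} or next;
        for my $name (@$names) {
            next unless defined $attr->{$name};
            push @found, [ $tag, $name, $attr->{$name} ];
        }
    }
    return @found;
}

for my $link (list_links('<a href="x.html">x</a><td background="bg.png">c</td>')) {
    print join("\t", @$link), "\n";
}
```

This only lists the links; it does not write a changed document back out.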
Re: Filter Links in HTML-documents by merlyn (Sage) on Dec 19, 2000 at 21:05 UTC
I've got a sample of that in my tarring up a tree WT column.

-- Randal L. Schwartz, Perl hacker