khanan has asked for the wisdom of the Perl Monks concerning the following question:

I want to write a Perl script that does two things:

1. Recursively parse a set of HTML pages in different directories and extract all <A HREF=.....> </A> tags into separate text files while maintaining the directory structure. Each HTML file should get its own text file, residing in the same directory as the page itself. (Note that the links point to other HTML pages.)

2. Use the extracted links from the text files to download the HTML pages and save them to the respective directories.

I have managed to extract the tags, but only from a single file and into a single file. The directory and file structure needs to be maintained: ideally I would point the script at the top-level directory and it would do the link extraction and downloading recursively.

Thanks in advance for any tips or previously used code snippets.

Edited 2002-06-20 by mirod: changed title (was: Recursive HTML Tax Extraction) and added formatting tags

Re: Recursive HTML Tag Extraction
by dws (Chancellor) on Jun 20, 2002 at 16:45 UTC
    You can use HTML::LinkExtor to extract links, and the mirror() function from LWP::Simple to fetch a page and store it. If you use Super Search and search for "LinkExtor", you'll find examples that you can adapt.
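
    A minimal, untested sketch of that approach, using File::Find for the recursion. The .links filename suffix is an arbitrary choice of mine, and the sketch assumes the extracted links are absolute http URLs; relative links would first have to be resolved against a base URL with the URI module.

        #!/usr/bin/perl -w
        use strict;

        use File::Find;
        use HTML::LinkExtor;
        use LWP::Simple;    # exports mirror(), is_success(), RC_NOT_MODIFIED
        use URI;

        my $top = shift or die "usage: $0 <top directory>\n";
        find( \&process, $top );

        sub process {
            return unless -f $_ and /\.html?$/i;  # plain HTML files only
            my $page = $_;                # File::Find chdir()s into each dir
            my $list = "$page.links";     # per-page link list, same directory

            # Collect the HREF attribute of every <a> tag on this page.
            my @links;
            my $parser = HTML::LinkExtor->new( sub {
                my ( $tag, %attr ) = @_;
                push @links, $attr{href} if $tag eq 'a' and $attr{href};
            } );
            $parser->parse_file($page);

            # Step 1: write the links to a text file beside the page.
            open my $fh, '>', $list or die "can't write $list: $!";
            print $fh "$_\n" for @links;
            close $fh;

            # Step 2: mirror each absolute link into the current directory.
            for my $url (@links) {
                next unless $url =~ m{^http://}i;    # skip relative links
                my ($name) = URI->new($url)->path =~ m{([^/]+)$};
                $name ||= 'index.html';              # URL path ends in '/'
                my $rc = mirror( $url, $name );
                warn "mirror $url: $rc\n"
                    unless is_success($rc) or $rc == RC_NOT_MODIFIED;
            }
        }

    Because File::Find chdir()s into each directory as it walks the tree, both the link list and the mirrored pages land next to the page they came from, which keeps the directory structure intact.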

Re: Recursive HTML Tag Extraction
by hacker (Priest) on Jun 20, 2002 at 16:39 UTC
    Would HTML::LinkExtor be what you're looking for? Perhaps posting the code you've got so far would help us understand your design.

Re: Recursive HTML Tax Extraction
by khanan (Initiate) on Jun 20, 2002 at 10:54 UTC
    The title should read Recursive HTML TAG Extraction. Sorry about the typo. ;-)