GermanHerman has asked for the wisdom of the Perl Monks concerning the following question:

I have been racking my brain lately trying to write a script that compares many files I have downloaded and "detemplates" them. That is, given two files generated by the same script from the same template, it should recover the information that was passed to the script to generate them. This is all part of my plan to take over the world, and to help web spiders store less information.

I have looked into several methods, including abusing the diff program in different ways. Unfortunately, I haven't been able to get information out of diff that isn't line-based. (You would be surprised how many websites nowadays contain HTML with no line breaks.) I would also like to be able to get recurring patterns out of a page.

My current solution involves comparing chunks of HTML trees from HTML::TreeBuilder in different ways, but I have yet to see any success with that.

The Question:

Has anyone out there dealt with this problem before, or am I inventing a new wheel? If so, what was their solution, and where can I look at some source code? If not, do any hints come to mind?

-Douglas

Replies are listed 'Best First'.
Re: Internet Information Extractor
by perrin (Chancellor) on Sep 08, 2003 at 19:33 UTC
      Both of these modules look great. I don't understand why I didn't see them on CPAN before; I have spent HOURS on CPAN looking for this sort of thing. Thank you so much. -Douglas
Re: Internet Information Extractor
by adrianh (Chancellor) on Sep 08, 2003 at 19:33 UTC

    Although it still has some rough edges, Template::Extract is a really useful module for doing this sort of thing.

    You write a Template Toolkit template that would generate the HTML, pass it and the raw HTML to Template::Extract, and get the data out the other end.
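
    A minimal sketch of that workflow, assuming Template::Extract is installed from CPAN. The template, the sample HTML, and the field names (title, item, name, price) are made up for illustration; a real page would need a template written to match its actual markup.

```perl
use strict;
use warnings;
use Template::Extract;
use Data::Dumper;

# Hypothetical template: a guess at what the generating script might
# have used. Variables mark the spots where the data was filled in.
my $template = <<'END_TEMPLATE';
<html><body><h1>[% title %]</h1><ul>[% FOREACH item %]<li>[% name %]: [% price %]</li>[% END %]</ul></body></html>
END_TEMPLATE

# Hypothetical downloaded page with no line breaks inside the markup.
my $document = <<'END_HTML';
<html><body><h1>Widgets</h1><ul><li>Sprocket: 4.50</li><li>Gear: 2.25</li></ul></body></html>
END_HTML

# extract() runs the template "in reverse" and returns a hash reference
# holding the captured variables, or undef if the document does not match.
my $data = Template::Extract->new->extract($template, $document);

if (defined $data) {
    print Dumper($data);
}
else {
    print "Template did not match the document\n";
}
```

    Because the match is done on the whole string rather than line by line, HTML with no line breaks is not a problem here. Note that this only works once you already have (or can guess) the template; it does not discover the template from two pages by itself.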