Internet Information Extractor

GermanHerman has asked for the wisdom of the Perl Monks concerning the following question:

I have been racking my brain lately trying to write a script that compares many files I have downloaded and "detemplates" them, so that two files generated by the same script from the same template would yield the information that was given to the script to generate them. This is all part of my plan to take over the world and to help web spiders store less information.

I have looked into several methods, including abusing the diff program in different ways. Unfortunately, I haven't been able to get information out of diff that isn't line-based. (You would be surprised how many websites nowadays contain HTML with no line breaks.) I would also like to be able to extract recurring patterns from a page.
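To make the line-break problem concrete, here is a minimal sketch of a token-level comparison (shown in Python purely for illustration, using only its standard `difflib` module). It splits each page into tags and text runs, so line breaks do not matter, and reports only the regions that differ. The names `tokenize` and `varying_parts` and the two sample pages are made up for this example; real-world HTML would need a proper parser rather than a regular expression.

```python
import difflib
import re

def tokenize(html):
    """Split HTML into tags and runs of text, independent of line breaks."""
    # Naive tokenizer (an assumption of this sketch): a tag is '<...>',
    # everything else is text. Whitespace-only text runs are dropped.
    return [t for t in re.findall(r"<[^>]*>|[^<]+", html) if t.strip()]

def varying_parts(page_a, page_b):
    """Return (a_tokens, b_tokens) pairs for each region where the pages differ."""
    a, b = tokenize(page_a), tokenize(page_b)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # "equal" regions are the shared template
    ]

# Two hypothetical pages produced from the same template, with no line breaks.
page_a = "<html><body><h1>Apples</h1><p>Price: 3</p></body></html>"
page_b = "<html><body><h1>Pears</h1><p>Price: 5</p></body></html>"

for left, right in varying_parts(page_a, page_b):
    print(left, "<->", right)
```

The regions reported as equal approximate the template, and the differing regions approximate the data that was fed into it.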

My current solution involves comparing chunks of HTML trees from HTML::TreeBuilder in different ways, but I have yet to see any success with that.

The Question:

Has anyone out there dealt with this problem before, or am I inventing a new wheel? If so, what was their solution, and where can I look at some source code? If not, do any hints come to mind?

-Douglas

Re: Internet Information Extractor by perrin (Chancellor) on Sep 08, 2003 at 19:33 UTC
Maybe HTML::Diff or Text::ParagraphDiff would help.
Re: Re: Internet Information Extractor by GermanHerman (Sexton) on Sep 08, 2003 at 20:43 UTC
Both of these modules look great; I don't understand why I didn't see them on CPAN before. I have spent HOURS on CPAN looking for this sort of thing. Thank you so much. -Douglas
Re: Internet Information Extractor by adrianh (Chancellor) on Sep 08, 2003 at 19:33 UTC
Although it still has some rough edges, Template::Extract is a really useful module for doing this sort of thing. You write a Template Toolkit template that would generate the HTML, pass this and the raw HTML to Template::Extract, and get the data out the other end.
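As a rough illustration of the idea only (this is not the Template::Extract API, and it is written in Python to stay self-contained), the sketch below turns a template with `[% name %]` placeholders into a regular expression and uses it to pull the values back out of a document. The function names and sample strings are hypothetical, and it handles only simple scalar placeholders with distinct identifier-like names, with no loops or conditionals.

```python
import re

def template_to_regex(template):
    """Turn '[% name %]' placeholders into named, non-greedy capture groups."""
    # re.split with one capture group yields: literal, name, literal, name, ...
    parts = re.split(r"\[%\s*(\w+)\s*%\]", template)
    pattern = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pattern += re.escape(part)        # fixed template text
        else:
            pattern += f"(?P<{part}>.*?)"     # placeholder becomes a capture group
    return re.compile(pattern, re.DOTALL)

def extract(template, document):
    """Return a dict of placeholder values, or None if the document does not fit."""
    match = template_to_regex(template).fullmatch(document)
    return match.groupdict() if match else None

# Hypothetical template and a page assumed to have been generated from it.
template = "<h1>[% title %]</h1><p>Price: [% price %]</p>"
document = "<h1>Apples</h1><p>Price: 3</p>"

print(extract(template, document))
```

The catch with this approach is that you need the template up front; it does not discover the template from the pages themselves.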
Re: Re: Internet Information Extractor by merlyn (Sage) on Sep 08, 2003 at 22:47 UTC
And don't forget the newest stepchild, Template::Generate: you give it the output of the template and the data that generated it, and it gives you back the template!
-- Randal L. Schwartz, Perl hacker
Be sure to read my standard disclaimer if this is a reply.
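A deliberately naive sketch of that reverse direction (not how Template::Generate works internally, and again in Python for self-containment): given the data and the generated page, replace each data value found in the page with a `[% key %]` placeholder. The function name and sample values are hypothetical. It assumes values never occur inside the fixed markup and never match part of an already inserted placeholder.

```python
def generate_template(data, document):
    """Replace each data value found in the document with a '[% key %]' placeholder."""
    template = document
    # Substitute longer values first so a short value cannot split a longer one.
    for key, value in sorted(data.items(), key=lambda kv: -len(str(kv[1]))):
        if str(value):  # skip empty values, which would match everywhere
            template = template.replace(str(value), f"[% {key} %]")
    return template

# Hypothetical data and the page assumed to have been generated from it.
data = {"title": "Apples", "price": "3"}
document = "<h1>Apples</h1><p>Price: 3</p>"

print(generate_template(data, document))
```

With a template recovered this way from one page whose data is known, the same template could then be used to extract data from the other pages.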