I have been racking my brain lately trying to write a script
that compares many files that I have downloaded and "detemplates" them: given two files generated by the same script from the same template, it should recover the information that was given to that script to generate the files. This is all part of my plan to take over the world and to help web spiders store less information.
I have looked into several methods, including abusing the diff
program in different ways. Unfortunately, I haven't been able to get information out of diff that isn't line based. (You would be surprised how many websites nowadays contain HTML with no line breaks.) I would also like to be able to get recurring patterns out of a page.
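To illustrate what a comparison that is not line based might look like, here is a minimal sketch. It is in Python rather than Perl, and it is only one possible approach under stated assumptions: the pages are split into tags and text runs with a naive regular expression, and the token lists are aligned with the standard library's difflib.SequenceMatcher. The names tokenize and detemplate, the "{}" slot marker, and the two sample pages are all made up for this example.

```python
import re
from difflib import SequenceMatcher


def tokenize(html):
    # Naive split into tags and the text runs between them.
    # Whitespace-only text is dropped. This is an assumption that
    # suits simple pages; it is not a real HTML parser.
    return [t for t in re.findall(r"<[^>]*>|[^<]+", html) if t.strip()]


def detemplate(page_a, page_b):
    """Align two pages token by token and separate shared from varying parts."""
    tokens_a, tokens_b = tokenize(page_a), tokenize(page_b)
    matcher = SequenceMatcher(None, tokens_a, tokens_b, autojunk=False)
    template, data_a, data_b = [], [], []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            # Tokens present in both pages are treated as template.
            template.extend(tokens_a[i1:i2])
        else:
            # Anything that differs becomes a slot (hypothetical "{}" marker).
            template.append("{}")
            data_a.append("".join(tokens_a[i1:i2]))
            data_b.append("".join(tokens_b[j1:j2]))
    return template, data_a, data_b


# Two made-up pages with no line breaks at all.
page1 = "<html><body><h1>Alice</h1><p>Age: 30</p></body></html>"
page2 = "<html><body><h1>Bob</h1><p>Age: 41</p></body></html>"

template, d1, d2 = detemplate(page1, page2)
print("".join(template))
print(d1)
print(d2)
```

Because the alignment works on tokens instead of lines, missing line breaks in the HTML do not matter. The sketch does nothing about recurring patterns within a single page.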
My current solution involves comparing chunks of HTML trees from HTML::TreeBuilder in different ways, but I have yet to see any success with that.
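As a rough picture of the tree-comparison idea, the following sketch walks two parsed pages in parallel and reports the text nodes that differ at the same position. It is in Python, not Perl, and it does not use HTML::TreeBuilder: the tree is built with the standard library's html.parser. The names SimpleTree, parse and tree_diff are hypothetical. The sketch assumes well-formed input in which every tag is explicitly closed, and it ignores attributes entirely.

```python
from html.parser import HTMLParser


class SimpleTree(HTMLParser):
    """Builds a nested [tag, children] tree; text nodes are plain strings."""

    def __init__(self):
        super().__init__()
        self.root = ["#root", []]
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = [tag, []]
        self.stack[-1][1].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Assumption: every end tag closes the innermost open element.
        # Unclosed tags such as a bare <br> would break this.
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.stack[-1][1].append(data.strip())


def parse(html):
    builder = SimpleTree()
    builder.feed(html)
    builder.close()
    return builder.root


def tree_diff(a, b, path="#root"):
    """Return (path, value_a, value_b) for children that differ at the same position."""
    diffs = []
    # zip() stops at the shorter child list, so extra children are ignored here.
    for index, (child_a, child_b) in enumerate(zip(a[1], b[1])):
        here = f"{path}/{index}"
        if isinstance(child_a, str) and isinstance(child_b, str):
            if child_a != child_b:
                diffs.append((here, child_a, child_b))
        elif (isinstance(child_a, list) and isinstance(child_b, list)
              and child_a[0] == child_b[0]):
            # Same element in both trees: descend into it.
            diffs.extend(tree_diff(child_a, child_b, f"{here}:{child_a[0]}"))
        else:
            # Structural mismatch: report the whole subtree pair.
            diffs.append((here, child_a, child_b))
    return diffs


page1 = "<html><body><h1>Alice</h1><p>Age: 30</p></body></html>"
page2 = "<html><body><h1>Bob</h1><p>Age: 41</p></body></html>"

diffs = tree_diff(parse(page1), parse(page2))
for path, left, right in diffs:
    print(path, repr(left), repr(right))
```

A positional walk like this only works when both pages have the same shape. Pages whose templates repeat a block a variable number of times would need some form of sequence alignment over the child lists instead.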
Has anyone out there dealt with this problem before, or am I inventing a new wheel? If someone has, what was their solution, and where can I look at some source code? If not, do any hints come to mind?