I have been racking my brain lately trying to write a script that compares many files I have downloaded and "detemplates" them. That is, given two files generated by the same script from the same template, it should recover the information that was fed to the script to generate them. This is all part of my plan to take over the world and to help web spiders store less information.

I have looked into several methods, including abusing the diff program in different ways. Unfortunately, I haven't been able to get information out of diff that isn't line-based. (You would be surprised how many websites nowadays contain HTML with no line breaks.) I would also like to be able to extract recurring patterns from a page.
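To make the "not line-based" idea concrete, here is a minimal sketch of what I mean, written in Python with only the standard library (not Perl, and not the diff program itself). It splits each page into tag and text tokens, then aligns the two token lists with difflib.SequenceMatcher. The names tokenize and detemplate, the "{}" placeholder, and the sample pages are all made up for illustration. It assumes both pages really do come from the same template, and it does nothing about recurring patterns.

```python
import re
from difflib import SequenceMatcher

# Naive tokenizer: a token is either one tag or one run of text between tags.
# Line breaks play no role, so single-line HTML is handled the same way.
# (A stray "<" with no closing ">" is silently skipped by this pattern.)
TOKEN_RE = re.compile(r"<[^>]+>|[^<]+")

def tokenize(html):
    return TOKEN_RE.findall(html)

def detemplate(page_a, page_b):
    """Return (template, fields_a, fields_b) for two pages assumed to share a template."""
    a, b = tokenize(page_a), tokenize(page_b)
    template, fields_a, fields_b = [], [], []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            # Tokens common to both pages are treated as template text.
            template.extend(a[i1:i2])
        else:
            # Anything that differs becomes one placeholder plus one field per page.
            template.append("{}")
            fields_a.append("".join(a[i1:i2]))
            fields_b.append("".join(b[j1:j2]))
    return "".join(template), fields_a, fields_b

# Hypothetical sample pages, each a single line of HTML.
page_a = "<html><body><h1>Alice</h1><p>Age: 30</p></body></html>"
page_b = "<html><body><h1>Bob</h1><p>Age: 41</p></body></html>"

template, fields_a, fields_b = detemplate(page_a, page_b)
print(template)
print(fields_a)
print(fields_b)
```

With only two pages, anything that happens to be identical in both (for example the "Age: " label if it were its own token) ends up in the template, so more samples would presumably give a better split.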

My current approach involves comparing chunks of HTML trees from HTML::TreeBuilder in different ways, but I have yet to see any success with that.
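For reference, the kind of tree comparison I have in mind looks roughly like the sketch below. It is Python using the standard library's html.parser as a stand-in, not HTML::TreeBuilder, and it is not my actual code. The names SimpleTreeBuilder, parse, and diff_trees are made up, and the sketch assumes well-formed, properly nested markup with no void elements such as br or img.

```python
from html.parser import HTMLParser

class SimpleTreeBuilder(HTMLParser):
    """Build a tiny tree of dicts: {"tag": str, "children": [node or str]}."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#root", "children": []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "children": []}
        self.stack[-1]["children"].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Naive: trusts that every end tag closes the most recent open tag.
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        # Keep non-blank text as plain string children.
        if data.strip():
            self.stack[-1]["children"].append(data.strip())

def parse(html):
    builder = SimpleTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root

def diff_trees(a, b, path="", out=None):
    """Walk both trees in parallel and collect (path, a, b) where they differ."""
    if out is None:
        out = []
    if isinstance(a, str) or isinstance(b, str):
        if a != b:
            out.append((path, a, b))
        return out
    if a["tag"] != b["tag"] or len(a["children"]) != len(b["children"]):
        # Structure differs: report the whole subtree instead of descending.
        out.append((path, a, b))
        return out
    for i, (ca, cb) in enumerate(zip(a["children"], b["children"])):
        label = "text" if isinstance(ca, str) else ca["tag"]
        diff_trees(ca, cb, f"{path}/{label}[{i}]", out)
    return out

# Hypothetical sample pages sharing one template.
tree_a = parse("<html><body><h1>Alice</h1><p>Age: 30</p></body></html>")
tree_b = parse("<html><body><h1>Bob</h1><p>Age: 41</p></body></html>")

diffs = diff_trees(tree_a, tree_b)
for path, va, vb in diffs:
    print(path, repr(va), repr(vb))
```

The index in each path step is the child's position under its parent. The weak spot is the structural case: as soon as two pages have a different number of children somewhere (say, a list with three rows versus five), this gives up on the whole subtree, which is exactly where recurring patterns would need to be detected.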

The Question:

Has anyone out there dealt with this problem before, or am I inventing a new wheel? If so, what was their solution, and where can I look at some source code? If not, do any hints come to mind?

-Douglas

In reply to Internet Information Extractor by GermanHerman
