One thing that might help you is to use modules that can parse the HTML. HTML::Tree::Scanning is an article that explains how to scan a document with HTML::Parser and HTML::TreeBuilder.
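To give a rough feel for what event-based scanning looks like, here is a minimal sketch. It is not the code from that article and it does not use the Perl modules; it uses Python's standard-library html.parser as a stand-in. The names PathCollector and text_paths are made up for illustration. It records the chain of open tags above every piece of text, which is the kind of information you need before you can spot what repeats.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not deepen the path.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Record the tag path (e.g. 'html/body/div') of every text node.

    Hypothetical helper for illustration, not part of any library.
    """

    def __init__(self):
        super().__init__()
        self.stack = []   # tags that are currently open
        self.found = []   # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.found.append(("/".join(self.stack), text))


def text_paths(html):
    """Return a list of (tag path, text) pairs for the given HTML string."""
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    return parser.found
```

Paths that show up many times with different text are candidates for the repeated structure.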

Just a wild idea: an algorithm to find the repeated structures might work, but humans are much better at that sort of pattern matching. Since complete website redesigns are not that common, you might instead give a designated user the ability to call up a listing of the website's HTML structure and interactively mark the start of the repeatable structure and the element that holds the product name. That only works if the user knows a bit of HTML, but it would make updating after a website change relatively easy and fast.
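A sketch of how the two marks could be applied once the user has chosen them, again in Python with the standard-library html.parser rather than the Perl modules. Everything here is an assumption made for illustration: the marks are taken to be simple (tag, class) pairs, and ProductNames, product_names and the sample markup are invented names and data.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def matches(entry, mark):
    """True if an open (tag, classes) entry fits a user-chosen (tag, class) mark."""
    tag, classes = entry
    return tag == mark[0] and mark[1] in classes


class ProductNames(HTMLParser):
    """Collect the text of each name element that sits inside a marked item."""

    def __init__(self, item_mark, name_mark):
        super().__init__()
        self.item_mark = item_mark   # assumed form: (tag, class) of the repeated item
        self.name_mark = name_mark   # assumed form: (tag, class) of the name element
        self.stack = []              # open elements as (tag, [classes])
        self.name_level = None       # stack depth at which the current name opened
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        inside_item = any(matches(e, self.item_mark) for e in self.stack)
        entry = (tag, (dict(attrs).get("class") or "").split())
        self.stack.append(entry)
        if self.name_level is None and inside_item and matches(entry, self.name_mark):
            self.name_level = len(self.stack)
            self.names.append("")

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if any(t == tag for t, _ in self.stack):
            while self.stack and self.stack.pop()[0] != tag:
                pass
        if self.name_level is not None and len(self.stack) < self.name_level:
            self.name_level = None   # the name element has been closed

    def handle_data(self, data):
        if self.name_level is not None:
            self.names[-1] += data


def product_names(html, item_mark, name_mark):
    """Return the whitespace-normalised product names found in html."""
    parser = ProductNames(item_mark, name_mark)
    parser.feed(html)
    parser.close()
    return [" ".join(name.split()) for name in parser.names]
```

When the site changes, only the two marks passed to product_names would need updating, for example product_names(page, ("li", "product"), ("b", "title")).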


In reply to Re: Search for repeating but slightly different patterns by jethro
in thread Search for repeating but slightly different patterns by Sly_G
