in reply to Search for repeating but slightly different patterns

One thing that might help is to use modules that can parse the HTML. HTML::Tree::Scanning is an article that explains how to scan HTML with HTML::Parser and HTML::TreeBuilder.

Just a wild idea: An algorithm to find the repeated structures might work, but humans are much better at that sort of pattern matching. Since complete website redesigns are not that common, you might instead give a designated user the ability to request a listing of the site's HTML structure and to interactively mark the start of the repeating structure and the element that holds the product name. That would only work if that user knows a bit of HTML, but it would make updating for a changed website relatively easy and fast.
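To make the "listing of the HTML structure" idea concrete, here is a minimal sketch. It is written in Python with the standard library's html.parser rather than the Perl modules named above, and everything in it (the `PathCounter` class, the `repeated_paths` helper, the tag.class labelling) is a hypothetical illustration, not an existing tool. It counts how often each tag path occurs, so a user can see at a glance which paths repeat and pick the one that wraps a product.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathCounter(HTMLParser):
    """Count how often each tag path (e.g. html/body/div.product) occurs."""

    def __init__(self):
        super().__init__()
        self.stack = []          # labels of the currently open elements
        self.counts = Counter()  # path -> number of occurrences

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        label = f"{tag}.{cls}" if cls else tag
        self.counts["/".join(self.stack + [label])] += 1
        if tag not in VOID_TAGS:
            self.stack.append(label)

    def handle_endtag(self, tag):
        # Forgiving close: pop back to the nearest matching open tag.
        # A stray closing tag with no matching open tag is ignored.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break


def repeated_paths(html, min_count=2):
    """Return (path, count) pairs for paths seen at least min_count times."""
    parser = PathCounter()
    parser.feed(html)
    parser.close()
    return [(p, n) for p, n in parser.counts.most_common() if n >= min_count]
```

The listing from `repeated_paths(page)` is only a starting point for the human: the user would still choose which repeated path is the product block and which child holds the name.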


Re^2: Search for repeating but slightly different patterns
by thargas (Deacon) on Oct 25, 2010 at 12:33 UTC

    Having done this kind of thing a few times, let me suggest that, IMHO, this is a case where a regex is better than parsing. Leaving aside sites that continually change their formats, the main problem I found was that a lot of websites use broken HTML: it won't validate, so parsing it is problematic.

    This is one of those cases where, in theory, parsing is better, but it has to be very forgiving parsing, to the point that I found it unreliable. Disheartening.
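As a rough illustration of the regex route on markup that would not validate, here is a minimal sketch in Python's standard re module. The markup it targets (a div whose class contains "product", followed by a name in an h2, h3 or b element) is an assumption made up for the example, as are the names `PRODUCT_RE` and `product_names`; a real site would need its own pattern.

```python
import re

# Hypothetical layout: each product sits in <div class="...product...">,
# and its name is the text of the first <h2>, <h3> or <b> that follows.
PRODUCT_RE = re.compile(
    r'<div[^>]*\bclass\s*=\s*["\']?[^"\'>]*\bproduct\b[^>]*>'  # block start
    r'.*?<(?:h2|h3|b)\b[^>]*>\s*(?P<name>[^<]+?)\s*<',         # name element
    re.IGNORECASE | re.DOTALL,
)


def product_names(html):
    """Return the product names found, in document order."""
    return [m.group("name") for m in PRODUCT_RE.finditer(html)]


# Unquoted attributes, mixed-case tags and unclosed elements are tolerated.
broken = ('<div class=product><h2>Red Widget</h2><p>cheap'
          '<div class="product item"><H3 id=x> Blue Widget </h3></div>')
print(product_names(broken))
```

The pattern is deliberately loose, which is also its weakness: a product block with no heading would let the lazy `.*?` run on into the next block, so the results are worth spot-checking against the page.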

      I'm with you on this. If all sites followed strict HTML standards, the web would be a much better place. So, regex it is!