in reply to extracting data from HTML

In my view, you have only a very few options, and all of them depend upon the original data source:

  1. If at all possible, change the source.   If you are drawing data from a web page owned by someone you are on friendly terms with (i.e. they will not view your actions as “scraping their databases”), then negotiate with them for a better feed.   Maybe they have a SOAP interface; maybe they can build one.
  2. If the HTML has a consistent structure, then you can parse it meaningfully.   But the structure has to be genuinely consistent: the same elements must mark the same kinds of data on every page you process.
  3. If not, you have to use regular expressions to recognize the “wheat” within the “chaff” of data.   I have personally used that approach with Parse::RecDescent to extract data from thousands of SAS files, Korn shell scripts and Tivoli Workload Scheduler schedules.   You must identify the “wheat” that contains data as well as enough of the “chaff” to establish context, then build a “forgiving” grammar.   It wasn’t easy.
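
To make options 2 and 3 a little more concrete, here is a minimal sketch. It is written in Python with only the standard library rather than with Parse::RecDescent, so treat it as an illustration of the idea and not as the approach described above. The sample markup, the "Special:" marker and the names RowCollector, extract_rows and extract_special are all invented for the example.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page; the markup and values are invented for illustration.
SAMPLE = """
<table id="prices">
  <tr><td>Widget</td><td>4.50</td></tr>
  <tr><td>Gadget</td><td>12.00</td></tr>
</table>
<p>Special: <b>Gizmo</b> now only $ 7.25 !</p>
"""


class RowCollector(HTMLParser):
    """Option 2: rely on consistent structure (each <tr> holds <td> cells)."""

    def __init__(self):
        super().__init__()
        self.rows = []          # completed rows, each a list of cell strings
        self._row = None        # row currently being built, or None
        self._in_cell = False   # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._in_cell = False  # tolerate a missing </td>

    def handle_data(self, data):
        if self._in_cell and self._row:
            self._row[-1] += data.strip()


def extract_rows(html):
    """Return every table row in the page as a list of cell strings."""
    parser = RowCollector()
    parser.feed(html)
    return parser.rows


# Option 3: a forgiving pattern. The "wheat" is the name and the price; the
# "chaff" we keep is just enough context ("Special:" and the "$") to be sure
# we are looking at the right thing. Everything else is skipped loosely.
SPECIAL = re.compile(
    r"Special:\s*(?:<[^>]+>\s*)*"   # context marker, then any opening tags
    r"(?P<name>[^<]+?)\s*"          # wheat: the product name
    r"(?:</[^>]+>\s*)*"             # chaff: closing tags, if any
    r"[^$<]*\$\s*"                  # filler words up to the currency sign
    r"(?P<price>\d+(?:\.\d+)?)",    # wheat: the price
    re.IGNORECASE,
)


def extract_special(html):
    """Return (name, price) for the first special offer, or None if absent."""
    m = SPECIAL.search(html)
    if m is None:
        return None
    return m.group("name").strip(), float(m.group("price"))
```

Calling extract_rows(SAMPLE) collects the two table rows, and extract_special(SAMPLE) pulls the name and price out of the free-form paragraph. The pattern deliberately tolerates extra tags, spacing and filler words around the pieces it cares about, which is the “forgiving” part; a real grammar for messy input would need many more such allowances.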