totally agree that scraping html is quite bad and unstable. my rough guide for scraping, from most to least desirable:
- don't scrape at all if you can avoid it...find out whether there's a RESTful way to get the results instead, via some API or an alternate (e.g. XML format) url (see the first sketch below this list)
- if the html is well formed (i.e. xhtml) then it's almost guaranteed to be well structured, with proper closing tags etc...so use one of the XML based parsers. you can easily get at specific elements via a well defined hierarchy once parsed (see the second sketch below this list)
- things get ugly when you start slurping the whole html into a scalar and progressively finding markers from which to suck bits into variables for inspection. even here you can code a bit defensively by picking sane markers, e.g. "id" or "class" attributes. never anchor to text containing inline html styles or other bits that are likely to change frequently, like inline javascript.
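
a minimal sketch of the "don't scrape" option, assuming the site happens to offer a JSON endpoint (the url and the response structure here are made up for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use JSON qw(decode_json);

# hypothetical JSON endpoint -- substitute whatever api the site actually offers
my $url = 'https://example.com/api/v1/widgets?limit=10';

my $ua  = LWP::UserAgent->new( timeout => 15 );
my $res = $ua->get($url);
die 'request failed: ' . $res->status_line . "\n" unless $res->is_success;

# decode_json wants UTF-8 encoded bytes, so use content() rather than decoded_content()
my $data = decode_json( $res->content );

# the {widgets => [...]} structure below is invented for the example
for my $w ( @{ $data->{widgets} } ) {
    print "$w->{name}\t$w->{price}\n";
}
```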
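and a sketch of the parser-based option, assuming the page is well formed xhtml and that the interesting table carries a stable id attribute (again, the url and the id are invented):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use XML::LibXML;

my $ua  = LWP::UserAgent->new( timeout => 15 );
my $res = $ua->get('https://example.com/report.xhtml');    # hypothetical page
die 'request failed: ' . $res->status_line . "\n" unless $res->is_success;

# only works if the markup really is well formed; load_html(recover => 1)
# is the more forgiving route for tag-soup html
my $doc = XML::LibXML->load_xml( string => $res->decoded_content );

# anchor on a stable "id" marker, never on inline styles or javascript
my ($table) = $doc->findnodes('//*[@id="results"]');
die "no #results element -- the page layout probably changed\n" unless $table;

# local-name() sidesteps the default xhtml namespace, if the page declares one
for my $row ( $table->findnodes('.//*[local-name()="tr"]') ) {
    my @cells = map { $_->textContent }
                $row->findnodes('.//*[local-name()="td"]');
    print join( "\t", @cells ), "\n";
}
```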
finally, if you end up producing html in the CGI, don't mix the actual output with styling. write a stylesheet instead. producing a well formed xhtml document in the CGI without inline styles leaves the option open later of consuming the CGI's output from another CGI or whatever. it's also much easier to change the output via the stylesheet than by digging through perl code.
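
a bare-bones sketch of that idea: a CGI that emits plain xhtml and leaves every bit of presentation to an external stylesheet (report.css and the table contents are just placeholders):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# all presentation lives in report.css -- no inline styles, no <style> blocks
print "Content-Type: application/xhtml+xml\n\n";
print <<'XHTML';
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>widget report</title>
    <link rel="stylesheet" type="text/css" href="report.css" />
  </head>
  <body>
    <table id="results">
      <tr><td class="name">gizmo</td><td class="price">4.20</td></tr>
    </table>
  </body>
</html>
XHTML
```

because the output is clean xhtml with stable id/class hooks, anything downstream (another CGI, the XML::LibXML sketch above, or just a new stylesheet) can consume or restyle it without touching the perl.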
there are frameworks for doing even fancier scraping, where you end up running a browser engine server side to pretend that your program is a browser. this is necessary when a website produces most of its output dynamically with javascript: since javascript is browser/client side code, you won't see the results unless you actually run it. this is pretty horrid stuff. although you can do automatic logins, traverse a website and collect results...it typically breaks as soon as absolutely anything changes on the website. a good/helpful website, even if it's fancy and dynamically rendered with javascript, should provide a RESTful api to get data out. but some companies still insist on not being very helpful.
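for the record, a rough sketch of what that browser-engine route looks like with Selenium::Remote::Driver, assuming a selenium server is already running on localhost:4444 and with urls, field names and selectors all invented for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Selenium::Remote::Driver;

# talk to a selenium server that drives a real browser engine for us
my $driver = Selenium::Remote::Driver->new(
    remote_server_addr => 'localhost',
    port               => 4444,
    browser_name       => 'firefox',
);

# automatic login -- the urls and form field names here are made up
$driver->get('https://example.com/login');
$driver->find_element('user', 'name')->send_keys('me');
$driver->find_element('pass', 'name')->send_keys('s3cret');
$driver->find_element('//form', 'xpath')->submit;

# by now the page's javascript has run, so we see the rendered DOM
$driver->get('https://example.com/dashboard');
print $_->get_text, "\n" for $driver->find_elements('#results tr', 'css');

$driver->quit;
```

and exactly as above: the moment the login form or the dashboard markup changes, this breaks.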
the hardest line to type correctly is: stty erase ^H