assuming you (the developer) know ahead of time which site produces which variant of the presentation, you can write site-specific code that scrapes each site and translates it into one stable internal representation. from there on, the rest of your code only ever works with that internal representation. in effect this strategy decouples the initial read of the html from the rest of the code. the hard part is choosing an internal representation (data structure) that works for all cases and gives you consistent access to the data.
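here's a rough sketch of that idea in python. everything in it is made up for illustration: the two site names, their html layouts, and the `Item` fields are assumptions, and the regexes are only there to keep the example short (a real scraper would probably use a proper html parser).

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Item:
    """internal representation: the only shape the rest of the code sees."""
    title: str
    price_cents: int


def parse_site_a(html: str) -> Item:
    # hypothetical site A layout: <h1>title</h1> ... <span class="price">$12.50</span>
    title = re.search(r"<h1>(.*?)</h1>", html, re.S).group(1).strip()
    dollars, cents = re.search(r'class="price">\$(\d+)\.(\d{2})<', html).groups()
    return Item(title, int(dollars) * 100 + int(cents))


def parse_site_b(html: str) -> Item:
    # hypothetical site B layout: <td id="name">title</td><td id="cost">1250</td>
    # (this site is assumed to report the price in cents already)
    title = re.search(r'<td id="name">(.*?)</td>', html, re.S).group(1).strip()
    cents = re.search(r'<td id="cost">(\d+)</td>', html).group(1)
    return Item(title, int(cents))


# one parser per site, chosen by a site key you know ahead of time
PARSERS: Dict[str, Callable[[str], Item]] = {
    "site-a.example": parse_site_a,
    "site-b.example": parse_site_b,
}


def load(site: str, html: str) -> Item:
    """the only place that knows about site-specific html."""
    return PARSERS[site](html)


def describe(item: Item) -> str:
    """downstream code: works on Item only, never on html."""
    return f"{item.title}: ${item.price_cents // 100}.{item.price_cents % 100:02d}"
```

supporting a new site then means writing one more parser and adding it to `PARSERS`; `describe` and anything else downstream stays untouched. the catch is the same as above: `Item` has to be general enough to hold what every site gives you.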