Do you know what the .html looks like? Or are you working against the full range of (semi-reasonable) possibilities?
If the former, provide a sample, so that we may better understand your purpose/intent (as noted above, there are bits and pieces of your question that are less than clear).
re removal of <pre.../pre> and character-entity spaces: the webmaster probably put them there for a reason. If you're merely extracting content to .txt, a db, or some such, the webmaster's intent imposes no requirement on you; but if you have some web-ish or "rendered" use in mind, beware.
re s/<blockquote.*?\/blockquote>$//s
Fixing that regex is comparatively easy (if you mean what I suspect), but if you come to frequent the Monastery, you will often read this advice about parsing HTML with hand-rolled code: "DON'T!". Instead, search on html parse for relevant nodes pointing to the modules that will serve you well... ...AND