in reply to Stripping tags from a PerlMonks page.
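    {
        open(PAGE, "$PageToParse.html") or die "Could not open: $!\n";
        local $/;    # slurp mode: the loop below runs once, on the whole file
        while (<PAGE>) {
            m#<TITLE>(.*)</TITLE># and $title=$1;    # remember the title
            s#^.*?</TABLE>##s;                       # drop everything through the first </TABLE>
            s#<!-- nodelets start.*##s;              # drop the nodelets and everything after them
            print $_;    ## Or to a new file, etc.
        }
        close(PAGE);
    }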
Short, ugly, and to the point. As far as I can see, the title is the only thing worth keeping from everything up to the end of the first TABLE tag. Jettison all of that, jettison everything from the start of the nodelets onward, and add back in stuff like the title, <BODY>, </BODY>, etc. as you desire.
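For the "add back in" part, here is a rough sketch of one way it might look. The `strip_page` sub, the 'Untitled' fallback, and the sample page are all made up for illustration and are not from the post above. The title match is also loosened here (non-greedy, case-insensitive), which is my assumption and not necessarily what real pages need.

```perl
use strict;
use warnings;

# Hypothetical helper: strip a page the same way as the snippet above,
# then wrap what is left in a minimal HTML skeleton.
sub strip_page {
    my ($html) = @_;

    # Grab the title before the header gets thrown away.
    my ($title) = $html =~ m#<TITLE>(.*?)</TITLE>#is;
    $title = 'Untitled' unless defined $title;    # assumed fallback

    # The same two substitutions as in the original snippet.
    $html =~ s#^.*?</TABLE>##s;
    $html =~ s#<!-- nodelets start.*##s;

    # Add back the title and the BODY tags.
    return "<HTML><HEAD><TITLE>$title</TITLE></HEAD>\n"
         . "<BODY>\n$html\n</BODY></HTML>\n";
}

# A made-up sample page, only to exercise the sub.
my $sample = join "\n",
    '<HTML><HEAD><TITLE>Some node</TITLE></HEAD><BODY>',
    '<TABLE><TR><TD>site header</TD></TR></TABLE>',
    '<P>The node text.</P>',
    '<!-- nodelets start -->',
    '<TABLE><TR><TD>nodelets</TD></TR></TABLE></BODY></HTML>';

print strip_page($sample);
```

Taking the page as a string and returning the result keeps the file handling out of it, so you can print to STDOUT or to a new file as you like.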