Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
I have a very large offline collection of HTML content that I would like to manipulate.
Each page in this collection has comment tags that delineate where the site's advertising banners, sidebars, navigation, and so on go.
An example page looks like this (pseudo, to protect the site owner):
[...] lots of beginning HTML dreck, doctype, Javascript banner ad rotation stuff, etc.
<!-- END MAIN HEADER CODE -->
I'd like to throw everything above the comment tag above, away, keeping everything below this point, up to the end of the bottom navigation elements.
<!-- TOP CHAPTER/SECTION NAV CODE -->
HTML for content-specific navigation in the body of the page. I want to keep this
<!-- END TOP CHAPTER/SECTION NAV CODE -->
<!-- BEGIN CHAPTERTITLE -->
Title of the body of content I want to keep is here
<!-- END CHAPTERTITLE -->
Some other related HTML goes here, I need to keep this
<!-- BEGIN CHAPTER -->
The actual content itself is here, I need to keep this also
<!-- END CHAPTER -->
<!-- BOTTOM CHAPTER/SECTION NAV CODE -->
Duplicated elements of content navigation here, same as on the top of the content itself. I'd like to keep this too.
<!-- END BOTTOM CHAPTER/SECTION NAV CODE -->
Everything after this last comment tag, I need to throw away.
Basically, I need to rip the center out of each page, stripping off the top of the HTML content (sitewide navigation) and the bottom of the HTML content (the banners).
The content I want is bordered by the known HTML comments shown above.
How can I do this in an automated fashion, without resorting to a multiple-pass Perl one-liner?
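For illustration only, here is a minimal single-pass sketch of one way this could be approached; it is not taken from any reply in the thread. It assumes the two outer markers appear exactly as written in the example page, that the opening marker itself can be dropped while the closing marker is kept, and that each file fits in memory. The sub name `extract_stomach` and the `.stomach` output suffix are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Outer markers, copied from the example page (assumed to match exactly).
my $START = '<!-- END MAIN HEADER CODE -->';
my $END   = '<!-- END BOTTOM CHAPTER/SECTION NAV CODE -->';

# Return the text after the opening marker up to and including the
# closing marker, or undef if either marker is missing.
sub extract_stomach {
    my ($html) = @_;
    # \Q...\E quotes the markers so they match literally;
    # /s lets "." match newlines so the capture can span lines.
    return $html =~ /\Q$START\E(.*?\Q$END\E)/s ? $1 : undef;
}

# Process every file named on the command line in one pass each.
for my $file (@ARGV) {
    open my $in, '<', $file or die "Cannot read $file: $!";
    my $html = do { local $/; <$in> };    # slurp the whole file
    close $in;

    my $body = extract_stomach($html);
    unless (defined $body) {
        warn "Markers not found in $file, skipping\n";
        next;
    }

    # Write next to the original instead of overwriting it
    # (the ".stomach" suffix is a hypothetical choice).
    open my $out, '>', "$file.stomach" or die "Cannot write $file.stomach: $!";
    print {$out} $body;
    close $out or die "Cannot close $file.stomach: $!";
}
```

Saved under a name such as `stomach.pl` (hypothetical), it could be run as `perl stomach.pl *.html`. Writing to a new file leaves the originals intact, which seems prudent for a large collection until the marker assumptions have been checked on a sample.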
Re: Pulling the "stomach" out of a directory of HTML files
by dragonchild (Archbishop) on Oct 17, 2004 at 17:15 UTC
by DaveH (Monk) on Oct 18, 2004 at 01:20 UTC

Re: Pulling the "stomach" out of a directory of HTML files
by TedPride (Priest) on Oct 17, 2004 at 23:39 UTC