I have a very large offline collection of HTML content that I would like to manipulate.

Each page in this collection has HTML comments that delineate where the site's advertising banners, sidebars, navigation, and so on go.

An example page looks like this (pseudo, to protect the site owner):

[...] lots of beginning HTML dreck, doctype, Javascript banner ad rotation stuff, etc.
<!-- END MAIN HEADER CODE -->
I'd like to throw everything above the comment tag above, away, keeping everything below this point, up to the end of the bottom navigation elements.
<!-- TOP CHAPTER/SECTION NAV CODE -->
HTML for content-specific navigation in the body of the page. I want to keep this
<!-- END TOP CHAPTER/SECTION NAV CODE -->
<!-- BEGIN CHAPTERTITLE -->
Title of the body of content I want to keep is here
<!-- END CHAPTERTITLE -->
Some other related HTML goes here, I need to keep this
<!-- BEGIN CHAPTER -->
The actual content itself is here, I need to keep this also
<!-- END CHAPTER -->
<!-- BOTTOM CHAPTER/SECTION NAV CODE -->
Duplicated elements of content navigation here, same as on the top of the content itself. I'd like to keep this too.
<!-- END BOTTOM CHAPTER/SECTION NAV CODE -->
Everything after this last comment tag, I need to throw away.

Basically, I need to rip the center out of each page, stripping off the top of the HTML content (the sitewide navigation) and the bottom of the HTML content (the banners).

The content I want is bordered by the known HTML comments shown above.

How can I do this in an automated fashion, without resorting to a multiple-pass Perl one-liner?
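For illustration, here is a minimal single-pass sketch of the extraction described above. It is written in Python rather than Perl, and it assumes the two boundary comments appear exactly as in the pseudo page, once per file. The names `extract_stomach` and `process_tree`, the `*.html` glob, and the separate output directory are hypothetical choices for this sketch, not anything taken from the real site.

```python
from pathlib import Path

# Boundary comments copied from the pseudo page above (assumed to match the real files exactly).
START = "<!-- END MAIN HEADER CODE -->"
END = "<!-- END BOTTOM CHAPTER/SECTION NAV CODE -->"


def extract_stomach(html, start=START, end=END):
    """Return the text after `start` up to and including `end`.

    Returns None when either marker is missing, so the caller can
    skip the file instead of writing a truncated page.
    """
    begin = html.find(start)
    if begin == -1:
        return None
    begin += len(start)              # drop the header marker itself
    stop = html.find(end, begin)     # only look for the end marker after the start
    if stop == -1:
        return None
    return html[begin:stop + len(end)]


def process_tree(src_dir, dst_dir, pattern="*.html"):
    """Extract the middle of every matching file under src_dir into dst_dir.

    The directory layout is mirrored. Files without both markers are
    left alone and returned in a list. dst_dir should sit outside
    src_dir so outputs are not picked up as inputs.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    skipped = []
    for path in sorted(src.rglob(pattern)):
        # latin-1 maps every byte to one character and back, so files in an
        # unknown encoding round-trip unchanged; the markers are plain ASCII.
        text = path.read_bytes().decode("latin-1")
        middle = extract_stomach(text)
        if middle is None:
            skipped.append(path)
            continue
        out = dst / path.relative_to(src)
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_bytes(middle.encode("latin-1"))
    return skipped
```

Calling `process_tree("site", "site_trimmed")` (both directory names are placeholders) would leave the originals untouched and return the paths that lacked one of the markers, which is worth checking by hand before trusting the output on a large collection.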


In reply to Pulling the "stomach" out of a directory of HTML files by Anonymous Monk
