Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a very very large offline collection of HTML content that I would like to manipulate.

Each page of this collection has comment tags that delineate where the site's advertising banners go, sidebars, navigation, and so on.

An example page looks like this (pseudo, to protect the site owner):

[...] lots of beginning HTML dreck, doctype, Javascript banner ad rotation stuff, etc.
<!-- END MAIN HEADER CODE -->
I'd like to throw away everything above the comment tag above, keeping everything below this point, up to the end of the bottom navigation elements.
<!-- TOP CHAPTER/SECTION NAV CODE -->
HTML for content-specific navigation in the body of the page. I want to keep this.
<!-- END TOP CHAPTER/SECTION NAV CODE -->
<!-- BEGIN CHAPTERTITLE -->
Title of the body of content I want to keep is here.
<!-- END CHAPTERTITLE -->
Some other related HTML goes here; I need to keep this.
<!-- BEGIN CHAPTER -->
The actual content itself is here; I need to keep this also.
<!-- END CHAPTER -->
<!-- BOTTOM CHAPTER/SECTION NAV CODE -->
Duplicated elements of content navigation here, same as on the top of the content itself. I'd like to keep this too.
<!-- END BOTTOM CHAPTER/SECTION NAV CODE -->
Everything after this last comment tag, I need to throw away.

Basically I need to rip the center of the page out, stripping off the top of the HTML content (the sitewide navigation) and the bottom of the HTML content (the banners).

The content I want is bordered by the known HTML comments shown above.

How can I do this in an automated fashion, without resorting to a multiple-pass Perl one-liner?

Replies are listed 'Best First'.
Re: Pulling the "stomach" out of a directory of HTML files
by dragonchild (Archbishop) on Oct 17, 2004 at 17:15 UTC
    perl -ni -e 'next unless /START/ .. /END/;print' *.html

    That should work. I tested it on a very basic file. I would recommend doing this on a copy, as it will change the files in-place. (You could do perl -ni.bak -e '...' *.html instead, which will leave the old file in X.html.bak.)

    Also, this will keep START and END in the file. That may or may not be what you want.
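    With the actual comment markers from the question substituted for START and END (untested against the real files; note that the slash in the closing marker has to be escaped, since / is also the regex delimiter), that would be something like:

    perl -ni.bak -e 'print if /<!-- END MAIN HEADER CODE -->/ .. /<!-- END BOTTOM CHAPTER\/SECTION NAV CODE -->/' *.html

    As noted above, this keeps the two marker lines themselves in the output.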


      Just in case, the code to exclude the START and END is shown below.

      perl -ni -e 'next unless $x = /START/ .. /END/; print if $x !~ /^1$|E0$/' *.html

      Hope that this helps. Read perldoc perlop for more details on the flip-flop (scalar range) operator.
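      As a quick illustration of why that pattern works: in scalar context the flip-flop returns a counter that starts at 1 on the first line of the range and has "E0" appended on the last. A minimal sketch, assuming a hypothetical sample.txt containing START/END marker lines:

      perl -ne 'print "$x: $_" if $x = /START/ .. /END/' sample.txt
      # prints e.g. "1: START ...", "2: ...", "3E0: END ..." for a three-line range,
      # so /^1$|E0$/ matches exactly the two boundary lines and nothing in between.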

      Cheers,

      -- Dave :-)


Re: Pulling the "stomach" out of a directory of HTML files
by TedPride (Priest) on Oct 17, 2004 at 23:39 UTC
    Assuming there will always be a match for those two flags, and that there is only one of each of those flags in each page:
    ($text) = $text =~ /<!-- END MAIN HEADER CODE -->(.*?)<!-- END BOTTOM CHAPTER\/SECTION NAV CODE -->/s;
    Or, as someone pointed out on CB, you can also use index / rindex / substr:
    my $start = '<!-- END MAIN HEADER CODE -->';
    my $end   = '<!-- END BOTTOM CHAPTER/SECTION NAV CODE -->';
    my $index = index($text, $start) + length($start);
    $text = substr($text, $index, rindex($text, $end) - $index);
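    A minimal single-pass sketch of how that could be applied to the whole collection (the .stripped output suffix and the per-file slurp loop are just illustrative, not part of the reply above):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $start = '<!-- END MAIN HEADER CODE -->';
    my $end   = '<!-- END BOTTOM CHAPTER/SECTION NAV CODE -->';

    for my $file (@ARGV) {
        open my $in, '<', $file or die "Can't read $file: $!";
        my $text = do { local $/; <$in> };          # slurp the whole page
        close $in;

        my $idx = index($text, $start);
        my $to  = rindex($text, $end);
        next if $idx < 0 or $to < 0;                # skip pages missing a marker

        my $from = $idx + length($start);
        open my $out, '>', "$file.stripped" or die "Can't write $file.stripped: $!";
        print $out substr($text, $from, $to - $from);
        close $out;
    }

    Run it as perl strip.pl *.html (the script name is hypothetical). Using index/rindex here also sidesteps having to escape regex metacharacters in the markers.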