Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi All

I've searched Google and read through my Cookbook, but I can't work out how to do what I need. I hope someone can help.

I would like to use Perl on the command line to strip a section out of an HTML file. For example, I would like to remove:

<!-- REMOVESTART -->
test test test
<!-- REMOVEEND -->


So I would like to remove the comment tags and everything between them from an HTML file. The problem is that the inner content is always changing and contains other HTML tags, etc.

Thanks so much for any help!

Re: Perl Command Line Regex
by johngg (Canon) on Nov 12, 2006 at 21:52 UTC
    If your start and end tags are easily identified it could be as simple as

    $ perl -ne 'print unless /<!-- REMOVESTART -->/ .. /<!-- REMOVEEND -->/;' somefile.html > newfile.html

    You may need to do /\Q ... \E/ if your tags contain regular expression metacharacters that would need to be escaped. If your start and end tags are more difficult to identify then an HTML parser would be more appropriate.

    Cheers,

    JohnGG

Re: Perl Command Line Regex
by Fletch (Bishop) on Nov 12, 2006 at 21:10 UTC

    Well, since you haven't shown what you have tried, no one's going to know where you've gone wrong so far. At any rate:

    • Unless you've got very, very simple HTML you don't want to try to parse it with a regex
    • For this case you may find that HTML::TokeParser::Simple or HTML::TokeParser is going to work the best (copy things verbatim as you go along; stop printing when you see the start comment, start printing again after the end comment)

Re: Perl Command Line Regex
by rminner (Chaplain) on Nov 12, 2006 at 21:46 UTC
    Hi,
    Fletch is right that it's usually not a good idea to parse SGML/HTML/XML with Perl regexes. But if you really want to do it, the following snippet will remove the two tags and the text in between.
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $string = <<EOFDATA;
    whatever
    <!-- REMOVESTART -->
    test test test
    <!-- REMOVEEND -->
    you are trying to do
    EOFDATA

    my $start_tag = quotemeta('<!-- REMOVESTART -->');
    my $end_tag   = quotemeta('<!-- REMOVEEND -->');
    $string =~ s{$start_tag.*?$end_tag\n*}{}gms;
    print $string;