Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I am writing a Perl script for stripping HTML code along with JavaScript. It should remove the comments in each kind of code. The file will look like this:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <html> <!-- testing test--> "<!-- test -->" <body> <script type="text/javascript"> document.write('<h2>This is a header</h2>');"/* testing */" document.write('<p>/*hello*/This is a paragraph</p>'); /* sdkfjhsdf +hsdfhsdjkfhsjd fhsjdh fdjs sdfdh sfjh sdfhsd jhsdf hsdf*/ /* testing* +/ // hello this is a comment line /* CHEC This too */ "/*test /*test*/test*//*hello*/" alert("//hello"); '// This is for testing' alert("hello"); // This is for testing' "/* gdjkfghdf gdflkg jdfklgjdfkjgdfkl */" '"/* gdjkfghdf gdflkg jdfk6lgjdfkjgdfkl */' /* hello this is multiline multiline comment */ </script> <!-- fjghfdj ghjfdghjhg fgdfgdfgklfj klfg klfd flkgjhfd jkghf fgfdlkgjdfg --> <div align="center"> This is for testing.<br> Welcome to INDIA<br> <p> "<!-- hai comment -->" HI TESTING </p> <strike>this for testing<br> </strike> <center><!-- adasdasdasdasdas --> "<!-- aksdja +sdjaskdjaks"djaksdj"askd aksdjak -->" centralizing the string</center +> <input type=button name='but' value='check'/> </body> </html>
Desired Output:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <html> "" <body> <script type="text/javascript"> document.write('<h2>This is a header</h2>');"/* testing */" document.write('<p>/*hello*/This is a paragraph</p>'); "/*test /*test*/test*//*hello*/" alert("//hello"); '// This is for testing' alert("hello"); "/* gdjkfghdf gdflkg jdfklgjdfkjgdfkl gjkdfjgdkfgjdkfgjdfjgdfg dfg +fdg */" '"/* gdjkfghdf gdflkg jdfklgjdfkjgdfkl gjkdfjgdkfgjdkfgjdfjgdfg dfg + fdg */' </script> <div align="center"> This is for testing.<br> Welcome to INDIA<br> <p> "" HI TESTING </p> <strike>this for testing<br> </strike> <center> "" centralizing the string</center> <input type=button name='but' value='check'/> </body> </html>
Can anyone give me a regular expression that fulfills this requirement, or suggest another way to do it? Thanks in advance.

Replies are listed 'Best First'.
Re: HTML stripper...
by Anonymous Monk on Nov 22, 2010 at 05:56 UTC

      A dense list of URLs like that and no RickRoll? I'm shocked, SHOCKED!

      ...roboticus

        Awooga. Awooga. Awooga! Censure. Censure. Censure. You shouted.

      Hi all, I think you have misunderstood my question. I don't want to touch (strip) any HTML tags. I want to remove only the comments in the HTML and the JavaScript, both at the same time. Thanks.
        I want to remove only the comments in the HTML and the JavaScript, both at the same time. Thanks.

        So use one of the previously linked solutions, but configure it to remove only comments and JavaScript.

Re: HTML stripper...
by JavaFan (Canon) on Nov 22, 2010 at 11:23 UTC
    It should remove the comments in each code.
    That's easy. However, if your goal is to remove comments, and nothing else, you cannot do it with a simple regexp. You will have to fully tokenize both HTML and Javascript (and partially parse HTML and Javascript to know when the document switches language). What is a comment, after all, is context sensitive.

      Indeed. You are probably going to go straight to something like HTML::Parser: a true parser that can invoke action routines when a particular construct has been recognized, no matter what it actually takes to recognize the presence of that construct.
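To make the tokenizing point concrete, here is a minimal hand-rolled sketch in plain Perl (it does not use HTML::Parser). The names `strip_comments` and `strip_js_comments` are hypothetical, and the scanner makes simplifying assumptions: it treats any `<!-- ... -->` outside a script element as a comment, and it does not handle JavaScript regex literals, comment-like text inside attribute values, or `<style>` and CDATA sections. It is meant to show why context matters, not to be a complete solution.

```perl
use strict;
use warnings;

# Hypothetical helper: drop // and /* */ comments from a script body while
# copying quoted strings through untouched.
sub strip_js_comments {
    my ($js) = @_;
    my $out = '';
    pos($js) = 0;
    while (pos($js) < length $js) {
        if ($js =~ /\G("(?:[^"\\\n]|\\.)*"|'(?:[^'\\\n]|\\.)*')/gc) {
            $out .= $1;    # string literal: keep verbatim
        }
        elsif ($js =~ m{\G/\*.*?\*/}gcs) {
            # block comment: drop it
        }
        elsif ($js =~ m{\G//[^\n]*}gc) {
            # line comment: drop it, but leave the newline in place
        }
        elsif ($js =~ /\G([^"'\/]+|.)/gcs) {
            $out .= $1;    # ordinary code
        }
    }
    return $out;
}

# Hypothetical helper: drop HTML comments in markup context and hand the
# body of each script element to strip_js_comments().
sub strip_comments {
    my ($html) = @_;
    my $out = '';
    pos($html) = 0;
    while (pos($html) < length $html) {
        if ($html =~ /\G<!--.*?-->/gcs) {
            # HTML comment: drop it
        }
        elsif ($html =~ m{\G(<script\b[^>]*>)(.*?)(</script\s*>)}gcis) {
            my ($open, $body, $close) = ($1, $2, $3);
            $out .= $open . strip_js_comments($body) . $close;
        }
        elsif ($html =~ /\G([^<]+|.)/gcs) {
            $out .= $1;    # any other markup or text
        }
    }
    return $out;
}

print strip_comments(qq{<b>x</b><!-- note --><script>f(); // call\n</script>\n});
```

The `\G` anchor with the `/gc` flags lets each branch try to match exactly at the current position without resetting it on failure, which is what turns a handful of regexes into a small tokenizer.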

Re: HTML stripper...
by kcott (Archbishop) on Nov 22, 2010 at 08:22 UTC

    This regex will remove valid HTML comments:

    s{<!-- \s ( . (?!-- \s* >) )* \s -- \s* >}{}gmsx

    Possibly a typo in mocked-up test HTML, but this is not a valid HTML comment:

    <!-- testing test-->

    It's invalid because there's no space before -->.

    Here's the section of the W3C HTML Recommendation dealing with the syntax of HTML comments.

    You've also posted comments that seem to indicate that you want Javascript removed but your sample output doesn't bear that out. Please clarify this point.

    Update:

    There appears to be some disagreement over what constitutes a valid HTML comment.

    I used the following code to test my solution:

    #!perl
    use 5.12.0;
    use warnings;

    {
        local $/ = undef;
        open my $fh, '<', $ARGV[0] or die $!;
        (my $html = <$fh>) =~ s{<!-- \s ( . (?!-- \s* >) )* \s -- \s* >}{}gmsx;
        close $fh;
        say $html;
    }

    This produced the OP's "Desired Output" with the exception of

    <!-- testing test-->

    remaining in the output.

    I then checked the W3C reference document (linked above) which states:

    HTML comments have the following syntax:

    <!-- this is a comment -->
    <!-- and so is this one,
        which occupies more than one line -->

    Note the whitespace between comment and --> in both cases. Also note that the documentation makes no further reference to whitespace in that position.

    If anyone has more definitive information (e.g. Backus-Naur Form notation), a link to that would be useful and welcome.

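    For the OP: to also remove that remaining comment, regardless of whether it's valid or not, changing the \s to \s* is not enough on its own, because the character right before the closing -- still has to be consumed by something. Drop the mandatory \s on both sides and test the negative lookahead before each character instead of after it:

    s{<!-- (?: (?!-- \s* >) . )* -- \s* >}{}gmsx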

    -- Ken
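The difference between a strict and a tolerant comment pattern can be checked on a fragment of the OP's sample. This is only a sketch: `$strict` is the first pattern from the post, while `$tolerant` and the variable names are choices made for this example. The tolerant pattern tests the negative lookahead before each character, so the character immediately preceding `-->` can still be consumed.

```perl
use strict;
use warnings;

# Strict: whitespace is required right after <!-- and right before -->.
my $strict = qr{<!-- \s ( . (?!-- \s* >) )* \s -- \s* >}msx;

# Tolerant (assumption of this sketch): no mandatory whitespace; the
# lookahead runs before each character is consumed.
my $tolerant = qr{<!-- (?: (?!-- \s* >) . )* -- \s* >}msx;

my $sample = qq{<html> <!-- testing test--> "<!-- test -->" <body>};

(my $strict_out   = $sample) =~ s/$strict//g;
(my $tolerant_out = $sample) =~ s/$tolerant//g;

print "strict:   $strict_out\n";
print "tolerant: $tolerant_out\n";
```

With the strict pattern the first comment survives, because nothing in the pattern can match the `t` that sits directly before `-->`; the tolerant pattern removes both comments.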

      It's invalid because there's no space before -->
      That's bogus. There's no need for space to be there. Nor does there have to be space as the first character following a COM sequence (COM being --).

      OTOH, your pattern falsely considers <!-- -- --> to be a valid comment, while it doesn't consider <!-- <!-- --> --> to be valid.

      This matches HTML comments:

      <!(?:--(?:[^-]*(?:-[^-]+)*)--\s*)*>
      although if you are truly pedantic, you'd replace the \s with the set of characters the HTML DTD defines as white space characters.
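A quick way to see what this pattern accepts is to run it, anchored at both ends, against a few candidate strings. The helper name `is_comment` and the list of cases are assumptions made for this sketch; the pattern itself is copied from the post.

```perl
use strict;
use warnings;

# Pattern from the post: zero or more "-- ... --" pairs between <! and >.
my $comment = qr/<!(?:--(?:[^-]*(?:-[^-]+)*)--\s*)*>/;

# Hypothetical helper: true if the whole string is a single comment declaration.
sub is_comment {
    my ($s) = @_;
    return $s =~ /\A$comment\z/ ? 1 : 0;
}

my @cases = (
    '<!-- testing test-->',    # no space before the closing dashes
    '<!-- a-b -->',            # single dash inside the comment text
    '<!-- <!-- --> -->',       # two "-- ... --" pairs in one declaration
    '<!-- x -- >',             # space between the closing -- and >
    '<!-- -- -->',             # third -- opens a comment that never closes
    '<! -- x -->',             # space between <! and the opening --
);

for my $c (@cases) {
    printf "%-22s %s\n", $c, is_comment($c) ? 'comment' : 'not a comment';
}
```

The first four cases match and the last two do not, which is in line with the objections raised above about spaces and about `<!-- -- -->`.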

        Firstly, I've added an update to my post, please read that.

        Secondly, rather than just stating "That's bogus ...", perhaps you could cite a reference.

        -- Ken

      Note the whitespace between comment and --> in both cases. Also note that the documentation makes no further reference to whitespace in that position.
      Note also that all the examples you cite lack capital letters. And I'm pretty sure the documentation makes no further mention of capital letters in comments - with your logic, they're forbidden. In fact, following your logic, there are only two HTML comments: the examples from the documentation.
      Did you even read what you linked to? There is nothing in there about requiring a space before the closing dashes. The only whitespace rules they mention are basically:
      Legal: "<!--"
      Illegal: "<! --"
      -and-
      Legal: "-->"
      Legal: "-- >"

      Elda Taluta; Sarks Sark; Ark Arks

        "Did you even read what you linked to?"

        That's fairly unpleasant, bordering on rudeness.

        Note the <!-- (at the start of the regex) and the -- \s* > (at the end of the regex) which deals with those rules.

        Also take a look at my updated post which indicates more of what I read.

        -- Ken

Re: HTML stripper...
by Anonymous Monk on Nov 22, 2010 at 05:10 UTC
    I think I would call that a "comment stripper", since it strips off comments, not markup.