in reply to Removing XML comments with regex

The --> closing delimiter cannot appear inside a comment, so your job is easy:
$xml =~ s/<!--.*?-->//sg;
(Update: Corrected to provide minimal match)

Re^2: Removing XML comments with regex
by moritz (Cardinal) on Oct 24, 2007 at 19:31 UTC
    Test this regex with foo <!-- bar --> baz <!-- qox --> blurb - it removes too much.

    You could do something like this: $xml =~ s/<!--.*?-->//g;
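To make the difference concrete, here is a small sketch that runs the greedy and the minimal substitution on the test string above. The variable names are only for illustration.

```perl
use strict;
use warnings;

my $xml = 'foo <!-- bar --> baz <!-- qox --> blurb';

# Greedy: .* runs to the LAST "-->", so " baz " is swallowed too.
(my $greedy = $xml) =~ s/<!--.*-->//g;

# Minimal: .*? stops at the first "-->" after each "<!--".
(my $minimal = $xml) =~ s/<!--.*?-->//g;

print "greedy:  [$greedy]\n";    # greedy:  [foo  blurb]
print "minimal: [$minimal]\n";   # minimal: [foo  baz  blurb]
```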

    Update: You can use the fact that '--' may not occur in xml comments:

    $xml =~ s/<!--(?:.(?<!--))*-->//g;

    Sadly, lookarounds are error prone (from the programmer's side), so don't trust this regex unless you've tested it carefully. I don't think there is a big speed gain in it (if any at all); I hope I find the tuits to benchmark it.
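In the spirit of "test it carefully", a minimal check of the lookbehind version might look like the sketch below. The inputs are made up for illustration and are nowhere near exhaustive; note that without /s this pattern still will not match a comment containing a newline.

```perl
use strict;
use warnings;

# Each case is [ input, expected output after stripping comments ].
my @cases = (
    [ 'foo <!-- bar --> baz <!-- qox --> blurb', 'foo  baz  blurb' ],
    [ 'a<!---->b',                               'ab' ],   # empty comment
    [ 'a<!-- x - y -->b',                        'ab' ],   # single hyphens inside
);

my @got;
for my $case (@cases) {
    my ($in, $want) = @$case;
    # Each "." may only be consumed if it does not complete a "--" pair.
    (my $out = $in) =~ s/<!--(?:.(?<!--))*-->//g;
    push @got, $out;
    printf "%-4s %s\n", ($out eq $want ? 'ok' : 'FAIL'), $in;
}
```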

      And then that'll fail for comments that span lines. But you can tweak that by slurping the entire file in, but then someone else will find another corner case that breaks that . . .

      If you're parsing XML, use a proper parser. Unless you can guarantee a very specific input format any attempt using solely regexen is going to have problems. It's not like there aren't 19 bazillion different off-the-shelf XML parsing solutions available out there which will handle all the nastiness for you.

      Update: And no, saying there's another corner case is not "FUD".

      <?xml version="1.0"?> <root><![CDATA[ <!-- OMGWTFBBQ --> ]]></root>

      Update: Or to make the breakage more explicit:

      <?xml version="1.0"?> <root><![CDATA[ <!-- OMGWTFBBQ ]]>Shoulda used a <!-- real parser --></root>
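A quick sketch of what the minimal-match regex does to that second document, using only core Perl. The match starts at the "<!--" inside the CDATA section and ends at the "-->" of the real comment, so the result is no longer well-formed.

```perl
use strict;
use warnings;

my $xml = '<?xml version="1.0"?> <root><![CDATA[ <!-- OMGWTFBBQ ]]>'
        . 'Shoulda used a <!-- real parser --></root>';

# The regex knows nothing about CDATA, so it treats the literal
# "<!--" inside the CDATA section as the start of a comment.
(my $stripped = $xml) =~ s/<!--.*?-->//sg;

print "$stripped\n";
# <?xml version="1.0"?> <root><![CDATA[ </root>
```

The CDATA terminator and the text content are gone, which is exactly the kind of damage a parser would avoid.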
      You're right about the minimal match on .*?, but keep /s in there or it will miss multiline comments.
      $xml =~ s/<!--.*?-->//sg;
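A short illustration of why /s matters, assuming the whole document is already in one string (the sample input is made up):

```perl
use strict;
use warnings;

my $xml = "<a><!-- spans\ntwo lines --></a>";

# Without /s, "." does not match "\n", so the comment is left in place.
(my $without_s = $xml) =~ s/<!--.*?-->//g;

# With /s, "." matches any character, newlines included.
(my $with_s = $xml) =~ s/<!--.*?-->//sg;

print "without /s: [$without_s]\n";
print "with /s:    [$with_s]\n";    # with /s:    [<a></a>]
```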
        Right on,
        s/<!--.*?-->//gs
        is the proper way to go, since we don't want to be greedy and treat the beginning of the first comment until the end of the last comment as a giant comment itself.