PerlMonks  

Re: Removing XML comments with regex

by gamache (Friar)
on Oct 24, 2007 at 19:15 UTC


in reply to Removing XML comments with regex

The --> end tag is not allowed within comments, so your job is easy:
$xml =~ s/<!--.*?-->//sg;
(Update: Corrected to provide minimal match)

Re^2: Removing XML comments with regex
by moritz (Cardinal) on Oct 24, 2007 at 19:31 UTC
    Test this regex with foo <!-- bar --> baz <!-- qox --> blurb: it removes too much.

    You could do something like this: $xml =~ s/<!--.*?-->//g;
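A minimal sketch of the difference, using the test string from this post. The helper names strip_greedy and strip_lazy are made up for illustration, and the greedy pattern is assumed to be what the parent node contained before its update:

```perl
use strict;
use warnings;

# Greedy: .* runs on to the LAST "-->" in the string.
sub strip_greedy {
    my ($xml) = @_;
    $xml =~ s/<!--.*-->//g;
    return $xml;
}

# Non-greedy: .*? stops at the FIRST "-->" after each "<!--".
sub strip_lazy {
    my ($xml) = @_;
    $xml =~ s/<!--.*?-->//g;
    return $xml;
}

# Test string from the post above.
my $input = 'foo <!-- bar --> baz <!-- qox --> blurb';

print 'greedy: [', strip_greedy($input), "]\n";  # greedy: [foo  blurb]
print 'lazy:   [', strip_lazy($input),   "]\n";  # lazy:   [foo  baz  blurb]
```

The greedy version eats " baz " along with both comments, which is the "removes too much" symptom.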

    Update: You can use the fact that '--' may not occur in xml comments:

    $xml =~ s/<!--(?:.(?<!--))*-->//g;

    Sadly lookarounds are error prone (from the programmer's side), so don't trust this regex unless you've tested it carefully. I don't think there is a big speed gain in it (if any at all); I hope I find the tuits to benchmark it.
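In the spirit of "test it carefully", here is a small table-driven check of the lookbehind regex. The helper name and the test inputs are my own, not from the thread. The last two cases are the interesting ones: without /s the dot will not cross a newline, and a comment whose text starts with a hyphen (which, as far as I can tell, the XML comment grammar permits) is left in place.

```perl
use strict;
use warnings;

# The lookbehind regex from the post, wrapped in a helper
# (the name strip_lookbehind is made up for this sketch).
sub strip_lookbehind {
    my ($xml) = @_;
    $xml =~ s/<!--(?:.(?<!--))*-->//g;
    return $xml;
}

# [ input, expected output ] pairs
my @cases = (
    [ 'foo <!-- bar --> baz <!-- qox --> blurb', 'foo  baz  blurb' ],
    [ 'a<!---->b',        'ab' ],                # empty comment
    [ 'a<!--x-y-->b',     'ab' ],                # single hyphens inside
    [ "a<!-- x\n y -->b", "a<!-- x\n y -->b" ],  # no /s: left untouched
    [ 'a<!---x-->b',      'a<!---x-->b' ],       # leading '-': left untouched
);

for my $i (0 .. $#cases) {
    my ($in, $want) = @{ $cases[$i] };
    my $got = strip_lookbehind($in);
    printf "case %d: %s\n", $i + 1, ($got eq $want ? 'as expected' : 'SURPRISE');
}
```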

      And then that'll fail for comments that span lines. You can work around that by slurping the entire file in, but then someone else will find another corner case that breaks that . . .

      If you're parsing XML, use a proper parser. Unless you can guarantee a very specific input format any attempt using solely regexen is going to have problems. It's not like there aren't 19 bazillion different off-the-shelf XML parsing solutions available out there which will handle all the nastiness for you.
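For what the parser route might look like, here is a rough sketch using XML::LibXML, which is just one of the many options and is my pick, not something named in this thread. It is a CPAN module rather than core Perl, so it is loaded lazily here. The helper name and the sample input are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: drop every comment node from a parsed document.
sub strip_comments_with_parser {
    my ($xml) = @_;
    require XML::LibXML;    # CPAN module, loaded only when needed
    my $doc = XML::LibXML->load_xml(string => $xml);
    # The XPath node test comment() selects real comment nodes only,
    # so comment-looking text elsewhere in the document is not touched.
    $_->unbindNode for $doc->findnodes('//comment()');
    return $doc->toString;
}

my $xml = '<root><!-- note --><item>1</item></root>';
my $out = eval { strip_comments_with_parser($xml) };
print defined $out ? $out : "XML::LibXML unavailable or parse failed: $@\n";
```

The parser decides what is and is not a comment, so the regex never has to.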

      Update: And no, saying there's another corner case is not "FUD".

      <?xml version="1.0"?> <root><![CDATA[ <!-- OMGWTFBBQ --> ]]></root>

      Update: Or to make the breakage more explicit:

      <?xml version="1.0"?> <root><![CDATA[ <!-- OMGWTFBBQ ]]>Shoulda used a <!-- real parser --></root>
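To see the breakage concretely, here is a quick sketch that runs the non-greedy substitution over the second document above, with its closing tag written as </root>. The helper name strip_lazy is made up for illustration.

```perl
use strict;
use warnings;

# Non-greedy, multi-line comment stripper as suggested in the thread.
sub strip_lazy {
    my ($xml) = @_;
    $xml =~ s/<!--.*?-->//sg;
    return $xml;
}

# The first "<!--" is literal text inside a CDATA section;
# only the second one opens a real comment.
my $xml = '<?xml version="1.0"?> <root><![CDATA[ <!-- OMGWTFBBQ ]]>'
        . 'Shoulda used a <!-- real parser --></root>';

print strip_lazy($xml), "\n";
# prints: <?xml version="1.0"?> <root><![CDATA[ </root>
```

The match starts inside the CDATA section and ends at the real comment's "-->", so the "]]>" terminator and the text after it are deleted and the result is no longer well-formed.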
      You're right about the minimal match on .*?, but keep /s in there or it will miss multiline comments.
      $xml =~ s/<!--.*?-->//sg;
        Right on,
        s/<!--.*?-->//gs
        is the proper way to go, since we don't want to be greedy and treat the beginning of the first comment until the end of the last comment as a giant comment itself.
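The /s point from a couple of posts up is easy to check too. A minimal sketch, with a made-up multi-line input and made-up helper names:

```perl
use strict;
use warnings;

# Without /s the dot does not match "\n", so the match cannot cross lines.
sub strip_without_s {
    my ($xml) = @_;
    $xml =~ s/<!--.*?-->//g;
    return $xml;
}

# With /s the dot matches any character, newlines included.
sub strip_with_s {
    my ($xml) = @_;
    $xml =~ s/<!--.*?-->//gs;
    return $xml;
}

# Hypothetical input with a comment spanning two lines.
my $xml = "<a>\n<!-- first line\n     second line -->\n<b/>\n</a>\n";

print strip_without_s($xml) eq $xml
    ? "without /s: comment still there\n"
    : "without /s: comment removed\n";
print "with /s:\n", strip_with_s($xml);
```

This only helps if the whole document is in one string; applied line by line, neither version can see a comment that spans lines.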
