PerlMonks  

Re^2: Removing XML comments with regex

by moritz (Cardinal)
on Oct 24, 2007 at 19:31 UTC ( [id://646979]=note )


in reply to Re: Removing XML comments with regex
in thread Removing XML comments with regex

Test these regexes with foo <!-- bar --> baz <!-- qox --> blurb: they remove too much.

You could do something like this: $xml =~ s/<!--.*?-->//g;
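To see the difference on that test string, here is a small sketch. The greedy pattern is only an assumption about what the parent node used (it is not quoted here), and the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $input = 'foo <!-- bar --> baz <!-- qox --> blurb';

# Greedy (assumed form of the parent's regex): .* runs on to the LAST "-->",
# so the text between the two comments is swallowed as well.
(my $greedy = $input) =~ s/<!--.*-->//g;

# Non-greedy: .*? stops at the FIRST "-->" after each "<!--".
(my $lazy = $input) =~ s/<!--.*?-->//g;

printf "greedy: '%s'\n", $greedy;   # greedy: 'foo  blurb'
printf "lazy:   '%s'\n", $lazy;     # lazy:   'foo  baz  blurb'
```

Note the doubled spaces in the output: only the comments are removed, not the whitespace around them.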

Update: You can use the fact that '--' may not occur inside XML comments:

$xml =~ s/<!--(?:.(?<!--))*-->//g;

Sadly, lookarounds are error prone (from the programmer's side), so don't trust this regex unless you've tested it carefully. I don't think there is a big speed gain in it (if any at all); I hope I find the tuits to benchmark it.
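A minimal sketch of such a test, not a full test suite. The helper name strip_comments and the test cases are assumptions of mine, and /s is added so that . also matches newlines. The last case is a comment whose text starts with a single hyphen, which is allowed by my reading of the XML comment grammar; the lookbehind sees the '--' formed with the opening delimiter and refuses to match, so the comment is left in place.

```perl
use strict;
use warnings;

# Hypothetical helper: the lookbehind-based substitution, with /s added.
sub strip_comments {
    my ($xml) = @_;
    $xml =~ s/<!--(?:.(?<!--))*-->//gs;
    return $xml;
}

# Each case is [ input, expected output ].
my @cases = (
    [ 'foo <!-- bar --> baz <!-- qox --> blurb', 'foo  baz  blurb' ],  # two comments
    [ "a <!-- one\ntwo --> b",                   'a  b' ],             # multiline comment
    [ 'a <!----> b',                             'a  b' ],             # empty comment
    [ 'a <!--- x --> b',                         'a <!--- x --> b' ],  # leading hyphen: NOT removed
);

for my $case (@cases) {
    my ($in, $want) = @$case;
    my $got = strip_comments($in);
    printf "%-4s %s\n", ($got eq $want ? 'ok' : 'FAIL'), $in;
}
```

All four cases print ok, which for the last one means the regex really does leave that comment behind.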

Replies are listed 'Best First'.
Re^3: Removing XML comments with regex
by Fletch (Bishop) on Oct 24, 2007 at 19:39 UTC

    And then that'll fail for comments that span lines. You can tweak that by slurping the entire file in, but then someone else will find another corner case that breaks that . . .

    If you're parsing XML, use a proper parser. Unless you can guarantee a very specific input format, any attempt using solely regexen is going to have problems. It's not like there aren't 19 bazillion different off-the-shelf XML parsing solutions available out there which will handle all the nastiness for you.

    Update: And no, saying there's another corner case is not "FUD".

    <?xml version="1.0"?> <root><![CDATA[ <!-- OMGWTFBBQ --> ]]></root>

    Update: Or to make the breakage more explicit:

    <?xml version="1.0"?> <root><![CDATA[ <!-- OMGWTFBBQ ]]>Shoulda used a <!-- real parser --></root>
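For anyone who wants to watch it break, here is a sketch that runs the non-greedy substitution from this thread over both documents above. The helper name strip_comments is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: the non-greedy substitution from this thread, with /s.
sub strip_comments {
    my ($xml) = @_;
    $xml =~ s/<!--.*?-->//gs;
    return $xml;
}

# Case 1: the "comment" is literal text inside a CDATA section, not a comment.
my $doc1 = '<?xml version="1.0"?> <root><![CDATA[ <!-- OMGWTFBBQ --> ]]></root>';

# Case 2: a "<!--" inside CDATA pairs up with the "-->" of a real comment later on.
my $doc2 = '<?xml version="1.0"?> <root><![CDATA[ <!-- OMGWTFBBQ ]]>'
         . 'Shoulda used a <!-- real parser --></root>';

print strip_comments($doc1), "\n";
# <?xml version="1.0"?> <root><![CDATA[  ]]></root>
# (the CDATA text was silently changed)

print strip_comments($doc2), "\n";
# <?xml version="1.0"?> <root><![CDATA[ </root>
# (the CDATA section is never closed, so the result is no longer well-formed)
```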
Re^3: Removing XML comments with regex
by gamache (Friar) on Oct 24, 2007 at 19:37 UTC
    You're right about the minimal match on .*?, but keep /s in there or it will miss multiline comments.
    $xml =~ s/<!--.*?-->//sg;
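A quick sketch of why /s matters, using a made-up document with a comment that spans two lines. Without /s the dot does not match a newline, so the pattern never reaches the closing --> and nothing is removed.

```perl
use strict;
use warnings;

# Hypothetical input: one comment spread over two lines.
my $xml = "<a>\n<!-- first line\n     second line -->\n<b/>\n</a>";

# Without /s: "." stops at the newline, so the comment is never matched.
(my $no_s = $xml) =~ s/<!--.*?-->//g;

# With /s: "." matches newlines too, so the whole comment is removed.
(my $with_s = $xml) =~ s/<!--.*?-->//sg;

print "without /s: ", ($no_s eq $xml ? "unchanged" : "changed"), "\n";   # unchanged
print "with /s:\n$with_s\n";
```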
      Right on,
      s/<!--.*?-->//gs
      is the proper way to go, since we don't want to be greedy and treat everything from the beginning of the first comment to the end of the last comment as one giant comment.
