PerlMonks  

Removing XML comments with regex

by gasho (Beadle)
on Oct 24, 2007 at 19:07 UTC ( [id://646970]=perlquestion )

gasho has asked for the wisdom of the Perl Monks concerning the following question:

Dear Perl monks, is there a quick regex way of removing comments from an XML file? Below is a sample file. Thanks.
<?xml version="1.0" encoding="UTF-8"?>
<Node_A>
  <!-- One Line Comment -->
  <Node_B>
    <!-- Two Line Comment
    Two Line Comment-->
    <Node_C>
    </Node_C>
    <!-- One Line Comment -->
    <!-- Multi Line Comment
    Line 3Comment 1Line Comment 2Line Comment
    Line 5Comment
    Line Comment-->
  </Node_B>
</Node_A>
(: Life is short enjoy it :)

Replies are listed 'Best First'.
Re: Removing XML comments with regex
by GrandFather (Saint) on Oct 24, 2007 at 20:28 UTC

    Using a module (XML::Twig) really is the easiest way to do it:

    use strict;
    use warnings;
    use XML::Twig;

    my $xml = <<XML;
    <?xml version="1.0" encoding="UTF-8"?>
    <Node_A>
      <!-- One Line Comment -->
      <Node_B>
        <!-- Two Line Comment
        Two Line Comment-->
        <Node_C>
        </Node_C>
        <!-- One Line Comment -->
        <!-- Multi Line Comment
        Line 3Comment 1Line Comment 2Line Comment
        Line 5Comment
        Line Comment-->
      </Node_B>
    </Node_A>
    XML

    my $twig = XML::Twig->new (comments => 'drop', pretty_print => 'indented');
    $twig->parse ($xml);
    $twig->print ();

    Prints:

    <?xml version="1.0" encoding="UTF-8"?>
    <Node_A>
      <Node_B>
        <Node_C></Node_C>
      </Node_B>
    </Node_A>

    Perl is environmentally friendly - it saves trees
Re: Removing XML comments with regex
by eff_i_g (Curate) on Oct 24, 2007 at 20:53 UTC
    You can also use an identity transform in XSLT that's coupled with an exception:
    <?xml version='1.0'?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
      <xsl:output method="xml" indent="yes"/>
      <xsl:template match="@* | node()">
        <xsl:copy>
          <xsl:apply-templates select="@* | node()" />
        </xsl:copy>
      </xsl:template>
      <xsl:template match="comment()" />
    </xsl:stylesheet>
        That's actually fairly simple and idiomatic (I recognized the match-and-copy-everything bit right away, and the first part is verbiage that goes on every stylesheet in some form). It's at least no worse than when people who don't know Perl complain about Perl.
        It's powerful, very easy to customize, extremely fast, and handles huge XML documents quite well :) And besides, it isn't simply outputting "Hello World." It's recursing through various structures/nodes to duplicate an entire document, allowing fine-tuned controls with the simple touch of XPath. Hurrah!

        I'd be curious to see how it benchmarks against some of Perl's XML modules. Perhaps it's overkill for something as simple as comment removal, perhaps not.
Re: Removing XML comments with regex
by artist (Parson) on Oct 24, 2007 at 19:56 UTC
    You might want to 'add' an XML plugin for File::Comments by Michael Schilli. The module can strip all comments from a file. It currently has plugins for C, HTML, Java, JavaScript, Perl, PHP, etc.
    --Artist
Re: Removing XML comments with regex
by gasho (Beadle) on Oct 24, 2007 at 19:39 UTC
    Is this good enough?
    while (<>) {
        $line = $_;
        chomp $line;
        $StartTag = '<!--';
        $EndTag   = '-->';
        if (($line =~ m/\Q$StartTag\E/) .. ($line =~ m/\Q$EndTag\E/)) {
            $Flag = 0;
        }
        elsif (($line =~ /<!--/) || ($line =~ /-->/)) {
            $Flag = 0;
        }
        else {
            $Flag = 1;
        }
        if ($Flag) {
            print "$line\n";
        }
    }
    (: Life is short enjoy it :)
      I did the following (barring corner cases as mentioned previously) and it seems to strip comments just fine:
      cat t.xml | perl -e 'undef $/; $_ = <>; s/<!--.*?-->//gs; print;'
      which gave the output:
      <Node_A> <Node_B> <Node_C> </Node_C> </Node_B> </Node_A>
Re: Removing XML comments with regex
by gamache (Friar) on Oct 24, 2007 at 19:15 UTC
    The --> end tag is not allowed within comments, so your job is easy:
    $xml =~ s/<!--.*?-->//sg;
    (Update: Corrected to provide minimal match)
      Test this regex with foo <!-- bar --> baz <!-- qox --> blurb - it removes too much.

      You could do something like this: $xml =~ s/<!--.*?-->//g;

      Update: You can use the fact that '--' may not occur inside XML comments:

      $xml =~ s/<!--(?:.(?<!--))*-->//g;

      Sadly, lookarounds are error-prone (from the programmer's side), so don't trust this regex unless you've tested it carefully. I don't think there is a big speed gain in it (if any); I hope I find the tuits to benchmark it.
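      To make the "removes too much" point concrete, here is a small sketch that runs the greedy, minimal-match, and lookbehind variants against the test string above. The helper names (strip_greedy, strip_minimal, strip_lookbehind) are made up for illustration; the greedy pattern is assumed to be the pre-update form that the test string was aimed at.

```perl
use strict;
use warnings;

# Hypothetical helper names, for illustration only.

# Greedy: .* runs to the LAST "-->" in the string.
sub strip_greedy {
    my ($s) = @_;
    $s =~ s/<!--.*-->//sg;
    return $s;
}

# Minimal match: .*? stops at the FIRST "-->" after each "<!--".
sub strip_minimal {
    my ($s) = @_;
    $s =~ s/<!--.*?-->//sg;
    return $s;
}

# Lookbehind variant quoted in this thread: no character inside the
# comment body may complete a "--" pair.
sub strip_lookbehind {
    my ($s) = @_;
    $s =~ s/<!--(?:.(?<!--))*-->//g;
    return $s;
}

my $sample = 'foo <!-- bar --> baz <!-- qox --> blurb';

print "greedy:     '", strip_greedy($sample),     "'\n";  # 'foo  blurb'
print "minimal:    '", strip_minimal($sample),    "'\n";  # 'foo  baz  blurb'
print "lookbehind: '", strip_lookbehind($sample), "'\n";  # 'foo  baz  blurb'
```

      The greedy version swallows " baz " along with both comments. The other two leave it alone on this input, which says nothing about how they behave on other inputs.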

        And then that'll fail for comments that span lines. You can tweak that by slurping the entire file in, but then someone else will find another corner case that breaks that . . .

        If you're parsing XML, use a proper parser. Unless you can guarantee a very specific input format any attempt using solely regexen is going to have problems. It's not like there aren't 19 bazillion different off-the-shelf XML parsing solutions available out there which will handle all the nastiness for you.

        Update: And no, saying there's another corner case is not "FUD".

        <?xml version="1.0"?>
        <root><![CDATA[ <!-- OMGWTFBBQ --> ]]></root>

        Update: Or to make the breakage more explicit:

        <?xml version="1.0"?>
        <root><![CDATA[ <!-- OMGWTFBBQ ]]>Shoulda used a <!-- real parser --></root>
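        A minimal sketch of that breakage, assuming the minimal-match substitution s/<!--.*?-->//sg suggested elsewhere in this thread (the sub name strip_comments is hypothetical). The regex has no notion of CDATA, so it starts matching at the "<!--" inside the CDATA section and stops at the first "-->" it finds, which belongs to the real comment.

```perl
use strict;
use warnings;

# Hypothetical helper: the regex-only approach discussed in this thread.
sub strip_comments {
    my ($xml) = @_;
    $xml =~ s/<!--.*?-->//sg;
    return $xml;
}

my $xml = <<'XML';
<?xml version="1.0"?>
<root><![CDATA[ <!-- OMGWTFBBQ ]]>Shoulda used a <!-- real parser --></root>
XML

my $stripped = strip_comments($xml);
print $stripped;
# The second line of output is:
#   <root><![CDATA[ </root>
# The CDATA terminator "]]>" and the text after it are gone, so the
# result is no longer well-formed XML.
```

        A parser sees the first "<!--" as character data rather than as the start of a comment, which is why it does not fall into this trap.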
        You're right about the minimal match on .*?, but keep /s in there or it will miss multiline comments.
        $xml =~ s/<!--.*?-->//sg;
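        A short sketch of why /s matters, using a made-up two-line comment (the sub names are hypothetical). Without /s, "." does not match a newline, so a comment that spans lines is never matched at all.

```perl
use strict;
use warnings;

# Hypothetical helpers: the same substitution with and without /s.
sub strip_without_s {
    my ($xml) = @_;
    $xml =~ s/<!--.*?-->//g;    # "." stops at newlines
    return $xml;
}

sub strip_with_s {
    my ($xml) = @_;
    $xml =~ s/<!--.*?-->//sg;   # "." also matches newlines
    return $xml;
}

my $xml = "<a>\n<!-- two\nlines -->\n<b/>\n</a>\n";

print "without /s:\n", strip_without_s($xml);   # comment is still there
print "with /s:\n",    strip_with_s($xml);      # comment is removed
```

        Note that this only works when the whole document is in one string; reading line by line brings the multi-line problem back regardless of /s.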

Approved by Corion