Re^5: Removing XML comments with regex

by runrig (Abbot)
on Dec 28, 2007 at 20:59 UTC


in reply to Re^4: Removing XML comments with regex
in thread Removing XML comments with regex

As a very rough benchmark, I created a ~20MB xml file (I took the doc in the OP, and copied the middle part over and over). I filtered the doc using XSLT (using XML::LibXSLT, and the XSLT in the node above), and XML::Twig (the solution elsewhere in this thread). The XSLT took about 3 seconds, the XML::Twig took about 20. Both used vast quantities of memory.

And though I'm not familiar with XML::Parser::Expat, I hacked together something which seems to work (though I am likely missing something for some types of XML content), and ran in about 3 seconds without using much memory at all.

Update: I repeated the XSLT run with a 40MB file, and it took just a few more seconds. I wonder if Jenda is using XML::XSLT or XML::LibXSLT below. (An 80MB file took ~25 secs w/XSLT and ~20 w/expat, using XML::LibXSLT 1.62 and XML::LibXML 1.63.) Here is what I used:
xslt.pl:

#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXSLT;

my $parser = XML::LibXML->new();
my $xslt   = XML::LibXSLT->new();

my $source    = $parser->parse_file('tmp.xml');
my $style_doc = $parser->parse_string(<<EOT);
<?xml version='1.0'?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:strip-space elements="*"/>
<xsl:output method="xml" indent="yes"/>
<xsl:template match="@* | node()">
  <xsl:copy>
    <xsl:apply-templates select="@* | node()" />
  </xsl:copy>
</xsl:template>
<xsl:template match="comment()" />
</xsl:stylesheet>
EOT

my $stylesheet = $xslt->parse_stylesheet($style_doc);
my $results    = $stylesheet->transform($source);
print $stylesheet->output_string($results);

----------------------------------------------
twig.pl:

#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new(comments => 'drop', pretty_print => 'indented');
$twig->parsefile("tmp.xml");
$twig->print();

----------------------------------
expat.pl:

#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser::Expat;

my $parser = new XML::Parser::Expat;
$parser->setHandlers('Start' => \&sh, 'End' => \&eh, 'Char' => \&ch);
$parser->parsefile('tmp.xml');

sub sh {
    my ($p, $e) = @_;
    print $p->recognized_string();
}

sub eh {
    my ($p, $e) = @_;
    print $p->recognized_string();
}

sub ch {
    my ($p, $s) = @_;
    print $s;
}
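One of the things expat.pl above is probably missing: the Char handler receives text that has already been decoded, so an `&amp;` in the input would be printed as a bare `&`. Below is a minimal sketch of one way around that, assuming it is acceptable to re-escape `&`, `<` and `>` in all character data (CDATA sections would come out as escaped text). The helpers `escape_text` and `strip_comments` are made-up names for illustration, not part of any module, and the sketch still ignores the XML declaration, processing instructions and the DOCTYPE.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser::Expat;

# Re-escape the characters that must not appear raw in XML text.
sub escape_text {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Hypothetical helper: parse an XML string, return it without comments.
# Comments are dropped simply because no Comment handler is registered.
sub strip_comments {
    my ($xml) = @_;
    my $out    = '';
    my $parser = XML::Parser::Expat->new;
    $parser->setHandlers(
        Start => sub { $out .= $_[0]->recognized_string },
        End   => sub { $out .= $_[0]->recognized_string },
        Char  => sub { $out .= escape_text($_[1]) },
    );
    $parser->parse($xml);
    $parser->release;    # break the parser's circular references
    return $out;
}

print strip_comments('<doc><!-- gone -->fish &amp; chips</doc>'), "\n";
```

Collecting into a string is only for the sake of a small demo; for the large files discussed in this thread you would print from the handlers, as expat.pl does.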

Re^6: Removing XML comments with regex
by Jenda (Abbot) on Dec 28, 2007 at 22:02 UTC

    Looks like you left off exactly before XSLT started to lose its breath. I tried the same with a 30MB XML file and it's still busy after five minutes. expat took 9s, or 8.5s if I do not copy stuff from @_.

    I'm not surprised XML::Twig was doing poorly in this test. This is exactly the wrong task for the module. It does have to parse the whole XML into a maze of objects and then is forced to stringify it again immediately.

    I'll post the results here after XSLT finishes, and I will run the tests for a few different sizes of the XML.

    Update: XSLT just finished after more than 12 minutes.

    Update 2: here are the results, looks like XSLT is doing worse than expected at least on my computer:
    (times in seconds)

    script     1MB       2MB       3MB       4MB       5MB       10MB       15MB        20MB        30MB        30MB/30
    expat.pl   0.30489   0.613526  0.922916  1.4342    1.802648  3.046866   4.878799    6.116883    9.050448    0.3016816
    expat2.pl  0.28943   0.5687    0.857198  1.129559  1.425147  2.819807   4.237721    5.88283     8.505741    0.2835247
    twig.pl    1.213592  2.406564  3.598347  4.79183   6.051234  12.554845  19.845587   26.74665    crash       1.3373325(*)
    twig2.pl   1.10568   2.213843  3.307619  4.474567  5.619435  11.919388  18.024169   24.710547   crash       1.23552735(*)
    xslt.pl    0.255876  0.80157   1.724701  2.540583  5.730051  22.825532  184.643654  248.187591  726.690777  24.2230259

    expat2.pl is the same as your expat.pl except that it uses @_ directly; twig2.pl is like twig.pl, but omits the pretty_print => 'indented'. Maybe you have more memory. I have XML::Parser::Expat 2.34, XML::Twig 3.19, XML::LibXML 1.56 and XML::LibXSLT 1.53, on Windows Vista Home Pro with 3GB of memory (I have to admit I did not close all apps).
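    expat2.pl itself was not posted, so the following is only a guess at what "uses @_ directly" looks like for a handler. The names `ch_copy` and `ch_direct` are made up for illustration, and the handlers append to a string instead of printing so the two styles can be compared without a parser.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $out = '';

# Style used in expat.pl: copy the arguments into lexicals first.
sub ch_copy {
    my ($p, $s) = @_;
    $out .= $s;
}

# Guess at the expat2.pl style: read the argument straight from @_,
# skipping the copy into lexical variables.
sub ch_direct {
    $out .= $_[1];
}

# Stand-in for the parser object; neither handler touches it.
my $fake_parser = undef;

ch_copy($fake_parser, 'some ');
ch_direct($fake_parser, 'text');
print "$out\n";
```

    Both handlers do the same work; the second just avoids one list assignment per call, which adds up when the handler fires for every chunk of text in a 30MB file.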

    (*) As it crashed on me for the 30MB XML, this is the time per 1MB for the 20MB XML.
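    For anyone checking the last column, here is a small sketch of the per-MB arithmetic using a few of the timings copied from the table above. The helper name `per_mb` is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Seconds per MB for one run.
sub per_mb {
    my ($mb, $secs) = @_;
    return $secs / $mb;
}

# [size in MB, seconds], copied from the table above.
my %runs = (
    'expat.pl' => [ [ 1, 0.30489 ],  [ 10, 3.046866 ],  [ 30, 9.050448 ] ],
    'twig.pl'  => [ [ 1, 1.213592 ], [ 10, 12.554845 ], [ 20, 26.74665 ] ],
    'xslt.pl'  => [ [ 1, 0.255876 ], [ 10, 22.825532 ], [ 30, 726.690777 ] ],
);

for my $script (sort keys %runs) {
    for my $run (@{ $runs{$script} }) {
        printf "%-9s %3dMB %10.4f s/MB\n", $script, $run->[0], per_mb(@$run);
    }
}
```

    The expat and twig figures stay roughly flat per MB, while the XSLT figure climbs steeply with file size on this machine, which is the "worse than expected" behaviour mentioned above.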

Re^6: Removing XML comments with regex
by Anonymous Monk on Dec 28, 2007 at 21:02 UTC
    So I am hearing that XML::Twig is a performance pig.
      As Jenda mentions below, XML::Twig is the wrong solution for this problem (update: when exceptional performance is an issue -- or if you just mean memory consumption, that was my own fault, easily corrected by mirod below). I pretty much knew as much before I started, but it was an easily available solution from this thread, so I tried it.

        Maybe you and Jenda should avoid benchmarking tools that you don't really master.

        Your first attempt was perfectly valid as a solution when performance is not an issue. But if it becomes one, then loading the entire document in memory when XML::Twig is specifically designed to avoid this, is kinda lame don't you think?

        The code below is probably not faster than what you have, but at least it should not use too much memory.

        #!/usr/bin/perl
        use strict;
        use warnings;
        use XML::Twig;

        my $t = XML::Twig->new(
            keep_spaces   => 1,
            comments      => 'drop',
            twig_handlers => { _all_ => sub { $_[0]->flush } },
        )->parsefile("test_comments.xml");

        I would rather say wrong tool than wrong solution. Which (as mirod's response shows) doesn't mean it's not possible to use XML::Twig effectively, but rather that it was designed for a different kind of task. Which definitely doesn't mean the module itself is bad. Far from that, and sorry if my comment sounded that way.
