in reply to Re^5: Removing XML comments with regex
in thread Removing XML comments with regex

Looks like you stopped exactly before XSLT started to lose its breath. I tried the same with a 30MB XML and it's still busy after five minutes. expat took 9s, or 8.5s if I do not copy stuff from @_.

I'm not surprised XML::Twig did poorly in this test. This is exactly the wrong task for the module: it has to parse the whole XML into a maze of objects and is then forced to stringify it again immediately.

I'll post the results here after XSLT finishes, and I'll run the test for a few different sizes of the XML.

Update: XSLT just finished after more than 12 minutes.

Update 2: here are the results (times in seconds). XSLT is doing worse than expected, at least on my computer:
            1MB       2MB       3MB       4MB       5MB       10MB       15MB        20MB        30MB        30MB/30
expat.pl    0.30489   0.613526  0.922916  1.4342    1.802648  3.046866   4.878799    6.116883    9.050448    0.3016816
expat2.pl   0.28943   0.5687    0.857198  1.129559  1.425147  2.819807   4.237721    5.88283     8.505741    0.2835247
twig.pl     1.213592  2.406564  3.598347  4.79183   6.051234  12.554845  19.845587   26.74665    crash       1.3373325(*)
twig2.pl    1.10568   2.213843  3.307619  4.474567  5.619435  11.919388  18.024169   24.710547   crash       1.23552735(*)
xslt.pl     0.255876  0.80157   1.724701  2.540583  5.730051  22.825532  184.643654  248.187591  726.690777  24.2230259

expat2.pl is the same as your expat.pl except that it uses @_ directly; twig2.pl is like twig.pl, but omits the pretty_print => 'indented'. Maybe you have more memory. I have XML::Parser::Expat 2.34, XML::Twig 3.19, XML::LibXML 1.56 and XML::LibXSLT 1.53, on Windows Vista Home Pro with 3GB of memory (I have to admit I did not close all apps).
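To look at the @_ difference on its own, outside of any XML parsing, something like the following rough sketch could be used. The subs ch_copy and ch_direct are hypothetical stand-ins for the Char handlers of expat.pl and expat2.pl (they return the string instead of printing it), and Benchmark is the core module. I make no promise about the numbers you will see; they depend on the Perl build and the machine.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw( cmpthese );

# Hypothetical stand-ins for the Char handlers in expat.pl and expat2.pl.
# They return the string instead of printing it, so the output stays quiet.
sub ch_copy   { my ($p, $s) = @_; return $s; }   # copies both arguments first
sub ch_direct { return $_[1]; }                  # reads the argument in place

my $text = 'content' x 10;

# Run each variant for at least one CPU second and compare the call rates.
cmpthese(-1, {
    copy   => sub { ch_copy(undef, $text) },
    direct => sub { ch_direct(undef, $text) },
});
```

Both subs return the same value; only the argument handling differs, which is the one change between expat.pl and expat2.pl.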

(*) As it crashed on me for the 30MB XML, this is the time per 1MB for the 20MB XML.
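The last column only gives the per-MB time at the largest size. To see how the per-MB cost changes across sizes, a small sketch like this can help. It uses the xslt.pl timings copied from the table; the helper name per_mb is made up for the illustration. A roughly flat series would mean linear scaling.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# File sizes in MB and the xslt.pl timings in seconds, copied from the table.
my @size_mb = (1, 2, 3, 4, 5, 10, 15, 20, 30);
my @xslt_s  = (0.255876, 0.80157, 1.724701, 2.540583, 5.730051,
               22.825532, 184.643654, 248.187591, 726.690777);

# Seconds per MB for each measurement (hypothetical helper).
sub per_mb {
    my ($sizes, $times) = @_;
    return map { $times->[$_] / $sizes->[$_] } 0 .. $#$sizes;
}

my @rate = per_mb(\@size_mb, \@xslt_s);
printf "%4dMB  %10.4f s/MB\n", $size_mb[$_], $rate[$_] for 0 .. $#rate;
```

Swapping in the expat.pl or twig.pl row gives the same view for the other scripts.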

#expat.pl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw( gettimeofday tv_interval );
my $started = [gettimeofday];
END {print STDERR tv_interval($started),"\n";}

use XML::Parser::Expat;

my $parser = new XML::Parser::Expat;
$parser->setHandlers('Start' => \&sh,
                     'End'   => \&eh,
                     'Char'  => \&ch);
$parser->parsefile('tmp.xml');

sub sh {
    my ($p, $e) = @_;
    print $p->recognized_string();
}
sub eh {
    my ($p, $e) = @_;
    print $p->recognized_string();
}
sub ch {
    my ($p, $s) = @_;
    print $s;
}
__END__

#expat2.pl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw( gettimeofday tv_interval );
my $started = [gettimeofday];
END {print STDERR tv_interval($started),"\n";}

use XML::Parser::Expat;

my $parser = new XML::Parser::Expat;
$parser->setHandlers('Start' => \&sh,
                     'End'   => \&eh,
                     'Char'  => \&ch);
$parser->parsefile('tmp.xml');

sub sh {
    print $_[0]->recognized_string();
}
sub eh {
    print $_[0]->recognized_string();
}
sub ch {
    print $_[1];
}
__END__

#twig.pl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw( gettimeofday tv_interval );
my $started = [gettimeofday];
END {print STDERR tv_interval($started),"\n";}

use XML::Twig;

my $twig = XML::Twig->new (comments => 'drop', pretty_print => 'indented');
$twig->parsefile("tmp.xml");
$twig->print();
__END__

#twig2.pl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw( gettimeofday tv_interval );
my $started = [gettimeofday];
END {print STDERR tv_interval($started),"\n";}

use XML::Twig;

my $twig = XML::Twig->new (comments => 'drop');
$twig->parsefile("tmp.xml");
$twig->print();
__END__

#xslt.pl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw( gettimeofday tv_interval );
my $started = [gettimeofday];
END {print STDERR tv_interval($started),"\n";}

use XML::LibXML;
use XML::LibXSLT;

my $parser = XML::LibXML->new();
my $xslt = XML::LibXSLT->new();

my $source = $parser->parse_file('tmp.xml');
my $style_doc = $parser->parse_string(<<EOT);
<?xml version='1.0'?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:strip-space elements="*"/>
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()" />
    </xsl:copy>
  </xsl:template>
  <xsl:template match="comment()" />
</xsl:stylesheet>
EOT

my $stylesheet = $xslt->parse_stylesheet($style_doc);
my $results = $stylesheet->transform($source);
print $stylesheet->output_string($results);
__END__

#make_tmp.pl
my $cnt = eval(shift());

print <<'*END*';
<?xml version="1.0" encoding="UTF-8"?>
<Node_A>
*END*

while ($cnt--) {
print <<'*END*';
<!-- One Line Comment -->
<Node_B>content
<!-- Two Line Comment
Two Line Comment-->
<Node_C>
</Node_C>
<!-- One Line Comment -->
<!-- Multi Line Comment Line 3
Comment 1
Line Comment 2
Line Comment Line 5
Comment Line Comment-->
</Node_B>
*END*
}

print <<'*END*';
</Node_A>
*END*
__END__

make_tmp.pl 3459 creates a 1MB file.