Looks like you left off exactly before XSLT started to lose its breath. I tried the same with a 30MB XML file and XSLT was still busy after five minutes. expat took 9s, or 8.5s if I do not copy stuff from @_.
I'm not surprised XML::Twig did poorly in this test. This is exactly the wrong task for the module: it has to parse the whole XML into a maze of objects and is then forced to stringify it again immediately.
I'll post the results here after XSLT finishes, and I will also run the test for a few different sizes of the XML.
Update: XSLT just finished after more than 12 minutes.
Update 2: here are the results (times in seconds). It looks like XSLT is doing worse than expected, at least on my computer:
| | 1MB | 2MB | 3MB | 4MB | 5MB | 10MB | 15MB | 20MB | 30MB | 30MB/30 |
|---|---|---|---|---|---|---|---|---|---|---|
| expat.pl | 0.30489 | 0.613526 | 0.922916 | 1.4342 | 1.802648 | 3.046866 | 4.878799 | 6.116883 | 9.050448 | 0.3016816 |
| expat2.pl | 0.28943 | 0.5687 | 0.857198 | 1.129559 | 1.425147 | 2.819807 | 4.237721 | 5.88283 | 8.505741 | 0.2835247 |
| twig.pl | 1.213592 | 2.406564 | 3.598347 | 4.79183 | 6.051234 | 12.554845 | 19.845587 | 26.74665 | crash | 1.3373325(*) |
| twig2.pl | 1.10568 | 2.213843 | 3.307619 | 4.474567 | 5.619435 | 11.919388 | 18.024169 | 24.710547 | crash | 1.23552735(*) |
| xslt.pl | 0.255876 | 0.80157 | 1.724701 | 2.540583 | 5.730051 | 22.825532 | 184.643654 | 248.187591 | 726.690777 | 24.2230259 |
expat2.pl is the same as your expat.pl except that it uses @_ directly; twig2.pl is like twig.pl but omits the pretty_print => 'indented'. Maybe you have more memory. I have XML::Parser::Expat 2.34, XML::Twig 3.19, XML::LibXML 1.56 and XML::LibXSLT 1.53, on Windows Vista Home Pro with 3GB of memory (I have to admit I did not close all apps).
(*) As twig.pl and twig2.pl crashed on me for the 30MB XML, this is the time per 1MB for the 20MB XML.
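To make the scaling easier to see, here is a small sketch that divides each timing by the file size. It only re-uses the expat.pl and xslt.pl numbers from the table; the helper name `per_mb` is made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# File sizes in MB and wall-clock seconds, copied from the table.
my @sizes = (1, 2, 3, 4, 5, 10, 15, 20, 30);
my %secs  = (
    'expat.pl' => [0.30489, 0.613526, 0.922916, 1.4342, 1.802648,
                   3.046866, 4.878799, 6.116883, 9.050448],
    'xslt.pl'  => [0.255876, 0.80157, 1.724701, 2.540583, 5.730051,
                   22.825532, 184.643654, 248.187591, 726.690777],
);

# Seconds per MB for one script at each measured size (hypothetical helper).
sub per_mb {
    my ($name) = @_;
    my $times = $secs{$name};
    return map { $times->[$_] / $sizes[$_] } 0 .. $#sizes;
}

# One row per script, one column per file size.
for my $name (sort keys %secs) {
    printf "%-9s %s\n", $name,
        join ' ', map { sprintf '%7.3f', $_ } per_mb($name);
}
```

For expat.pl the per-MB cost stays around 0.3s across all sizes, which is what linear scaling looks like. For xslt.pl it climbs from about 0.26s at 1MB to about 24.2s at 30MB, so on this machine the cost per MB grows with the file size instead of staying flat.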
```perl
#expat.pl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw( gettimeofday tv_interval );
my $started = [gettimeofday];
END {print STDERR tv_interval($started),"\n";}

use XML::Parser::Expat;

my $parser = new XML::Parser::Expat;
$parser->setHandlers('Start' => \&sh,
                     'End'   => \&eh,
                     'Char'  => \&ch);
$parser->parsefile('tmp.xml');

sub sh {
    my ($p, $e) = @_;
    print $p->recognized_string();
}

sub eh {
    my ($p, $e) = @_;
    print $p->recognized_string();
}

sub ch {
    my ($p, $s) = @_;
    print $s;
}
__END__
```

```perl
#expat2.pl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw( gettimeofday tv_interval );
my $started = [gettimeofday];
END {print STDERR tv_interval($started),"\n";}

use XML::Parser::Expat;

my $parser = new XML::Parser::Expat;
$parser->setHandlers('Start' => \&sh,
                     'End'   => \&eh,
                     'Char'  => \&ch);
$parser->parsefile('tmp.xml');

sub sh {
    print $_[0]->recognized_string();
}

sub eh {
    print $_[0]->recognized_string();
}

sub ch {
    print $_[1];
}
__END__
```

```perl
#twig.pl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw( gettimeofday tv_interval );
my $started = [gettimeofday];
END {print STDERR tv_interval($started),"\n";}

use XML::Twig;

my $twig = XML::Twig->new (comments => 'drop', pretty_print => 'indented');
$twig->parsefile("tmp.xml");
$twig->print();
__END__
```

```perl
#twig2.pl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw( gettimeofday tv_interval );
my $started = [gettimeofday];
END {print STDERR tv_interval($started),"\n";}

use XML::Twig;

my $twig = XML::Twig->new (comments => 'drop');
$twig->parsefile("tmp.xml");
$twig->print();
__END__
```

```perl
#xslt.pl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw( gettimeofday tv_interval );
my $started = [gettimeofday];
END {print STDERR tv_interval($started),"\n";}

use XML::LibXML;
use XML::LibXSLT;

my $parser = XML::LibXML->new();
my $xslt = XML::LibXSLT->new();

my $source = $parser->parse_file('tmp.xml');
my $style_doc = $parser->parse_string(<<EOT);
<?xml version='1.0'?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:strip-space elements="*"/>
  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()" />
    </xsl:copy>
  </xsl:template>

  <xsl:template match="comment()" />
</xsl:stylesheet>
EOT

my $stylesheet = $xslt->parse_stylesheet($style_doc);
my $results = $stylesheet->transform($source);

print $stylesheet->output_string($results);
__END__
```

```perl
#make_tmp.pl
my $cnt = eval(shift());

print <<'*END*';
<?xml version="1.0" encoding="UTF-8"?>
<Node_A>
*END*

while ($cnt--) {
    print <<'*END*';
<!-- One Line Comment -->
<Node_B>content
<!-- Two Line Comment
Two Line Comment-->
<Node_C>
</Node_C>
<!-- One Line Comment -->
<!-- Multi Line Comment Line 3
Comment 1
Line Comment 2
Line Comment Line 5
Comment Line Comment-->
</Node_B>
*END*
}

print <<'*END*';
</Node_A>
*END*
__END__
```
make_tmp.pl 3459 creates a 1MB file.
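The only difference between expat.pl and expat2.pl is whether the handlers copy their arguments out of @_ first. A minimal sketch to isolate that cost without any XML parsing, using the core Benchmark module; the subs `copy_args` and `direct_args` and the dummy arguments are made up for this illustration and stand in for the real handlers:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Stand-ins for the two handler styles; neither touches an XML parser.
sub copy_args   { my ($p, $s) = @_; return $s; }   # style used in expat.pl
sub direct_args { return $_[1]; }                  # style used in expat2.pl

# Dummy arguments in place of the parser object and the character data.
my @args = ('parser placeholder', 'some character data');

# A negative count tells Benchmark to run each variant for at least
# that many CPU seconds, then print a comparison table.
cmpthese(-1, {
    copy   => sub { copy_args(@args) },
    direct => sub { direct_args(@args) },
});
```

The numbers will differ from machine to machine. In the table above the difference came to roughly half a second over the 30MB file, so it matters only when a handler is called a very large number of times.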
In reply to Re^6: Removing XML comments with regex
by Jenda
in thread Removing XML comments with regex
by gasho