Looks like you left off exactly before XSLT started to lose its breath. I tried the same with a 30MB XML file and it's still busy after five minutes. expat took 9s, or 8.5s if I do not copy stuff from @_.

I'm not surprised XML::Twig did poorly in this test. This is exactly the wrong task for the module: it has to parse the whole XML into a maze of objects and is then forced to stringify it again immediately.

I'll post the results here after XSLT finishes, and I'll run the tests for a few different sizes of the XML.

Update: XSLT just finished after more than 12 minutes.

Update 2: here are the results (times in seconds). It looks like XSLT is doing worse than expected, at least on my computer:
script           1MB       2MB       3MB       4MB       5MB       10MB        15MB        20MB        30MB        30MB/30
expat.pl     0.30489  0.613526  0.922916    1.4342  1.802648   3.046866    4.878799    6.116883    9.050448      0.3016816
expat2.pl    0.28943    0.5687  0.857198  1.129559  1.425147   2.819807    4.237721     5.88283    8.505741      0.2835247
twig.pl     1.213592  2.406564  3.598347   4.79183  6.051234  12.554845   19.845587    26.74665       crash   1.3373325(*)
twig2.pl     1.10568  2.213843  3.307619  4.474567  5.619435  11.919388   18.024169   24.710547       crash  1.23552735(*)
xslt.pl     0.255876   0.80157  1.724701  2.540583  5.730051  22.825532  184.643654  248.187591  726.690777     24.2230259

expat2.pl is the same as your expat.pl except that it uses @_ directly; twig2.pl is like twig.pl but omits the pretty_print => 'indented'. Maybe you have more memory. I have XML::Parser::Expat 2.34, XML::Twig 3.19, XML::LibXML 1.56 and XML::LibXSLT 1.53, on Windows Vista Home Pro with 3GB of memory (I have to admit I did not close all apps).
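To look at the @_ difference in isolation, here is a minimal sketch using the core Benchmark module. The subs ch_copy and ch_direct are hypothetical stand-ins for the two styles of 'Char' handler (they return a length instead of printing, so the output does not drown the timing), and the numbers it prints will of course depend on the machine and the perl build.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw( cmpthese );

# Hypothetical stand-ins for the 'Char' handlers: same job, different argument access.
sub ch_copy   { my ($p, $s) = @_; return length $s; }   # copies @_ into lexicals, as expat.pl does
sub ch_direct { return length $_[1]; }                  # reads @_ in place, as expat2.pl does

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    copy   => sub { ch_copy(undef, 'content') },
    direct => sub { ch_direct(undef, 'content') },
});
```

This only measures the call overhead, not the parser, so treat it as a rough indication of where the half second in the 30MB run may come from.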

(*) As it crashed on me for the 30MB XML, this is the time per 1MB for the 20MB XML.
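The last column is just the total time divided by the size in MB. A small sketch of that arithmetic, using a few of the numbers from the table above (the helper name per_mb is mine, not part of any of the scripts), makes it easier to see that expat stays flat per MB while XSLT does not:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Wall-clock seconds copied from the results table, keyed by input size in MB.
my %seconds = (
    'expat.pl' => { 1 => 0.30489,  10 => 3.046866,  30 => 9.050448 },
    'xslt.pl'  => { 1 => 0.255876, 10 => 22.825532, 30 => 726.690777 },
);

# Seconds spent per megabyte of input (hypothetical helper).
sub per_mb {
    my ($secs, $mb) = @_;
    return $secs / $mb;
}

for my $script (sort keys %seconds) {
    for my $mb (sort { $a <=> $b } keys %{ $seconds{$script} }) {
        printf "%-9s %2dMB  %10.4f s/MB\n",
            $script, $mb, per_mb($seconds{$script}{$mb}, $mb);
    }
}
```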

#expat.pl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw( gettimeofday tv_interval );
my $started = [gettimeofday];
END {print STDERR tv_interval($started),"\n";}

use XML::Parser::Expat;
my $parser = new XML::Parser::Expat;
$parser->setHandlers('Start' => \&sh,
                     'End'   => \&eh,
                     'Char'  => \&ch);
$parser->parsefile('tmp.xml');

sub sh {
  my ($p, $e) = @_;
  print $p->recognized_string();
}
sub eh {
  my ($p, $e) = @_;
  print $p->recognized_string();
}
sub ch {
  my ($p, $s) = @_;
  print $s;
}
__END__

#expat2.pl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw( gettimeofday tv_interval );
my $started = [gettimeofday];
END {print STDERR tv_interval($started),"\n";}

use XML::Parser::Expat;
my $parser = new XML::Parser::Expat;
$parser->setHandlers('Start' => \&sh,
                     'End'   => \&eh,
                     'Char'  => \&ch);
$parser->parsefile('tmp.xml');

sub sh {
  print $_[0]->recognized_string();
}
sub eh {
  print $_[0]->recognized_string();
}
sub ch {
  print $_[1];
}
__END__

#twig.pl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw( gettimeofday tv_interval );
my $started = [gettimeofday];
END {print STDERR tv_interval($started),"\n";}

use XML::Twig;
my $twig = XML::Twig->new (comments => 'drop', pretty_print => 'indented');
$twig->parsefile("tmp.xml");
$twig->print();
__END__

#twig2.pl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw( gettimeofday tv_interval );
my $started = [gettimeofday];
END {print STDERR tv_interval($started),"\n";}

use XML::Twig;
my $twig = XML::Twig->new (comments => 'drop');
$twig->parsefile("tmp.xml");
$twig->print();
__END__

#xslt.pl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw( gettimeofday tv_interval );
my $started = [gettimeofday];
END {print STDERR tv_interval($started),"\n";}

use XML::LibXML;
use XML::LibXSLT;
my $parser = XML::LibXML->new();
my $xslt = XML::LibXSLT->new();
my $source = $parser->parse_file('tmp.xml');
my $style_doc = $parser->parse_string(<<EOT);
<?xml version='1.0'?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:strip-space elements="*"/>
<xsl:output method="xml" indent="yes"/>
<xsl:template match="@* | node()">
<xsl:copy>
<xsl:apply-templates select="@* | node()" />
</xsl:copy>
</xsl:template>
<xsl:template match="comment()" />
</xsl:stylesheet>
EOT
my $stylesheet = $xslt->parse_stylesheet($style_doc);
my $results = $stylesheet->transform($source);
print $stylesheet->output_string($results);
__END__

#make_tmp.pl
my $cnt = eval(shift());
print <<'*END*';
<?xml version="1.0" encoding="UTF-8"?>
<Node_A>
*END*
while ($cnt--) {
print <<'*END*';
<!-- One Line Comment -->
<Node_B>content
<!-- Two Line Comment
Two Line Comment-->
<Node_C> </Node_C>
<!-- One Line Comment -->
<!-- Multi Line Comment
Line 3Comment 1
Line Comment 2
Line Comment
Line 5Comment
Line Comment-->
</Node_B>
*END*
}
print <<'*END*';
</Node_A>
*END*
__END__

make_tmp.pl 3459 creates a 1MB file.
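For the other sizes the count scales linearly from that figure. A minimal sketch that prints the command line for each size in the table; the helper name iterations_for and the redirect into tmp.xml are my assumptions (make_tmp.pl prints to STDOUT and the test scripts read tmp.xml):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# From the post: 3459 iterations of the loop in make_tmp.pl give about 1MB.
my $ITERATIONS_PER_MB = 3459;

# Loop count needed for a file of roughly $mb megabytes (hypothetical helper).
sub iterations_for {
    my ($mb) = @_;
    return $ITERATIONS_PER_MB * $mb;
}

# Print one command per tested size; redirecting into tmp.xml is an assumption.
for my $mb (1, 2, 3, 4, 5, 10, 15, 20, 30) {
    printf "perl make_tmp.pl %d > tmp.xml   (about %dMB)\n", iterations_for($mb), $mb;
}
```

Since make_tmp.pl runs eval on its argument, passing an expression such as 3459*30 should work as well, subject to shell quoting.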


In reply to Re^6: Removing XML comments with regex by Jenda
in thread Removing XML comments with regex by gasho
