Looks like you stopped exactly before XSLT started to run out of breath. I tried the same with a 30MB XML and it's still busy after five minutes. expat took 9s, or 8.5s if I do not copy the arguments out of @_.
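To see the cost of copying arguments out of @_ in isolation, here is a minimal sketch using the core Benchmark module. The subs ch_copy and ch_direct are hypothetical stand-ins for a Char handler, not the handlers from the scripts below, and the absolute numbers will differ from machine to machine.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw( cmpthese );

# Hypothetical stand-ins for a Char handler: same result, different argument access.
sub ch_copy   { my ($p, $s) = @_; return length $s }   # copies both arguments into lexicals
sub ch_direct { return length $_[1] }                  # reads the aliased argument in place

my $chunk = 'content' x 10;

# Run each variant for about one CPU second and print a comparison table.
cmpthese(-1, {
    copy   => sub { ch_copy(undef, $chunk) },
    direct => sub { ch_direct(undef, $chunk) },
});
```

The gain per call is tiny, but the handlers are called once for every tag and every text chunk, so it adds up on a 30MB file.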

I'm not surprised XML::Twig did poorly in this test. This is exactly the wrong task for the module: it has to parse the whole XML into a maze of objects and is then forced to stringify it again immediately.
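For comparison, XML::Twig can also print and discard each chunk as soon as it has been parsed, instead of holding the whole tree. The sketch below assumes the repeated element is Node_B (as in the make_tmp.pl output) and takes the file name from the command line; strip_comments is a name made up for this example, and this variant is not part of the timings in this post.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# Print the document without comments, flushing after every <Node_B>
# so that only one chunk of the tree is kept in memory at a time.
sub strip_comments {
    my ($file) = @_;
    my $twig = XML::Twig->new(
        comments      => 'drop',
        twig_handlers => {
            Node_B => sub {
                my ($t, $elt) = @_;
                $t->flush;    # print everything parsed so far and free it
            },
        },
    );
    $twig->parsefile($file);
    $twig->flush;             # print what is left, including the closing root tag
    return;
}

strip_comments($ARGV[0]) if @ARGV;
```

Run it as perl twig_flush.pl tmp.xml > out.xml (the script name is arbitrary).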

I'll post the results here after XSLT finishes, and I will run the tests for a few different sizes of the XML.

Update: XSLT just finished after more than 12 minutes.

Update 2: Here are the results. It looks like XSLT is doing worse than expected, at least on my computer:

Script (seconds)	1MB	2MB	3MB	4MB	5MB	10MB	15MB	20MB	30MB	30MB/30
expat.pl	0.30489	0.613526	0.922916	1.4342	1.802648	3.046866	4.878799	6.116883	9.050448	0.3016816
expat2.pl	0.28943	0.5687	0.857198	1.129559	1.425147	2.819807	4.237721	5.88283	8.505741	0.2835247
twig.pl	1.213592	2.406564	3.598347	4.79183	6.051234	12.554845	19.845587	26.74665	crash	1.3373325(*)
twig2.pl	1.10568	2.213843	3.307619	4.474567	5.619435	11.919388	18.024169	24.710547	crash	1.23552735(*)
xslt.pl	0.255876	0.80157	1.724701	2.540583	5.730051	22.825532	184.643654	248.187591	726.690777	24.2230259

expat2.pl is the same as your expat.pl except that it uses @_ directly. twig2.pl is like twig.pl but without the pretty_print => 'indented'. Maybe you have more memory. I have XML::Parser::Expat 2.34, XML::Twig 3.19, XML::LibXML 1.56 and XML::LibXSLT 1.53, on Windows Vista Home Pro with 3GB of memory (I have to admit I did not close all apps).

(*) As twig.pl and twig2.pl crashed on the 30MB XML, this is the time per 1MB measured on the 20MB XML.
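To make the scaling easier to see, the timings can be divided by the file size. The sketch below does that for expat.pl and xslt.pl, using the numbers from the table above; per_mb is a helper name made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# File sizes in MB and the measured times in seconds, copied from the table.
my @sizes = (1, 2, 3, 4, 5, 10, 15, 20, 30);
my %secs  = (
    'expat.pl' => [0.30489, 0.613526, 0.922916, 1.4342, 1.802648,
                   3.046866, 4.878799, 6.116883, 9.050448],
    'xslt.pl'  => [0.255876, 0.80157, 1.724701, 2.540583, 5.730051,
                   22.825532, 184.643654, 248.187591, 726.690777],
);

# Seconds per MB for each file size.
sub per_mb {
    my ($times) = @_;
    return map { $times->[$_] / $sizes[$_] } 0 .. $#sizes;
}

for my $script (sort keys %secs) {
    printf "%-10s %s\n", $script,
        join ' ', map { sprintf '%8.3f', $_ } per_mb($secs{$script});
}
```

expat.pl stays between roughly 0.30 and 0.36 seconds per MB over the whole range, which is linear scaling. xslt.pl goes from about 0.26 seconds per MB on the 1MB file to about 24.2 on the 30MB file.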

#expat.pl
#!/usr/bin/perl

use strict;
use warnings;

use Time::HiRes qw( gettimeofday tv_interval );
my $started = [gettimeofday];
END {print STDERR tv_interval($started),"\n";}

use XML::Parser::Expat;

# No Comment handler is registered, so comments are silently dropped.
my $parser = XML::Parser::Expat->new;
$parser->setHandlers('Start' => \&sh,
                     'End'   => \&eh,
                     'Char'  => \&ch);
$parser->parsefile('tmp.xml');
sub sh {
  my ($p, $e) = @_;
  print $p->recognized_string();
}

sub eh {
  my ($p, $e) = @_;
  print $p->recognized_string();
}

sub ch {
  my ($p, $s) = @_;
  print $s;
}
__END__

#expat2.pl
#!/usr/bin/perl

use strict;
use warnings;

use Time::HiRes qw( gettimeofday tv_interval );
my $started = [gettimeofday];
END {print STDERR tv_interval($started),"\n";}

use XML::Parser::Expat;

# No Comment handler is registered, so comments are silently dropped.
my $parser = XML::Parser::Expat->new;
$parser->setHandlers('Start' => \&sh,
                     'End'   => \&eh,
                     'Char'  => \&ch);
$parser->parsefile('tmp.xml');
sub sh {
  print $_[0]->recognized_string();
}

sub eh {
  print $_[0]->recognized_string();
}

sub ch {
  print $_[1];
}
__END__

#twig.pl
#!/usr/bin/perl

use strict;
use warnings;

use Time::HiRes qw( gettimeofday tv_interval );
my $started = [gettimeofday];
END {print STDERR tv_interval($started),"\n";}

use XML::Twig;

my $twig = XML::Twig->new (comments => 'drop', pretty_print => 'indented');

$twig->parsefile("tmp.xml");
$twig->print();
__END__

#twig2.pl
#!/usr/bin/perl

use strict;
use warnings;

use Time::HiRes qw( gettimeofday tv_interval );
my $started = [gettimeofday];
END {print STDERR tv_interval($started),"\n";}

use XML::Twig;

my $twig = XML::Twig->new (comments => 'drop');

$twig->parsefile("tmp.xml");
$twig->print();
__END__

#xslt.pl
#!/usr/bin/perl

use strict;
use warnings;

use Time::HiRes qw( gettimeofday tv_interval );
my $started = [gettimeofday];
END {print STDERR tv_interval($started),"\n";}


use XML::LibXML;
use XML::LibXSLT;

my $parser = XML::LibXML->new();
my $xslt = XML::LibXSLT->new();

my $source = $parser->parse_file('tmp.xml');
my $style_doc = $parser->parse_string(<<EOT);
<?xml version='1.0'?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:strip-space elements="*"/>
<xsl:output method="xml" indent="yes"/>

    <xsl:template match="@* | node()">
        <xsl:copy>
            <xsl:apply-templates select="@* | node()" />
        </xsl:copy>
    </xsl:template>

    <xsl:template match="comment()" />

</xsl:stylesheet>
EOT

my $stylesheet = $xslt->parse_stylesheet($style_doc);

my $results = $stylesheet->transform($source);

print $stylesheet->output_string($results);
__END__

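#make_tmp.pl
#!/usr/bin/perl

use strict;
use warnings;

# Number of <Node_B> blocks to emit; eval lets you pass an expression such as 3459*30.
my $cnt = eval(shift());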

print <<'*END*';
<?xml version="1.0" encoding="UTF-8"?>
<Node_A>
*END*

while ($cnt--) {
    print <<'*END*';
    <!-- One Line Comment -->
    <Node_B>content
        <!-- Two Line Comment
Two Line Comment-->
        <Node_C>
        </Node_C>
        <!-- One Line Comment -->
        <!-- Multi  Line Comment
  Line 3Comment
  1Line Comment
  2Line Comment
  Line 5Comment
Line Comment-->
    </Node_B>
*END*
}

print <<'*END*';
</Node_A>
*END*
__END__

make_tmp.pl 3459 creates a 1MB file.

Jenda
Support Denmark!
Defend the free world!

In reply to Re^6: Removing XML comments with regex by Jenda
in thread Removing XML comments with regex by gasho
