If you use a Perl module, I'd recommend XML::LibXSLT, which uses the libxslt library under the hood, so Perl's "speed" or lack thereof should not be an issue. I wouldn't recommend XML::XSLT for any serious XSLT processing, though.
| [reply] |
Extremely fast? It doesn't really look like it, at least in some cases, especially as the processed XML grows. Have a look, e.g., at the benchmark section in this document about XStream. It doesn't seem to handle huge documents so well either.
And I would definitely not call that "simple touch of XPath".
I'll see if I can find time during the weekend to implement the comment stripping using a few modules and benchmark it against XSLT. I bet XSLT is overkill.
| [reply] |
As a very rough benchmark, I created a ~20MB XML file (I took the doc in the OP and copied the middle part over and over). I filtered the doc using XSLT (with XML::LibXSLT and the XSLT in the node above) and using XML::Twig (the solution elsewhere in this thread). The XSLT run took about 3 seconds, the XML::Twig run about 20. Both used vast quantities of memory.
And though I'm not familiar with XML::Parser::Expat, I hacked together something which seems to work (though I am likely missing something for some types of XML content), and ran in about 3 seconds without using much memory at all.
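For readers who want to try the streaming approach, here is a minimal sketch of a comment stripper built on XML::Parser::Expat. It is not the script that was benchmarked in this thread; the sub name `strip_comments` and the command-line handling are assumptions made for illustration. It relies on expat passing every event that has no handler of its own to the Default handler as source text, so registering an empty Comment handler is enough to drop the comments. It has not been checked against DOCTYPE subsets, CDATA sections or unusual encodings.

```perl
use strict;
use warnings;
use XML::Parser::Expat;

# Stream the input through expat and echo everything except comments.
# $in may be a string or an open filehandle; $out is an output filehandle.
sub strip_comments {
    my ($in, $out) = @_;
    my $parser = XML::Parser::Expat->new;
    $parser->setHandlers(
        # Empty handler: registering it keeps comments away from Default.
        Comment => sub { },
        # Events without a handler of their own arrive here as source text.
        Default => sub { print {$out} $_[1] },
    );
    $parser->parse($in);
    $parser->release;    # break the parser's internal circular references
    return;
}

# Hypothetical usage: perl strip_comments.pl input.xml > output.xml
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    strip_comments($fh, \*STDOUT);
    close $fh;
}
```

Because nothing is kept between events, memory use should stay roughly flat regardless of the document size, which matches the behaviour described above.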
Update: I repeated the XSLT run with a 40MB file, and it took just a few more seconds. I wonder if Jenda is using XML::XSLT or XML::LibXSLT below. (An 80MB file took ~25 secs with XSLT and ~20 with expat.) Versions: XML::LibXSLT 1.62, XML::LibXML 1.63.
Here is what I used:
| [reply] [d/l] |
Looks like you stopped right before XSLT started to lose its breath. I tried the same with a 30MB XML and it's still busy after five minutes. expat took 9s, or 8.5s if I do not copy stuff from @_.
I'm not surprised XML::Twig did poorly in this test. This is exactly the wrong task for the module: it has to parse the whole XML into a maze of objects and is then forced to stringify it again immediately.
I'll post the results here after the XSLT run finishes, and I'll rerun the tests for a few different XML sizes.
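A run over several sizes can be scripted along these lines. This is only a sketch using core modules: `make_xml`, `time_it`, the element names and the regex workload are all made up for illustration, and the workload is a stand-in rather than any of the scripts compared in this thread.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Build a test document of roughly $bytes bytes by repeating one record.
# The element names are invented; the OP's real document is not reproduced.
sub make_xml {
    my ($bytes) = @_;
    my $record = "<item><!-- a comment --><name>value</name></item>\n";
    my $copies = int($bytes / length $record) || 1;
    return "<root>\n" . ($record x $copies) . "</root>\n";
}

# Run $code once and return the elapsed wall-clock time in seconds.
sub time_it {
    my ($code) = @_;
    my $start = [gettimeofday];
    $code->();
    return tv_interval($start);
}

# Stand-in workload so the sketch runs on its own; replace it with a call
# to the real filter (expat, twig or XSLT based) to compare approaches.
my $workload = sub {
    my ($xml) = @_;
    my $count = () = $xml =~ /<item>/g;
    return $count;
};

for my $mb (1, 2, 3) {
    my $xml     = make_xml($mb * 1024 * 1024);
    my $seconds = time_it(sub { $workload->($xml) });
    printf "%2dMB  %.6f s  (%.6f s/MB)\n", $mb, $seconds, $seconds / $mb;
}
```

Printing the time per MB next to the total makes it easy to see whether a filter scales linearly with the input size or degrades as the file grows.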
Update: XSLT just finished after more than 12 minutes.
Update 2: here are the results. It looks like XSLT is doing worse than expected, at least on my computer:
| script (times in seconds) | 1MB      | 2MB      | 3MB      | 4MB      | 5MB      | 10MB      | 15MB       | 20MB       | 30MB       | 30MB/30       |
| expat.pl                  | 0.30489  | 0.613526 | 0.922916 | 1.4342   | 1.802648 | 3.046866  | 4.878799   | 6.116883   | 9.050448   | 0.3016816     |
| expat2.pl                 | 0.28943  | 0.5687   | 0.857198 | 1.129559 | 1.425147 | 2.819807  | 4.237721   | 5.88283    | 8.505741   | 0.2835247     |
| twig.pl                   | 1.213592 | 2.406564 | 3.598347 | 4.79183  | 6.051234 | 12.554845 | 19.845587  | 26.74665   | crash      | 1.3373325(*)  |
| twig2.pl                  | 1.10568  | 2.213843 | 3.307619 | 4.474567 | 5.619435 | 11.919388 | 18.024169  | 24.710547  | crash      | 1.23552735(*) |
| xslt.pl                   | 0.255876 | 0.80157  | 1.724701 | 2.540583 | 5.730051 | 22.825532 | 184.643654 | 248.187591 | 726.690777 | 24.2230259    |
expat2.pl is the same as your expat.pl except that it uses @_ directly; twig2.pl is like twig.pl but omits the pretty_print => 'indented'. Maybe you have more memory. I have XML::Parser::Expat 2.34, XML::Twig 3.19, XML::LibXML 1.56 and XML::LibXSLT 1.53, running on Windows Vista Home Pro with 3GB of memory (I have to admit I did not close all apps).
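To illustrate the @_ difference in isolation, here is a small sketch using the core Benchmark module. The two subs are hypothetical stand-ins for a parser handler, not the code from expat.pl or expat2.pl, and the 200-character chunk is an arbitrary choice.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Handler that copies its arguments into lexicals first (the usual idiom).
sub copy_args {
    my ($parser, $string) = @_;
    return length $string;
}

# Handler that reads the argument straight from @_, skipping the copy.
sub direct_args {
    return length $_[1];
}

my $chunk = 'x' x 200;    # stand-in for a piece of text handed to a handler

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    copy   => sub { copy_args(undef, $chunk) },
    direct => sub { direct_args(undef, $chunk) },
});
```

The saving per call is tiny, but a handler is invoked once per parser event, so on a document with millions of events it can add up to a measurable fraction of the run time.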
(*) As twig.pl and twig2.pl crashed on me for the 30MB XML, this is the time per 1MB for the 20MB XML.
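The per-MB column is simply the total time divided by the file size. A short sketch of that arithmetic, using the totals reported above (the sub name `per_mb` is just for illustration):

```perl
use strict;
use warnings;

# Total run times in seconds, taken from the results above, together with
# the size in MB of the largest file each script finished.
my @runs = (
    [ 'expat.pl',  30,   9.050448 ],
    [ 'expat2.pl', 30,   8.505741 ],
    [ 'twig.pl',   20,  26.74665  ],
    [ 'twig2.pl',  20,  24.710547 ],
    [ 'xslt.pl',   30, 726.690777 ],
);

# Seconds per megabyte for one run.
sub per_mb {
    my ($seconds, $mb) = @_;
    return $seconds / $mb;
}

for my $run (@runs) {
    my ($script, $mb, $seconds) = @$run;
    printf "%-10s %2dMB  %.7f s/MB\n", $script, $mb, per_mb($seconds, $mb);
}
```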
| [reply] [d/l] [select] |
So I am hearing that XML::Twig is a performance pig.
| [reply] |