PerlMonks  

Re^4: Removing XML comments with regex

by Jenda (Abbot)
on Oct 25, 2007 at 11:15 UTC (id://647135)


in reply to Re^3: Removing XML comments with regex
in thread Removing XML comments with regex

Extremely fast? It doesn't really look like it, at least in some cases, especially as the processed XML grows. Have a look, e.g., at the benchmark section in this document about XStream. It doesn't seem to handle huge documents so well either.

And I would definitely not call that "simple touch of XPath".

I'll see if I can find time during the weekend to implement the comment stripping using a few modules and benchmark it against XSLT. I bet XSLT is overkill.

Replies are listed 'Best First'.
Re^5: Removing XML comments with regex
by runrig (Abbot) on Dec 28, 2007 at 20:59 UTC

    As a very rough benchmark, I created a ~20MB XML file (I took the doc in the OP and copied the middle part over and over). I filtered the doc using XSLT (using XML::LibXSLT and the XSLT in the node above) and using XML::Twig (the solution elsewhere in this thread). XSLT took about 3 seconds; XML::Twig took about 20. Both used vast quantities of memory.

    And though I'm not familiar with XML::Parser::Expat, I hacked together something which seems to work (though I am likely missing something for some types of XML content), and it ran in about 3 seconds without using much memory at all.
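For readers who want to picture the streaming approach described above, here is a minimal sketch. It is not the script runrig posted; the helper name strip_comments and the sample input are made up for illustration. It relies on XML::Parser::Expat passing everything that has no handler of its own to the Default handler as original text, so registering an empty Comment handler is enough to drop comments. Like the original hack, it probably misses some kinds of XML content.

```perl
use strict;
use warnings;
use XML::Parser::Expat;

# Hypothetical helper (not from the thread): copy XML from $source (a string
# or an open filehandle) to $out_fh, leaving out comments.
sub strip_comments {
    my ($source, $out_fh) = @_;
    my $parser = XML::Parser::Expat->new;
    $parser->setHandlers(
        # Comments get their own handler, which simply discards them.
        Comment => sub { },
        # Anything without a dedicated handler arrives here as original text.
        Default => sub { print {$out_fh} $_[1] },
    );
    $parser->parse($source);
    $parser->release;    # break the parser's internal circular references
    return;
}

strip_comments('<doc><!-- drop me --><item>kept</item></doc>', \*STDOUT);
print "\n";
```

Because each chunk is printed as soon as it is parsed, nothing is accumulated in memory, which fits the low memory use reported for the expat run.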

    Update: I repeated the XSLT run with a 40MB file, and it took just a few more seconds. I wonder if Jenda is using XML::XSLT or XML::LibXSLT below. (An 80MB file took ~25 secs with XSLT and ~20 with expat.) I have XML::LibXSLT 1.62 and XML::LibXML 1.63. Here is what I used:

      Looks like you left off exactly before XSLT started to lose its breath. I tried the same with a 30MB XML and it's still busy after five minutes. expat took 9s, or 8.5s if I do not copy stuff from @_.

      I'm not surprised XML::Twig was doing poorly in this test. This is exactly the wrong task for the module. It has to parse the whole XML into a maze of objects and is then forced to stringify it again immediately.

      I'll post the results here after the XSLT run finishes, and I will run the test for a few different sizes of XML.

      Update: XSLT just finished after more than 12 minutes.

      Update 2: Here are the results. It looks like XSLT is doing worse than expected, at least on my computer:
      (all times in seconds; the last column is seconds per MB)
      Script     1MB       2MB       3MB       4MB       5MB       10MB       15MB        20MB        30MB        30MB/30
      expat.pl   0.30489   0.613526  0.922916  1.4342    1.802648  3.046866   4.878799    6.116883    9.050448    0.3016816
      expat2.pl  0.28943   0.5687    0.857198  1.129559  1.425147  2.819807   4.237721    5.88283     8.505741    0.2835247
      twig.pl    1.213592  2.406564  3.598347  4.79183   6.051234  12.554845  19.845587   26.74665    crash       1.3373325(*)
      twig2.pl   1.10568   2.213843  3.307619  4.474567  5.619435  11.919388  18.024169   24.710547   crash       1.23552735(*)
      xslt.pl    0.255876  0.80157   1.724701  2.540583  5.730051  22.825532  184.643654  248.187591  726.690777  24.2230259

      expat2.pl is the same as your expat.pl except that it uses @_ directly; twig2.pl is like twig.pl but omits the pretty_print => 'indented'. Maybe you have more memory. I have XML::Parser::Expat 2.34, XML::Twig 3.19, XML::LibXML 1.56 and XML::LibXSLT 1.53, and Windows Vista Home Pro with 3GB of memory (I have to admit I did not close all apps).
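To make the "uses @_ directly" difference concrete, here is a small sketch. The two subs are hypothetical stand-ins for an expat Default handler, which is called as handler($parser, $string); neither is taken from expat.pl or expat2.pl. The first copies its arguments into lexicals, the second reads the aliased element $_[1] without copying.

```perl
use strict;
use warnings;

# Hypothetical handlers, written only to show the two argument styles.
my $copied = '';
sub handler_copying {
    my ($parser, $string) = @_;    # copies both arguments into lexicals
    $copied .= $string;
}

my $direct = '';
sub handler_direct {
    $direct .= $_[1];              # reads the aliased argument, no copy made
}

# Feed both handlers the same chunks; undef stands in for the parser object.
for my $chunk ('<a>', 'text', '</a>') {
    handler_copying(undef, $chunk);
    handler_direct(undef, $chunk);
}

print $copied eq $direct ? "same output\n" : "different output\n";
```

The output is identical either way; the saving comes only from skipping one copy per callback, which adds up when the handler fires once for every chunk of a multi-megabyte file.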

      (*) As it crashed on me for the 30MB XML, this is the time per 1MB measured on the 20MB XML.
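The per-MB column can be recomputed from the totals in the table. The sketch below does just that; the hash layout and the helper name seconds_per_mb are illustrative choices, while the numbers are copied from the table (the two twig scripts use their 20MB runs, as the footnote explains).

```perl
use strict;
use warnings;

# [total seconds, file size in MB], copied from the results table.
my %run = (
    'expat.pl'  => [9.050448,   30],
    'expat2.pl' => [8.505741,   30],
    'twig.pl'   => [26.74665,   20],    # 30MB run crashed
    'twig2.pl'  => [24.710547,  20],    # 30MB run crashed
    'xslt.pl'   => [726.690777, 30],
);

# Hypothetical helper: average processing time per megabyte.
sub seconds_per_mb {
    my ($seconds, $megabytes) = @_;
    return $seconds / $megabytes;
}

for my $script (sort keys %run) {
    printf "%-10s %.4f s/MB\n", $script, seconds_per_mb(@{ $run{$script} });
}
```

Comparing against the 1MB column shows the scaling: expat stays near 0.3 s/MB at both sizes, while xslt.pl goes from about 0.26 s/MB at 1MB to about 24.2 s/MB at 30MB, roughly 95 times slower per megabyte.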

      So I am hearing that XML::Twig is a performance pig.
        As Jenda mentions below, XML::Twig is the wrong solution for this problem (update: when exceptional performance is an issue -- or, if you just mean memory consumption, that was my own fault, easily corrected by mirod below). I pretty much knew as much before I started, but it was an easily available solution from this thread, so I tried it.
