PerlMonks  

Re^3: Removing XML comments with regex

by eff_i_g (Curate)
on Oct 25, 2007 at 04:08 UTC ( [id://647072]=note )


in reply to Re^2: Removing XML comments with regex
in thread Removing XML comments with regex

It's powerful, very easy to customize, extremely fast, and handles huge XML documents quite well :) And besides, it isn't simply outputting "Hello World." It's recursing through various structures/nodes to duplicate an entire document, allowing fine-tuned controls with the simple touch of XPath. Hurrah!

I'd be curious to see how it benchmarks against some of Perl's XML modules. Perhaps it's overkill for something as simple as comment removal, perhaps not.

Replies are listed 'Best First'.
Re^4: Removing XML comments with regex
by runrig (Abbot) on Oct 25, 2007 at 04:52 UTC
    If you use a perl module, I'd recommend XML::LibXSLT which uses the libxslt library under the hood, so perl's "speed" or lack thereof should not be an issue. I wouldn't recommend XML::XSLT for any serious XSLT processing though.
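For readers who have not used XML::LibXSLT, here is a minimal sketch of comment stripping with it. The stylesheet is a generic identity transform plus an empty template for comments. It is an assumption for illustration, not necessarily the XSLT posted earlier in the thread, and the sub name `strip_comments_xslt` is made up.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXSLT;

# Assumed stylesheet: copy every attribute and node, but emit nothing for comments.
# The explicit priority avoids a tie between node() and comment().
my $xsl = <<'XSL';
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <xsl:template match="comment()" priority="1"/>
</xsl:stylesheet>
XSL

# Hypothetical helper: takes an XML string, returns it without comments.
sub strip_comments_xslt {
    my ($xml) = @_;
    my $parser     = XML::LibXML->new;
    my $xslt       = XML::LibXSLT->new;
    my $stylesheet = $xslt->parse_stylesheet($parser->parse_string($xsl));
    my $result     = $stylesheet->transform($parser->parse_string($xml));
    return $stylesheet->output_string($result);
}

print strip_comments_xslt('<doc><!-- gone --><a>kept<!-- also gone --></a></doc>');
```

Note that both the source document and the result are held in memory as full trees, which fits the memory use reported further down the thread.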
Re^4: Removing XML comments with regex
by Jenda (Abbot) on Oct 25, 2007 at 11:15 UTC

    Extremely fast? It doesn't really look like it, at least in some cases, especially as the processed XML grows. Have a look, e.g., at the benchmark section in this document about XStream. It doesn't seem to handle huge documents so well either.

    And I would definitely not call that "simple touch of XPath".

    I'll see if I can find time during the weekend to implement the comment stripping using a few modules and benchmark it against XSLT. I bet it's overkill.

      As a very rough benchmark, I created a ~20MB XML file (I took the doc in the OP and copied the middle part over and over). I filtered the doc using XSLT (with XML::LibXSLT and the XSLT in the node above) and using XML::Twig (the solution elsewhere in this thread). The XSLT run took about 3 seconds; the XML::Twig run took about 20. Both used vast quantities of memory.

      And though I'm not familiar with XML::Parser::Expat, I hacked together something which seems to work (though I am likely missing something for some types of XML content), and ran in about 3 seconds without using much memory at all.
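A rough sketch of what a streaming filter along these lines could look like. This is not the script used for the timings in this thread, just an illustration built on documented XML::Parser::Expat calls (`setHandlers`, `original_string`, `parse`, `release`). The sub name `strip_comments_expat` is made up, and like the version described above it may well mishandle some kinds of XML content.

```perl
use strict;
use warnings;
use XML::Parser::Expat;

# Hypothetical helper: takes an XML string, returns it without comments.
sub strip_comments_expat {
    my ($xml) = @_;
    my $out = '';

    # Copy the text exactly as it appeared in the input.
    # $_[0] is the parser object, read straight from @_ without copying.
    my $copy = sub { $out .= $_[0]->original_string };

    my $parser = XML::Parser::Expat->new;
    $parser->setHandlers(
        Start   => $copy,
        End     => $copy,
        Char    => $copy,
        Comment => sub { },                  # swallow comments
        Default => sub { $out .= $_[1] },    # anything without its own handler
    );
    $parser->parse($xml);
    $parser->release;                        # break the parser's circular references
    return $out;
}

print strip_comments_expat('<doc><!-- gone --><a>kept</a></doc>'), "\n";
```

For a real file you would print each piece as it arrives rather than append to `$out`, since never holding the whole document is where a streaming parser saves memory.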

      Update: repeated XSLT with 40MB file, took just a few more seconds. I wonder if Jenda is using XML::XSLT or XML::LibXSLT below. (and an 80MB file took ~25 secs w/XSLT and ~20 w/expat) (XML::LibXSLT 1.62, XML::LibXML 1.63). Here is what I used:

        Looks like you left off exactly before XSLT started to lose its breath. I tried the same with a 30MB XML file and it's still busy after five minutes. expat took 9s, or 8.5s if I do not copy stuff from @_.

        I'm not surprised XML::Twig was doing poorly in this test. This is exactly the wrong task for the module. It has to parse the whole XML into a maze of objects and is then forced to stringify it again immediately.
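To make the parse-then-stringify point concrete, here is a minimal sketch of whole-document comment stripping with XML::Twig. It is an assumption about what such a script might look like, not the twig.pl benchmarked here, and the sub name `strip_comments_twig` is made up.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: takes an XML string, returns it without comments.
sub strip_comments_twig {
    my ($xml) = @_;
    # comments => 'drop' discards comments while the tree is being built.
    my $twig = XML::Twig->new(comments => 'drop');
    $twig->parse($xml);       # builds the full tree of element objects in memory
    return $twig->sprint;     # then serializes the whole tree back to a string
}

print strip_comments_twig('<doc><!-- gone --><a>kept</a></doc>'), "\n";
```

Every element becomes a Perl object before anything is written out, so both the time and the memory grow with the size of the document.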

        I'll post the results here after the XSLT run finishes, and I'll run the tests for a few different sizes of XML.

        Update: XSLT just finished after more than 12 minutes.

        Update 2: here are the results (in seconds). It looks like XSLT is doing worse than expected, at least on my computer:
                   1MB       2MB       3MB       4MB       5MB       10MB       15MB        20MB        30MB        30MB/30
        expat.pl   0.30489   0.613526  0.922916  1.4342    1.802648  3.046866   4.878799    6.116883    9.050448    0.3016816
        expat2.pl  0.28943   0.5687    0.857198  1.129559  1.425147  2.819807   4.237721    5.88283     8.505741    0.2835247
        twig.pl    1.213592  2.406564  3.598347  4.79183   6.051234  12.554845  19.845587   26.74665    crash       1.3373325(*)
        twig2.pl   1.10568   2.213843  3.307619  4.474567  5.619435  11.919388  18.024169   24.710547   crash       1.23552735(*)
        xslt.pl    0.255876  0.80157   1.724701  2.540583  5.730051  22.825532  184.643654  248.187591  726.690777  24.2230259

        expat2.pl is the same as your expat.pl except that it uses @_ directly; twig2.pl is like twig.pl but omits the pretty_print => 'indented'. Maybe you have more memory. I have XML::Parser::Expat 2.34, XML::Twig 3.19, XML::LibXML 1.56 and XML::LibXSLT 1.53, on Windows Vista Home Pro with 3GB of memory (I have to admit I did not close all apps).

        (*) As it crashed on me for the 30MB XML, this is the time per 1MB for the 20MB XML.
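The last column is just the run time divided by the file size. A small sketch that does the same division at several sizes makes the scaling easier to see. The timings are copied from the table; the `%timings` hash and the `seconds_per_mb` sub are names invented for this example.

```perl
use strict;
use warnings;

# Run times in seconds from the table, keyed by script and file size in MB.
my %timings = (
    'expat.pl' => { 1 => 0.30489,  10 => 3.046866,  30 => 9.050448 },
    'xslt.pl'  => { 1 => 0.255876, 10 => 22.825532, 30 => 726.690777 },
);

# Hypothetical helper: seconds spent per MB of input for one table cell.
sub seconds_per_mb {
    my ($script, $size_mb) = @_;
    return $timings{$script}{$size_mb} / $size_mb;
}

for my $script (sort keys %timings) {
    for my $mb (sort { $a <=> $b } keys %{ $timings{$script} }) {
        printf "%-9s %2dMB  %.4f s/MB\n", $script, $mb, seconds_per_mb($script, $mb);
    }
}
```

expat.pl stays at roughly 0.3 seconds per MB at every size, which is linear scaling, while xslt.pl goes from about 0.26 seconds per MB at 1MB to about 24 at 30MB.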

        So I am hearing that XML::Twig is a performance pig.
