Viki@Stag has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,

I am using XML::Simple to parse XML files. When the files are
large (>1MB), parsing takes a long time.
Is there a way to pass only part of an XML file to this module for parsing, or any other way to speed things up?


Thanks in advance
Monks

Replies are listed 'Best First'.
Re: XML::Simple help
by andreas1234567 (Vicar) on Oct 18, 2007 at 09:29 UTC
    There are several Perl modules on CPAN for parsing XML, all with different properties. The Perl-XML Frequently Asked Questions has a section called How to choose a parser module which assesses speed:
    If speed is critical, you'll find that XML::LibXML is much faster but a bit more 'bleeding edge'.
    Disclaimer: I have not used any XML module for speed critical applications.
    --
    Andreas
Re: XML::Simple help
by j1n3l0 (Friar) on Oct 18, 2007 at 08:58 UTC
    You may want to look into either XML-Records or XML-Twig for that task.

    I have not used XML-Twig myself but XML-Records can handle 4MB files.

    Not sure how fast it would be though. For me speed was not really a problem.


    Smoothie, smoothie, one hundred percent natural!
Re: XML::Simple help
by DrHyde (Prior) on Oct 18, 2007 at 10:25 UTC

    I'll hazard a guess that XML::Simple is reading the whole file, parsing it, creating a data structure and only then returning. That's always going to take time, and for a large document it'll produce a *huge* data structure which might even make your machine swap - hence the sloooooowness.

    You might want to look at a streaming parser instead, which reads the file a bit at a time, generating a series of events for you to handle. It'll be a bit more work but will save an awful lot of memory. Even if it's not any faster overall, it'll at least start giving you data sooner, so it will *appear* to be faster!
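
    A minimal sketch of that event-driven style, assuming XML::Parser from CPAN and a made-up <catalog> document. The element names and the @names list are illustrative only, not taken from the original poster's data:

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical sample document standing in for the large file.
my $xml = <<'XML';
<catalog>
  <item id="1"><name>alpha</name></item>
  <item id="2"><name>beta</name></item>
</catalog>
XML

my @names;          # only the data we care about is kept
my $in_name = 0;    # true while the parser is inside a <name> element
my $buffer  = '';   # accumulates the text of the current <name>

my $parser = XML::Parser->new(
    Handlers => {
        # Called for every opening tag.
        Start => sub {
            my ($expat, $element, %attrs) = @_;
            if ($element eq 'name') { $in_name = 1; $buffer = ''; }
        },
        # Called for character data; text may arrive in several chunks.
        Char => sub {
            my ($expat, $text) = @_;
            $buffer .= $text if $in_name;
        },
        # Called for every closing tag.
        End => sub {
            my ($expat, $element) = @_;
            if ($element eq 'name') { push @names, $buffer; $in_name = 0; }
        },
    },
);

$parser->parse($xml);
print "$_\n" for @names;
```

    For a real file, $parser->parsefile('big.xml') would replace the parse($xml) call ('big.xml' is a placeholder name). No tree is built here, so memory use stays roughly constant regardless of file size.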

Re: XML::Simple help
by Jenda (Abbot) on Oct 18, 2007 at 12:52 UTC

    It's hard to give you advice if we do not know what you plan to do with the data from the XML and/or how much of the data you even plan to use!

    Apart from the modules others have already suggested, you might try XML::Rules. (Yeah, it's mine; if I don't advertise it, no one will.) It allows you to filter the XML as it's being parsed, so you end up with only the stuff you are interested in instead of a huuuuge, deep tree containing mostly stuff you have no use for, which only occupies memory and may even force your computer to start swapping.

    You can think of XML::Rules as XML::Simple on steroids. In XML::Simple you can say that you want these tags to be represented as arrays even if there is just one, and to use an attribute as the hash key, but that's about it. XML::Rules will allow you to specify that for this tag you want just the content, for that one just this attribute, that you only want the data in this tag if the attribute foo's value is 'bar', etc. etc. etc.

Re: XML::Simple help
by Krambambuli (Curate) on Oct 18, 2007 at 10:10 UTC
    ...Is there a way to pass only part of XML files to parse using this module

    I think so, yes:
    ---
    SYNOPSIS
        use XML::Simple;
        my $ref = XMLin([<xml file or string>] [, <options>]);
    So if you have a way to cut out only parts of the original XML file, you can feed those to XML::Simple. It is hard to say, however, whether that would really speed things up.
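
    A rough sketch of that idea, assuming the interesting part can be cut out with a simple pattern. The <catalog> document and the <header> element are invented for illustration, and regex extraction like this is fragile (it breaks on nested or attribute-bearing tags), so treat it as a demonstration that XMLin accepts a string rather than as a recommended technique:

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin);

# Hypothetical sample document standing in for the large file.
my $xml = <<'XML';
<catalog>
  <header><title>Big file</title></header>
  <item id="1"><name>alpha</name></item>
  <item id="2"><name>beta</name></item>
</catalog>
XML

# Naive extraction: grab the first <header>...</header> element by pattern.
my ($fragment) = $xml =~ m{(<header>.*?</header>)}s
    or die "no <header> element found";

# XMLin treats an argument containing markup as an XML string, not a filename.
my $header = XMLin($fragment);
print $header->{title}, "\n";
```

    Only the small fragment is turned into a data structure, but the whole file still has to be read into $xml first, so any gain depends on how much of the time was going into building the tree.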

    Would it be possible to make the problematic XML, or something similar, available for testing? It might allow some benchmarking and/or insight.

      Sounds like a good idea. If not, XML::Twig can do just that; see the twig_roots mode.
Re: XML::Simple help
by toolic (Bishop) on Oct 18, 2007 at 21:01 UTC
    Yesterday's Perl Cookbook "Recipe of the Day" seems quite relevant to your problem: http://www.perl.com/cookbook/perlckbk2/solution.csp?day=2

    It comes from Perl Cookbook, section: 22.8. Processing Files Larger Than Available Memory.

    There is a small example of how to use XML::Twig to read in only portions of a large XML file.

    Note: the URL above will point to something different tomorrow.
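
    This is not the Cookbook's example, just a minimal sketch of the same approach using XML::Twig's twig_roots option. The <catalog> document, the element names and the @names list are made up for illustration:

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample document standing in for the large file.
my $xml = <<'XML';
<catalog>
  <header><title>Big file</title></header>
  <item id="1"><name>alpha</name></item>
  <item id="2"><name>beta</name></item>
</catalog>
XML

my @names;

my $twig = XML::Twig->new(
    twig_roots => {
        # Only <item> subtrees are built; the rest of the document is skipped.
        item => sub {
            my ($t, $item) = @_;
            push @names, $item->first_child_text('name');
            $t->purge;    # release memory for everything processed so far
        },
    },
);

$twig->parse($xml);
print "$_\n" for @names;
```

    For a file on disk, $twig->parsefile('big.xml') would replace the parse($xml) call ('big.xml' is a placeholder name).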