
XML::LibXML out of memory

by Anonymous Monk
on May 04, 2017 at 20:14 UTC

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks

I am parsing a big XML file (~500 MB) with the following script. I run out of memory even if I set the option huge.

#!/usr/bin/env perl
use strict;
use warnings;
use XML::LibXML;

print "Importing...\n";
my $file = 'my.xml';
my $dom = XML::LibXML->load_xml(location => $file, huge => 1,);

foreach my $termEntry ($dom->findnodes('/martif/text/body/termEntry')) {
    foreach my $lang_set ($termEntry->findnodes('langSet')) {
        my $language = $lang_set->getAttribute('xml:lang');
        foreach my $term_grp ($lang_set->findnodes('./tig')) {
            my $term = $term_grp->findvalue('./term');
            print "$language: $term\n";
        }
    }
}

print "Done!\n";
exit;

The script works perfectly with smaller files. Any suggestions?

Replies are listed 'Best First'.
Re: XML::LibXML out of memory
by choroba (Cardinal) on May 04, 2017 at 23:40 UTC
    Unfortunately, you didn't say why the file is so large, i.e. what part of the structure is repeated many times. If it's the termEntry, converting your code to use XML::LibXML::Reader is rather easy:
    #! /usr/bin/perl
    use warnings;
    use strict;
    use XML::LibXML::Reader;

    print "Importing...\n";
    my $file = 'my.xml';
    my $reader = 'XML::LibXML::Reader'->new(location => $file) or die;
    my $entry_pattern = 'XML::LibXML::Pattern'->new('/martif/text/body/termEntry');

    while ($reader->nextPatternMatch($entry_pattern)) {
        my $termEntry = $reader->copyCurrentNode(1);
        for my $lang_set ($termEntry->findnodes('langSet')) {
            my $language = $lang_set->getAttribute('xml:lang');
            for my $term_grp ($lang_set->findnodes('./tig')) {
                my $term = $term_grp->findvalue('./term');
                print "$language: $term\n";
            }
        }
    }

    print "Done!\n";

    Tested with the following input:

    Reader is a pull parser that doesn't need to load the whole file into memory. While walking the file, you can ask it to inflate the current node into a full DOM object, which is what copyCurrentNode(1) does.

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
      I made a slight mod to your code as I'm experiencing a problem.
      #! /usr/bin/perl
      use warnings;
      use strict;
      use XML::LibXML::Reader;

      print "Importing...\n";
      my $file = 'my.xml';
      my $reader = 'XML::LibXML::Reader'->new(location => $file) or die;
      my $entry_pattern = 'XML::LibXML::Pattern'->new('/martif/text/body/termEntry');

      while ($reader->nextPatternMatch($entry_pattern)) {
          my $termEntry = $reader->copyCurrentNode(1);
          print "$termEntry\n";
          for my $lang_set ($termEntry->findnodes('langSet')) {
              my $language = $lang_set->getAttribute('xml:lang');
              for my $term_grp ($lang_set->findnodes('./tig')) {
                  my $term = $term_grp->findvalue('./term');
                  print "$language: $term\n";
              }
          }
      }

      print "Done!\n";
      I get this result, but with an interesting empty (ish) node just before "Done!"
      Importing...
      <termEntry>
        <langSet xml:lang="en">
          <tig><term>English</term></tig>
          <tig><term>Saesneg</term></tig>
        </langSet>
        <langSet xml:lang="cs">
          <tig><term>Czech</term></tig>
          <tig><term>Tsieceg</term></tig>
        </langSet>
        <langSet xml:lang="de">
          <tig><term>German</term></tig>
          <tig><term>Almaeneg</term></tig>
        </langSet>
      </termEntry>
      en: English
      en: Saesneg
      cs: Czech
      cs: Tsieceg
      de: German
      de: Almaeneg
      <termEntry/>
      Done!

      Is this expected behaviour? I can't find any direct reference explaining why this should be the case.

      I've had some help on Stack Exchange which suggested this was normal behaviour, but I thought I'd ask for a second opinion.

      This link : https://metacpan.org/dist/XML-LibXML/view/lib/XML/LibXML/Reader.pod#nextPatternMatch-(compiled_pattern)

      Suggests that nextPatternMatch should "Skip nodes following the current one in the document order until an element matching a given compiled pattern is reached."

      This is ambiguous since it doesn't specify if it's "XML_READER_TYPE_ELEMENT" or "XML_READER_TYPE_END_ELEMENT" or either.

      I'm wondering if I should report a bug?

        You can check the nodetype in the condition:
        while ($reader->nextPatternMatch($entry_pattern) && $reader->nodeType == XML_READER_TYPE_ELEMENT ) {

        or, if more than one termEntry is expected,

        while ($reader->nextPatternMatch($entry_pattern)) { if ($reader->nodeType == XML_READER_TYPE_ELEMENT) { ...
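For reference, the second variant can be written out in full roughly as follows. This is only a sketch: the inline sample XML and the helper name extract_terms are made up for illustration and are not from the thread, and it assumes that `use XML::LibXML::Reader` exports the XML_READER_TYPE_* constants by default. Any pattern match that is not an opening tag (for example the closing </termEntry>) is skipped, so the loop keeps going across several termEntry elements.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use XML::LibXML::Reader;    # also provides the XML_READER_TYPE_* constants

# Hypothetical inline sample (not the poster's file) with two termEntry elements.
my $xml = <<'XML';
<martif><text><body>
  <termEntry>
    <langSet xml:lang="en"><tig><term>English</term></tig></langSet>
    <langSet xml:lang="cs"><tig><term>Czech</term></tig></langSet>
  </termEntry>
  <termEntry>
    <langSet xml:lang="de"><tig><term>German</term></tig></langSet>
  </termEntry>
</body></text></martif>
XML

# extract_terms is a hypothetical helper: it returns a list of "lang: term" strings.
sub extract_terms {
    my ($xml_string) = @_;
    my $reader = XML::LibXML::Reader->new(string => $xml_string)
        or die "Cannot create reader\n";
    my $entry_pattern
        = XML::LibXML::Pattern->new('/martif/text/body/termEntry');

    my @lines;
    while ($reader->nextPatternMatch($entry_pattern)) {
        # Only handle opening tags; skip matches such as </termEntry>.
        next unless $reader->nodeType == XML_READER_TYPE_ELEMENT;

        # Inflate just this termEntry subtree into a DOM node.
        my $termEntry = $reader->copyCurrentNode(1);
        for my $lang_set ($termEntry->findnodes('langSet')) {
            my $language = $lang_set->getAttribute('xml:lang');
            for my $term_grp ($lang_set->findnodes('./tig')) {
                push @lines, "$language: " . $term_grp->findvalue('./term');
            }
        }
    }
    return @lines;
}

print "$_\n" for extract_terms($xml);
```

To run it against a real file, replace `string => $xml_string` with `location => $file`, as in the code above.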

        map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: XML::LibXML out of memory
by marto (Cardinal) on May 04, 2017 at 20:27 UTC

    XML::Twig is my go-to module for this kind of work.

Re: XML::LibXML out of memory
by Discipulus (Canon) on May 04, 2017 at 20:49 UTC
Re: XML::LibXML out of memory
by Corion (Patriarch) on May 04, 2017 at 20:31 UTC

    How much memory does your machine have, and do you use a 64-bit Perl?

    A 32-bit Perl can address at most 4 GB of RAM, which doesn't leave much room for actual data with a 500 MB XML file, because the in-memory DOM is usually several times larger than the file itself.

    The "huge" option only relaxes some hardcoded limits within XML::LibXML, likely to protect against attacks like the Billion laughs attack.

    Maybe you will have better luck with XML::Twig instead of XML::LibXML, as XML::Twig does not have to read the complete input file at once and construct the whole representation in memory.
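One way to answer the 64-bit question is to ask the perl binary itself. The following is a minimal sketch using the core Config module; it only reports the pointer size the interpreter was built with and says nothing about how much RAM the machine has.

```perl
use strict;
use warnings;
use Config;    # core module describing how this perl was built

# ptrsize is the pointer size in bytes: 8 on a 64-bit build, 4 on a 32-bit one.
my $bits = $Config{ptrsize} * 8;
print "This perl ($Config{archname}) uses $bits-bit pointers\n";
```

The same value is available from the command line with `perl -V:ptrsize`.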

Re: XML::LibXML out of memory
by haukex (Archbishop) on May 05, 2017 at 07:21 UTC

    Based on the sample input posted here, this produces the same output as the original code:

    use warnings;
    use strict;
    use XML::Twig;

    my $file = 'input.xml';
    XML::Twig->new(
        twig_roots => {
            '/martif/text/body/termEntry/*' => sub {
                my ($t, $elt) = @_;
                for my $e ($elt->get_xpath('./tig/term')) {
                    print $elt->{att}->{'xml:lang'}, ": ", $e->text_only, "\n";
                }
                $t->purge;
            },
        },
    )->parsefile($file);

    You could add use open qw/:std :utf8/; at the top to get the output printed to STDOUT in UTF-8 as well.

Re: XML::LibXML out of memory
by Anonymous Monk on May 04, 2017 at 21:24 UTC
    Can you show us the smallest possible input file that still has all the necessary elements to produce a few lines of output, and the expected output for that input file?
Re: XML::LibXML out of memory
by Anonymous Monk on May 05, 2017 at 03:21 UTC

    This is a part of the XML I am working on

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE martif SYSTEM "TBXcoreStructV02.dtd">
    <martif type="TBX-Default" xml:lang="en">
      <text>
        <body>
          <termEntry id="113147">
            <langSet xml:lang="en">
              <tig> <term>Ind.</term> </tig>
              <tig> <term>Independent</term> </tig>
            </langSet>
            <langSet xml:lang="fr">
              <tig> <term>Ind.</term> </tig>
              <tig> <term>Indépendant</term> </tig>
            </langSet>
          </termEntry>
          <termEntry id="118136">
            <langSet xml:lang="en">
              <tig> <term>Working Party on the Election Campaign</term> </tig>
            </langSet>
            <langSet xml:lang="fr">
              <tig> <term>Groupe de travail "Campagne électorale"</term> </tig>
            </langSet>
          </termEntry>
        </body>
      </text>
    </martif>

    I was trying to avoid XML::Twig, as suggested, so that I would not have to learn a new parsing module...

        Tried. And it works perfectly no matter how big the XML is!

Node Type: perlquestion [id://1189523]
Approved by stevieb