XML::LibXML out of memory

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: XML::LibXML out of memory by choroba (Cardinal) on May 04, 2017 at 23:40 UTC
Unfortunately, you didn't say why the file is so large, i.e. what part of the structure is repeated many times. If it's the `termEntry`, turning your code to use XML::LibXML::Reader is rather easy:

```perl
#! /usr/bin/perl
use warnings;
use strict;

use XML::LibXML::Reader;

print "Importing...\n";

my $file = 'my.xml';
my $reader = 'XML::LibXML::Reader'->new(location => $file) or die;

my $entry_pattern = 'XML::LibXML::Pattern'->new('/martif/text/body/termEntry');

while ($reader->nextPatternMatch($entry_pattern)) {
    my $termEntry = $reader->copyCurrentNode(1);
    for my $lang_set ($termEntry->findnodes('langSet')) {
        my $language = $lang_set->getAttribute('xml:lang');
        for my $term_grp ($lang_set->findnodes('./tig')){
            my $term = $term_grp->findvalue('./term');
            print "$language: $term\n";
        }
    }
}
print "Done!\n";
```

Tested with a small sample input.

Reader is a pull parser that doesn't need to load the whole file into memory, but while walking it, you can ask it to inflate the current node into the whole DOM object (which is what `copyCurrentNode(1)` does).
Re^2: XML::LibXML out of memory by Anonymous Monk on Mar 24, 2022 at 11:52 UTC
I made a slight mod to your code as I'm experiencing a problem.

```perl
#! /usr/bin/perl
use warnings;
use strict;

use XML::LibXML::Reader;

print "Importing...\n";

my $file = 'my.xml';
my $reader = 'XML::LibXML::Reader'->new(location => $file) or die;

my $entry_pattern = 'XML::LibXML::Pattern'->new('/martif/text/body/termEntry');

while ($reader->nextPatternMatch($entry_pattern)) {
    my $termEntry = $reader->copyCurrentNode(1);
    print "$termEntry\n";
    for my $lang_set ($termEntry->findnodes('langSet')) {
        my $language = $lang_set->getAttribute('xml:lang');
        for my $term_grp ($lang_set->findnodes('./tig')){
            my $term = $term_grp->findvalue('./term');
            print "$language: $term\n";
        }
    }
}
print "Done!\n";
```

I get this result, but with an interesting empty(ish) node just before "Done!":

```
Importing...
<termEntry>
  <langSet xml:lang="en">
    <tig><term>English</term></tig>
    <tig><term>Saesneg</term></tig>
  </langSet>
  <langSet xml:lang="cs">
    <tig><term>Czech</term></tig>
    <tig><term>Tsieceg</term></tig>
  </langSet>
  <langSet xml:lang="de">
    <tig><term>German</term></tig>
    <tig><term>Almaeneg</term></tig>
  </langSet>
</termEntry>
en: English
en: Saesneg
cs: Czech
cs: Tsieceg
de: German
de: Almaeneg
<termEntry/>
Done!
```

Is this expected behaviour? I can't find any direct reference explaining why this should be the case. I've had some help on Stack Exchange, which suggested this was normal behaviour, but I thought I'd ask for a second opinion.

This link: https://metacpan.org/dist/XML-LibXML/view/lib/XML/LibXML/Reader.pod#nextPatternMatch-(compiled_pattern) says that nextPatternMatch should "Skip nodes following the current one in the document order until an element matching a given compiled pattern is reached." This is ambiguous, since it doesn't specify whether that means "XML_READER_TYPE_ELEMENT", "XML_READER_TYPE_END_ELEMENT", or either. I'm wondering if I should report a bug?
Re^3: XML::LibXML out of memory by choroba (Cardinal) on Mar 24, 2022 at 12:01 UTC
You can check the node type in the condition:

```perl
while ($reader->nextPatternMatch($entry_pattern)
       && $reader->nodeType == XML_READER_TYPE_ELEMENT
) {
```

or, if more than one termEntry is expected (the combined condition above ends the loop at the first closing tag):

```perl
while ($reader->nextPatternMatch($entry_pattern)) {
    if ($reader->nodeType == XML_READER_TYPE_ELEMENT) {
        ...
```
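For anyone who wants to try the second form in isolation, here is a minimal self-contained sketch. It assumes a made-up two-entry document parsed from a string (the `$xml` content and the `@terms` array are illustrative, not from the posts above), and it relies on the behaviour reported in this thread, namely that the pattern also matches the closing tag.

```perl
#!/usr/bin/perl
use warnings;
use strict;

use XML::LibXML::Reader;   # exports the XML_READER_TYPE_* constants by default

# Hypothetical input, inlined so that the sketch needs no external file.
my $xml = <<'XML';
<martif><text><body>
<termEntry><langSet xml:lang="en"><tig><term>English</term></tig></langSet></termEntry>
<termEntry><langSet xml:lang="cs"><tig><term>Czech</term></tig></langSet></termEntry>
</body></text></martif>
XML

my $reader  = XML::LibXML::Reader->new(string => $xml) or die "Cannot create reader\n";
my $pattern = XML::LibXML::Pattern->new('/martif/text/body/termEntry');

my @terms;
while ($reader->nextPatternMatch($pattern)) {
    # Ignore matches on </termEntry>; only opening tags carry content.
    next unless $reader->nodeType == XML_READER_TYPE_ELEMENT;

    # Inflate just this entry into a DOM subtree.
    my $entry = $reader->copyCurrentNode(1);
    for my $lang_set ($entry->findnodes('langSet')) {
        my $language = $lang_set->getAttribute('xml:lang');
        for my $term_grp ($lang_set->findnodes('./tig')) {
            push @terms, "$language: " . $term_grp->findvalue('./term');
        }
    }
}
print "$_\n" for @terms;
```

Using `next` inside the loop, rather than putting the node type test into the `while` condition, keeps the reader going past each closing tag, so every `termEntry` in the file gets processed.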
Re^4: XML::LibXML out of memory by Anonymous Monk on Mar 24, 2022 at 13:49 UTC
Re: XML::LibXML out of memory by marto (Cardinal) on May 04, 2017 at 20:27 UTC
XML::Twig is my go-to module for this kind of work.
Re: XML::LibXML out of memory by Discipulus (Canon) on May 04, 2017 at 20:49 UTC
Me too, I'd go with XML::Twig. The module has its own site full of examples. In your specific situation you can profit from the `flush` or `purge` XML::Twig methods to free the memory held by the twig so far (they are described in section 4.3 of the XML::Twig tutorial). See Processing_an_XML_document_chunk_by_chunk(saving memory) and also the Twig quick reference as side reading to the official CPAN documentation. Anyway, I'm sorry for your 500 MB of XML. ;=)
Re: XML::LibXML out of memory by Corion (Patriarch) on May 04, 2017 at 20:31 UTC
How much memory does your machine have, and do you use a 64-bit Perl? A 32-bit Perl is restricted to at most 4GB of RAM, which doesn't leave much room for actual data with a 500MB XML file. The "huge" option only relaxes some hardcoded limits within XML::LibXML, likely there to protect against attacks like the Billion laughs attack. Maybe you'll have better luck with XML::Twig instead of XML::LibXML, as XML::Twig will not read the complete input file at once and construct the representation in memory.
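If you are not sure which kind of Perl you are running, the core Config module can tell you. A quick sketch (the variable name `$pointer_bits` is just illustrative):

```perl
use warnings;
use strict;

use Config;   # core module describing how this perl was built

# ptrsize is the pointer size in bytes: 4 on a 32-bit build, 8 on a 64-bit one.
my $pointer_bits = $Config{ptrsize} * 8;
print "perl $Config{version} on $Config{archname}: ${pointer_bits}-bit pointers\n";
```

Running `perl -V:ptrsize` on the command line reports the same value without a script.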
Re: XML::LibXML out of memory by haukex (Archbishop) on May 05, 2017 at 07:21 UTC
Based on the sample input posted here, this produces the same output as the original code:

```perl
use warnings;
use strict;
use XML::Twig;

my $file = 'input.xml';

XML::Twig->new(
    twig_roots => {
        '/martif/text/body/termEntry/*' => sub {
            my ($t, $elt) = @_;
            for my $e ($elt->get_xpath('./tig/term')) {
                print $elt->{att}->{'xml:lang'}, ": ", $e->text_only, "\n";
            }
            $t->purge;
        },
    },
)->parsefile($file);
```

You could add `use open qw/:std :utf8/;` at the top to get the output printed to `STDOUT` in UTF-8 as well.
Re: XML::LibXML out of memory by Anonymous Monk on May 04, 2017 at 21:24 UTC
Can you show us the smallest possible input file that still has all the necessary elements to produce a few lines of output, and the expected output for that input file?
Re: XML::LibXML out of memory by Anonymous Monk on May 05, 2017 at 03:21 UTC
This is a part of the XML I am working on:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE martif SYSTEM "TBXcoreStructV02.dtd">
<martif type="TBX-Default" xml:lang="en">
  <text>
    <body>
      <termEntry id="113147">
        <langSet xml:lang="en">
          <tig> <term>Ind.</term> </tig>
          <tig> <term>Independent</term> </tig>
        </langSet>
        <langSet xml:lang="fr">
          <tig> <term>Ind.</term> </tig>
          <tig> <term>Indépendant</term> </tig>
        </langSet>
      </termEntry>
      <termEntry id="118136">
        <langSet xml:lang="en">
          <tig> <term>Working Party on the Election Campaign</term> </tig>
        </langSet>
        <langSet xml:lang="fr">
          <tig> <term>Groupe de travail "Campagne électorale"</term> </tig>
        </langSet>
      </termEntry>
    </body>
  </text>
</martif>
```

I was trying to avoid XML::Twig, as suggested, in order not to have to learn a new parsing module...
Re^2: XML::LibXML out of memory by kevbot (Vicar) on May 05, 2017 at 05:38 UTC
Did you try the code posted by choroba in Re: XML::LibXML out of memory? Does it work well with your large input file?
Re^3: XML::LibXML out of memory by Anonymous Monk on May 05, 2017 at 14:45 UTC
Tried. And it works perfectly no matter how big the XML is!

