IB2017 has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks

I am parsing an XML file using XML::LibXML::Reader and the following (stripped down) code:

use strict;
use warnings;
use XML::LibXML::Reader;

my $XMLfile = 'myfile.xml';
print "Parsing...\n";

my $reader = 'XML::LibXML::Reader'->new(location => $XMLfile) or die print "Error";
my $entry_pattern = 'XML::LibXML::Pattern'->new('/martif/text/body/termEntry');
my $counter=5000;
my $signalprint=0;

while ($reader->nextPatternMatch($entry_pattern)) {
    my $ID;
    my $value="";
    my $info="";
    my $termEntry = $reader->copyCurrentNode(1);

    for my $subject ($termEntry->findnodes('descripGrp')) {
        $info = $subject->findvalue('./descrip');
    }

    for my $lang_set ($termEntry->findnodes('langSet')) {
        my $ID = $lang_set->getAttribute('xml:lang');
        for my $term_grp ($lang_set->findnodes('./tig')){
            $value = $term_grp->findvalue('./term');
        }
    }

    $counter--;
    #committing to database every $counter executions
    if ($counter eq 0){
        $counter=5000;
        $signalprint = $signalprint + $counter;
        print "Done $signalprint\n";
    }
}
print "Finished!\n";

When the script reaches the end, perl crashes after printing Finished! I do not get any error message, and the data seem to have been processed correctly. The problem seems to lie in the line my $termEntry = $reader->copyCurrentNode(1); because I get the same crash after stripping the code down further:

use strict;
use warnings;
use XML::LibXML::Reader;

my $XMLfile = 'myfile.xml';
print "Parsing...\n";

my $reader = 'XML::LibXML::Reader'->new(location => $XMLfile) or die print "Error";
my $entry_pattern = 'XML::LibXML::Pattern'->new('/martif/text/body/termEntry');

while ($reader->nextPatternMatch($entry_pattern)) {
    my $termEntry = $reader->copyCurrentNode(1);
}
print "Finished!\n";

In this case the script goes through the whole XML, prints "Finished!", and then crashes. Note that if I delete the line my $termEntry = $reader->copyCurrentNode(1); perl no longer crashes.

Any idea?

I am on Windows 10 using perl 5.16

=================

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE martif SYSTEM "TBXcoreStructV02.dtd">
<martif type="TBX-Default" xml:lang="en">
  <martifHeader>
    <fileDesc>
      <sourceDesc>
      </sourceDesc>
    </fileDesc>
    <encodingDesc>
      <p type="XCSURI">TBXXCS.xcs</p>
    </encodingDesc>
  </martifHeader>
  <text>
    <body>
      <termEntry id="28836">
        <descripGrp>
          <descrip type="subjectField">04</descrip>
          <note>economics,</note>
        </descripGrp>
        <langSet xml:lang="en">
          <tig>
            <term>employment policy</term>
            <termNote type="termType">fullForm</termNote>
            <descrip type="reliabilityCode">3</descrip>
          </tig>
        </langSet>
        <langSet xml:lang="nl">
          <tig>
            <term>werkgelegenheidsbeleid</term>
            <termNote type="termType">fullForm</termNote>
            <descrip type="reliabilityCode">3</descrip>
          </tig>
        </langSet>
      </termEntry>
    </body>
  </text>
</martif>

Replies are listed 'Best First'.
Re: Perl panicking XML::LibXML::Reader copyCurrentNode
by choroba (Cardinal) on Apr 30, 2018 at 13:04 UTC
    It works for me. You haven't provided any input data, so I constructed my own:
    <root> <child/> </root>

    But even when trying to use an XML that matches the pattern, it still works without crashing:

    <?xml version="1.0" encoding="utf-8"?>
    <martif>
      <text>
        <body>
          <termEntry>1 2 3 4 5 6 7 8 9</termEntry>
          <termEntry>abcdefghijklmnopqrstuvwxyz</termEntry>
          <termEntry/>
          <termEntry/>
          <termEntry/>
          <termEntry/>
          <termEntry/>
          <termEntry/>
          <termEntry>
            <unknown1> </unknown1>
            <unknown1> </unknown1>
          </termEntry>
        </body>
      </text>
    </martif>

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re: Perl panicking XML::LibXML::Reader copyCurrentNode (updated)
by haukex (Archbishop) on Apr 30, 2018 at 13:07 UTC

    FWIW, I can't reproduce this on Linux + Perl 5.26 + XML::LibXML::Reader 2.0132, although the bug tracker shows quite a few bugs that could explain the problem. Have you tried upgrading your XML-LibXML? If that doesn't help, do you have the possibility to upgrade your entire Perl? (Strawberry Perl 5.16.3.1 comes with libxml2-2.9.0 and XML-LibXML 2.0014, while Strawberry Perl 5.26.2.1 comes with libxml2-2.9.4 and XML-LibXML 2.0132.)
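To compare an installation against the versions listed above, a short script along these lines may help. It is only a sketch: the helper name libxml_versions is made up for illustration, and it assumes XML::LibXML is installed and loads cleanly.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: gather the version numbers relevant to this thread.
sub libxml_versions {
    return (
        'perl'             => sprintf('%vd', $^V),
        'XML-LibXML'       => XML::LibXML->VERSION,
        # libxml2 version XML::LibXML was compiled against
        'libxml2 compiled' => XML::LibXML::LIBXML_DOTTED_VERSION(),
        # libxml2 version actually loaded at run time
        'libxml2 runtime'  => XML::LibXML::LIBXML_RUNTIME_VERSION(),
    );
}

my %versions = libxml_versions();
printf "%-18s %s\n", $_, $versions{$_} for sort keys %versions;
```

If the compiled and runtime libxml2 versions differ, that mismatch might be worth mentioning when reporting the crash.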

    It could of course in theory also be a problem with the input data, which you haven't shown - if you could provide a minimal sample that still causes your Perl to crash, that might help. See Short, Self-Contained, Correct Example.

    Update: The sample data you provided (Update 2: I hope I found the right TBXcoreStructV02.dtd file on the net) still doesn't cause it to crash on my machine, which means it's likely a problem with your version of XML-LibXML and/or libxml2 (or, IMO less likely, your Perl 5.16).

      Thank you for your feedback. While I was preparing a snippet of my input data, I discovered the following: if I delete from my input file the first two lines

      <?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE martif SYSTEM "TBXcoreStructV02.dtd">

      my Perl doesn't crash. (My XML-LibXML: 2.0125). Is this normal?

      ADDED: actually it is the second line which causes Perl to crash...as I do not have any TBXcoreStructV02.dtd, how can I jump this line? As the XML is huge, I do not want to manipulate it.

        Is this normal?

        Nope :-)

        As with choroba, even with the DTD, it works for me.

        as I do not have any TBXcoreStructV02.dtd, how can I jump this line? As the XML is huge, I do not want to manipulate it.

        Well, you could add load_ext_dtd=>0 to the XML::LibXML::Reader constructor, but that's really just a workaround, and I can't even say if it'll help fix your crashes. The better solution would be to obtain the right file(s) and to fix the crashes overall, for example by trying to upgrade your libraries.
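For illustration, the load_ext_dtd workaround could look roughly like the sketch below. It parses an inline, shortened copy of the sample data instead of myfile.xml, and the function name collect_terms is invented for this example. Whether the option avoids the crash on the affected Windows setup is untested here.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Shortened stand-in for the real file; the DTD named in the DOCTYPE
# is assumed not to exist on disk.
my $xml = join "\n",
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<!DOCTYPE martif SYSTEM "TBXcoreStructV02.dtd">',
    '<martif type="TBX-Default" xml:lang="en"><text><body>',
    '<termEntry id="28836"><langSet xml:lang="en"><tig>',
    '<term>employment policy</term>',
    '</tig></langSet></termEntry>',
    '</body></text></martif>';

# Hypothetical helper: return the text of every <term> in the document.
sub collect_terms {
    my ($xml_string) = @_;
    my $reader = XML::LibXML::Reader->new(
        string       => $xml_string,
        load_ext_dtd => 0,    # do not try to load TBXcoreStructV02.dtd
    ) or die "Cannot create reader\n";
    my $pattern = XML::LibXML::Pattern->new('/martif/text/body/termEntry');

    my @terms;
    while ($reader->nextPatternMatch($pattern)) {
        # Only copy when positioned on an opening element.
        next unless $reader->nodeType == XML_READER_TYPE_ELEMENT;
        my $entry = $reader->copyCurrentNode(1);
        push @terms, map { $_->textContent } $entry->findnodes('langSet/tig/term');
    }
    return @terms;
}

print "$_\n" for collect_terms($xml);
```

For the real file, string => $xml_string would be replaced by location => $XMLfile, as in the original script.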

        BTW, since you just added that in a third edit to your node, please see How do I change/delete my post? to avoid confusion.

        I downloaded the DTD from ttt.org in order to run the script (with the data shown in my previous reply), but it still doesn't crash.
