Kal87 has asked for the wisdom of the Perl Monks concerning the following question:

Hi there, I have a shell script that in turn calls a perl script. The perl script reads a file containing thousands of lines (let's call it XMLList.txt); each line is the absolute path of an XML file. The perl script takes each line as its input, opens that file and parses some nodes via XML::Twig.

The script seems to be working okay when XMLList.txt contains, say 10-15 lines.

Anything more and the perl script seems to abruptly stop parsing half-way through, and the shell script (which calls the perl script) returns a "27000838 Segmentation fault(coredump)" error.

Should I take this to be a memory handling problem with my perl script? If yes, how can I optimize my perl script to better handle memory during the parsing?

One thing to note is that XMLList.txt can contain about 5000 lines, and each XML file on that list could be, say, about 20 MB.

Here's the code for the perl script:
#!/usr/bin/perl
use warnings;
use strict;

use Text::CSV_XS;
use XML::Twig;

my $csv = 'Text::CSV_XS'->new({ sep_char => '|' });

sub process_EDI_DC40 {
    my ($twig, $thingy) = @_;
    my @values = map {
        my $ch = $thingy->first_child($_);
        $ch ? $ch->text : ""
    } qw( DOCNUM MESTYP SNDPRN RCVPOR RCVPRN );
    unshift @values, 'XML';
    $csv->say(*STDOUT, \@values);
}

my $listfile = shift;
open my $list, '<', $listfile or die $!;

my $twig = 'XML::Twig'->new(
    twig_handlers => { EDI_DC40 => \&process_EDI_DC40 },
);

while (my $xmlfile = <$list>) {
    chomp $xmlfile;
    $twig->parsefile($xmlfile);
}

Re: Segmentation coredump issue
by hippo (Archbishop) on May 04, 2018 at 11:22 UTC

    Two quick things you could try: move the declaration/instantiation of $twig inside the while loop so it doesn't persist between files, and check that you are running the latest versions of both modules used.
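    A minimal sketch of the first suggestion, assuming the handler and list filehandle from the original script:

        # Build a fresh XML::Twig for each file so no parser state
        # persists between parses; the previous object can be freed.
        while (my $xmlfile = <$list>) {
            chomp $xmlfile;
            my $twig = XML::Twig->new(
                twig_handlers => { EDI_DC40 => \&process_EDI_DC40 },
            );
            $twig->parsefile($xmlfile);
        }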

    Does the segfault happen at precisely the same point on every run?

      Indeed, it is quite possible that the problem originates within Twig, but you may or may not find it. If you re-create the object each time within the loop and the program then runs to completion, it would be worthwhile to at least note it as a bug. It would also be interesting to see (if re-creating the object each time does circumvent the problem) whether repeatedly parsing the same file through a single instance leads to the same core-dump behaviour, because that would probably be reproducible by the package maintainer.
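      A hypothetical repro along those lines ('sample.xml' is a placeholder for one of the real files):

          # Parse one known-good file many times through a single
          # XML::Twig instance and see whether the crash reappears.
          my $twig = XML::Twig->new(
              twig_handlers => { EDI_DC40 => \&process_EDI_DC40 },
          );
          $twig->parsefile('sample.xml') for 1 .. 5000;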
Re: Segmentation coredump issue
by choroba (Cardinal) on May 04, 2018 at 14:51 UTC
    Does it still crash if you call
    $twig->purge;
    after having parsed the file?
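    That is, a sketch of the placement being suggested, using the loop from the original script:

        while (my $xmlfile = <$list>) {
            chomp $xmlfile;
            $twig->parsefile($xmlfile);
            $twig->purge;    # drop the tree built so far to release memory
        }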
      @choroba: yes, it still crashes. I tried adding a $twig->purge after the parse as you suggested. I also incorporated the STDERR logging suggested by @bliako to check whether a bad file was causing the core dump.

      The processing does seem to halt on the same file, but I wouldn't think this is a case of bad data: this one file (which is about 60 MB) contains multiple XMLs, and the point at which the twig halts is no different in structure from the previous XML in the same file, which was processed successfully. Let me know if this makes sense.

      My guess is that some form of memory limit is being reached by the time the twig gets to this XML within the file, which results in the core dump. I would love to be proven wrong.

      So @choroba, is there a better way to write the code, maybe with twig_roots instead of twig_handlers (see the sketch after the code below)? Still getting to grips with XML::Twig, so I would love to know what you think. I also hear XML::SAX is well optimized for memory when it comes to XML parsing. Are there any other modules that would give a better solution for this requirement? Here's the code so far:
      #!/usr/bin/perl
      use warnings;
      use strict;

      use Text::CSV_XS;
      use XML::Twig;

      my $csv = 'Text::CSV_XS'->new({ sep_char => '|' });

      sub process_EDI_DC40 {
          my ($twig, $thingy) = @_;
          my @values = map {
              my $ch = $thingy->first_child($_);
              $ch ? $ch->text : ""
          } qw( DOCNUM MESTYP SNDPRN RCVPOR RCVPRN );
          unshift @values, 'XML';
          $csv->say(*STDOUT, \@values);
      }

      my $listfile = shift;
      open my $list, '<', $listfile or die $!;

      my $twig = 'XML::Twig'->new(
          twig_handlers => { EDI_DC40 => \&process_EDI_DC40 },
      );

      my $fcount = 1;
      while (my $xmlfile = <$list>) {
          chomp $xmlfile;
          print STDERR "$0 : about to process file # $fcount = '$xmlfile'\n";
          $twig->parsefile($xmlfile);
          print STDERR "$0 : file '$xmlfile' processed OK.\n";
          $fcount++;
          $twig->purge;
      }
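      For reference, a hedged sketch of the twig_roots variant being asked about, reusing process_EDI_DC40 from the script above (whether it avoids the crash is not confirmed in this thread):

          # twig_roots builds only the EDI_DC40 subtrees rather than the
          # whole document tree, which keeps the in-memory tree small.
          my $twig = XML::Twig->new(
              twig_roots => { EDI_DC40 => \&process_EDI_DC40 },
          );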
        I tried moving the $twig->purge into the handler sub (right after $csv->say(*STDOUT, \@values);), as sketched below, and this seems to have done the trick (although the script now appears to parse more slowly than before).
        I will run a few other tests and post my update in a day. Thanks for all your assistance!
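        For clarity, a sketch of the handler with the purge placed as described, the rest of the script unchanged:

            sub process_EDI_DC40 {
                my ($twig, $thingy) = @_;
                my @values = map {
                    my $ch = $thingy->first_child($_);
                    $ch ? $ch->text : ""
                } qw( DOCNUM MESTYP SNDPRN RCVPOR RCVPRN );
                unshift @values, 'XML';
                $csv->say(*STDOUT, \@values);
                $twig->purge;    # free everything parsed so far before moving on
            }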
Re: Segmentation coredump issue
by marto (Cardinal) on May 04, 2018 at 11:17 UTC

    What does perl -v return? Which versions of the modules are you using? Could be a bug in your perl version, or an XS module.
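    For instance, a quick one-liner to print the perl and module versions together:

        perl -MXML::Twig -MText::CSV_XS -le 'print "$], $XML::Twig::VERSION, $Text::CSV_XS::VERSION"'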

      perl -v returns v5.10.1

      Text::CSV_XS -- v1.35

      XML::Twig -- v3.52

        That version of perl is pretty old. Can you replicate this by running a known crashing example against a modern perl build? Failing that, can you provide your shell script, the perl script, and an XML file known to cause the problem (with sensitive data replaced or removed), and I'll try to replicate the issue.

Re: Segmentation coredump issue
by bliako (Abbot) on May 04, 2018 at 20:25 UTC

    As others here have said (hippo), you need to clarify whether it happens with the same file every time.

    You say:

    The script seems to be working okay when XMLList.txt contains, say 10-15 lines.
    
    Anything more and the perl script seems to abruptly stop parsing half-way through, and the shell script (which calls the perl script) returns a "27000838 Segmentation fault(coredump)" error.
    

    But maybe there is a bad file half-way through the list which is never reached when you process only the first 10-15 files.

    So, modify your code to print the filename each time, and also the number of files processed so far, like this:

    my $fcount = 1;
    while (my $xmlfile = <$list>) {
        chomp $xmlfile;
        print STDERR "$0 : about to process file # $fcount = '$xmlfile'\n";
        $twig->parsefile($xmlfile);
        print STDERR "$0 : file '$xmlfile' processed OK.\n";
        $fcount++;
    }

    Note: it is not necessarily a case of either bugs or bad file(s). It could be both bugs and bad file(s).