leighgable has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I am looking for some guidance on this project below, which will involve parsing a directory of XML files one by one, checking for and eliminating duplicate "article" results.

I've managed to get Perl reading through a directory, grabbing and parsing files, and printing out data, but now I am running into input buffer memory problems with the function that eliminates duplicate elements. I'm guessing a variable is growing out of control, but I am flushing the twig, so I'm not sure where I've gone wrong. The program fails with the following message: "Ran out of memory for input buffer at /usr/lib/perl5/XML/Parser/Expat.pm line 469. at xml_result.pl line 35 at xml_result.pl line 35".

I put a tar archive of a sample of the data here. And here is the code.

Regards.

Leigh

#!/usr/bin/perl
# turn on perl safety features
use strict;
use warnings;

# initialize modules
use XML::Twig;
use Data::Dumper;
use DirHandle;
use Digest::MD5 qw(md5);               # import md5() so it can be called directly

# check for working directory
my $dir = $ARGV[0] or die "Must specify directory on command line";

my $filepath;                          # current file in the loop below
my %md5;                               # hash of digests used to eliminate duplicates

my @filepath_list = xmlfiles($dir);    # call xmlfiles subroutine to get
                                       # the list of files from the data dir

print "Processed files: \n";           # print list of processed files
foreach (@filepath_list) {
    print "$_\n";
}

foreach $filepath (@filepath_list) {
    (my $outfile = $filepath) =~ s{\.xml$}{.clean.xml};    # dest. file
    open( OUT, ">$outfile" ) or die "cannot create output file!";
    my $twig = XML::Twig->new( twig_handlers => { article => \&eliminate_dup } );
    $twig->parsefile($filepath);
    $twig->flush(\*OUT);               # save memory
    close OUT;                         # close file
}
exit;

sub xmlfiles {
    $dir = shift;
    print $dir, "\n";
    my $dh = DirHandle->new($dir) or die "can't open directory";
    return sort                        # sort pathnames
           grep { -f }                 # choose only files
           map  { "$dir/$_" }          # create full paths
           grep { !/^\./ }             # filter out dot files
           $dh->read();                # read all filenames
}

sub eliminate_dup {
    my( $t, $elt ) = @_;
    my $elt_text = $elt->sprint;       # get text and tags
    my $md5 = md5($elt_text);
    if( $md5{$md5} ) {                 # if md5 exists, remove element
        $elt->delete;
    }
    else {
        $md5{$md5} = 1;                # store md5
        $t->flush( \*OUT );            # flush memory
    }
}

Re: Memory problems parsing XML
by Sewi (Friar) on Aug 29, 2009 at 19:54 UTC
    Some suggestions:
  • Split your data file into smaller ones, either by really splitting the file into smaller files (one for every <ppsarticle> block), or by reading the file block by block yourself, for example:
    open infile, $filepath;
    my $Block;
    while (<infile>) {
        if (/\<ppsarticle\>/) {                    # start of a new article block
            &Parse_XML($Block) if defined $Block;  # hand the previous block to the parser
            undef $Block;
        }
        $Block .= $_;
    }
    &Parse_XML($Block);                            # Don't forget last block
  • Tie your MD5 hash to a file. This will save you a huge amount of memory because the hash's content lives on disk (a fuller version with error checking is sketched at the end of this reply):
    use GDBM_File;   # my favorite, but there are others
    use Fcntl;
    tie %md5, 'GDBM_File', '/tmp/md5.tmp', O_RDWR, 0600;
    # You may need to add |O_CREAT after O_RDWR
  • If all this doesn't help, think about parsing the file yourself while reading it. It would take some time, but the structure is not very complex. I'd start with...
    open infile, $filepath;
    open outfile, '>'.$outfilepath;
    my $XML;
    while (<infile>) {
        $XML .= $_;
        # pull a complete article out of the buffer as soon as we have one
        if ($XML =~ s/\<ppsarticle\>(.*?)\<\/ppsarticle\>//s) {
            my $Article = $1;
            my $MD5SUM  = md5($Article);
            $md5{$MD5SUM} and next;   # seen before - skip it
            $md5{$MD5SUM} = 1;
            print outfile $Article;
        }
    }
  • A final note on $md5{$md5}: Perl can manage this, but the chances that you or another developer will sooner or later mix this up are close to 100%. Try to avoid using the same name for scalars, arrays and hashes. You'll save yourself a lot of trouble and hours of searching for typos (it gets even worse when both $X{1} and $X->{1} are valid).
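    To illustrate the naming point, here is your eliminate_dup with the digest scalar and the dedup hash given distinct names (a rough, untested sketch; the logic is unchanged and it still relies on the global OUT filehandle from your main loop):

    use Digest::MD5 qw(md5);

    my %seen_md5;                           # dedup cache: one entry per article digest

    sub eliminate_dup {
        my( $t, $elt ) = @_;
        my $digest = md5( $elt->sprint );   # binary MD5 of the element's text and tags
        if( $seen_md5{$digest} ) {
            $elt->delete;                   # duplicate article - drop it
        }
        else {
            $seen_md5{$digest} = 1;         # remember it
            $t->flush( \*OUT );             # write out what we have and free memory
        }
    }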

    (You still need to work a bit on the code samples, closing the files for example, and actually test them against your data.)
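    And in case it helps, a slightly fuller version of the tie step that could replace the "my %md5;" line in your script (again untested; /tmp/md5.tmp is just an example path, and I used GDBM_WRCREAT from the GDBM_File docs so the file is created if it doesn't exist yet):

    use GDBM_File;

    # the dedup cache now lives on disk instead of in memory
    tie my %md5, 'GDBM_File', '/tmp/md5.tmp', &GDBM_WRCREAT, 0640
        or die "cannot tie md5 cache: $!";

    # ... run the twig loop exactly as before; eliminate_dup keeps using %md5 ...

    untie %md5;    # flush and close the dbm file when all files are processed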