leighgable has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I am looking for some guidance on this project below, which will involve parsing a directory of XML files one by one, checking for and eliminating duplicate "article" results.

I've managed to get Perl reading through a directory, grabbing and parsing files, and printing out data, but now I am running into input buffer memory problems with the function that eliminates duplicate elements. I'm guessing a variable is growing out of control, but I am flushing the twig, so I'm not sure where I've gone wrong. The program fails with the following message: "Ran out of memory for input buffer at /usr/lib/perl5/XML/Parser/Expat.pm line 469. at xml_result.pl line 35 at xml_result.pl line 35".

I put a tar archive of a sample of the data here. And here is the code.

Regards.

Leigh

#!/usr/bin/perl
# turn on perl safety features
use strict;
use warnings;

# initialize modules
use XML::Twig;
use Data::Dumper;
use DirHandle;
use Digest::MD5 qw(md5);               # import md5() so it can be called directly

# check for working directory
my $dir = $ARGV[0] or die "Must specify directory on command line";

my $filepath;                          # current file in the loop below
my %md5;                               # hash of digests used to eliminate duplicates

my @filepath_list = xmlfiles($dir);    # call xmlfiles subroutine to get
                                       # the list of files from the data dir

print "Processed files: \n";           # print list of processed files
foreach (@filepath_list) {
    print "$_\n";
}

foreach $filepath (@filepath_list) {
    (my $outfile = $filepath) =~ s{\.xml$}{.clean.xml};    # dest. file
    open( OUT, ">$outfile" ) or die "cannot create output file!";
    my $twig = XML::Twig->new( twig_handlers => { article => \&eliminate_dup } );
    $twig->parsefile($filepath);
    $twig->flush(\*OUT);               # save memory
    close OUT;                         # close file
}
exit;

sub xmlfiles {
    $dir = shift;
    print $dir, "\n";
    my $dh = DirHandle->new($dir) or die "can't open directory";
    return sort                        # sort pathnames
           grep { -f }                 # choose only files
           map  { "$dir/$_" }          # create full paths
           grep { !/^\./ }             # filter out dot files
           $dh->read();                # read all filenames
}

sub eliminate_dup {
    my( $t, $elt ) = @_;
    my $elt_text = $elt->sprint;       # get text and tags
    my $md5 = md5($elt_text);
    if( $md5{$md5} ) {                 # if md5 exists, remove element
        $elt->delete;
    }
    else {
        $md5{$md5} = 1;                # store md5
        $t->flush( \*OUT );            # flush memory
    }
}

Re: Memory problems parsing XML
by Sewi (Friar) on Aug 29, 2009 at 19:54 UTC
    Some suggestions:
  • Split your data file into smaller ones, either by really splitting the file into smaller files (one for every <ppsarticle> block), or by reading the file block by block yourself, for example:
    open infile, $filepath;
    my $Block;
    while (<infile>) {
        if (/\<ppsarticle\>/) {                    # start of a new article block
            &Parse_XML($Block) if defined $Block;  # hand the previous block to the parser
            undef $Block;
        }
        $Block .= $_;
    }
    &Parse_XML($Block);                            # Don't forget last block
  • Tie your MD5 hash to a file. This will save you a huge amount of memory because the hash's content lives on disk (a fuller version with error checking is sketched at the end of this reply):
    use GDBM_File;   # my favorite, but there are others
    use Fcntl;
    tie %md5, 'GDBM_File', '/tmp/md5.tmp', O_RDWR, 0600;
    # You may need to add |O_CREAT after O_RDWR
  • If all this doesn't help, think about parsing the file yourself while reading it. It would take some time, but the structure is not very complex. I'd start with...
    open infile, $filepath;
    open outfile, '>'.$outfilepath;
    my $XML;
    while (<infile>) {
        $XML .= $_;
        # pull a complete article out of the buffer as soon as we have one
        if ($XML =~ s/\<ppsarticle\>(.*?)\<\/ppsarticle\>//s) {
            my $Article = $1;
            my $MD5SUM  = md5($Article);
            $md5{$MD5SUM} and next;   # seen before - skip it
            $md5{$MD5SUM} = 1;
            print outfile $Article;
        }
    }
  • A final note on $md5{$md5}: Perl can manage this, but the chances that you or another developer will sooner or later mix this up are close to 100%. Try to avoid using the same name for scalars, arrays and hashes. You'll save yourself a lot of trouble and hours of searching for typos (it gets even worse when both $X{1} and $X->{1} are valid).
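    To illustrate the naming point, here is your eliminate_dup with the digest scalar and the dedup hash given distinct names (a rough, untested sketch; the logic is unchanged and it still relies on the global OUT filehandle from your main loop):

    use Digest::MD5 qw(md5);

    my %seen_md5;                           # dedup cache: one entry per article digest

    sub eliminate_dup {
        my( $t, $elt ) = @_;
        my $digest = md5( $elt->sprint );   # binary MD5 of the element's text and tags
        if( $seen_md5{$digest} ) {
            $elt->delete;                   # duplicate article - drop it
        }
        else {
            $seen_md5{$digest} = 1;         # remember it
            $t->flush( \*OUT );             # write out what we have and free memory
        }
    }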

    (You still need to work a bit on the code samples, closing the files for example, and actually test them against your data.)
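    And in case it helps, a slightly fuller version of the tie step that could replace the "my %md5;" line in your script (again untested; /tmp/md5.tmp is just an example path, and I used GDBM_WRCREAT from the GDBM_File docs so the file is created if it doesn't exist yet):

    use GDBM_File;

    # the dedup cache now lives on disk instead of in memory
    tie my %md5, 'GDBM_File', '/tmp/md5.tmp', &GDBM_WRCREAT, 0640
        or die "cannot tie md5 cache: $!";

    # ... run the twig loop exactly as before; eliminate_dup keeps using %md5 ...

    untie %md5;    # flush and close the dbm file when all files are processed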