in reply to Memory problems parsing XML

Some suggestions:
  • Split your data file into smaller ones, either by really splitting it into smaller files (one for every <ppsarticle> block), or by reading it block by block:

        open my $in, '<', $filepath or die "Can't open $filepath: $!";
        my $Block;
        while (<$in>) {
            if (/<ppsarticle>/ and defined $Block) {
                Parse_XML($Block);
                undef $Block;
            }
            $Block .= $_;
        }
        Parse_XML($Block); # Don't forget the last block
        close $in;
  • Tie your MD5 hash to a file. This will save you a huge amount of memory, because the hash's contents live on disk:

        use GDBM_File; # my favorite, but there are others
        use Fcntl;
        tie %md5, 'GDBM_File', '/tmp/md5.tmp', O_RDWR|O_CREAT, 0600
            or die "Can't tie /tmp/md5.tmp: $!";
        # O_CREAT creates the file if it doesn't exist yet
  • If all this doesn't help, think about parsing the file yourself while reading it. It would take some time, but the structure is not very complex. I'd start with...

        use Digest::MD5 qw(md5);
        open my $in,  '<', $filepath    or die "Can't open $filepath: $!";
        open my $out, '>', $outfilepath or die "Can't open $outfilepath: $!";
        my $XML;
        while (<$in>) {
            $XML .= $_;
            if ($XML =~ s/(<ppsarticle>.*?<\/ppsarticle>)//s) {
                my $Article = $1;
                my $MD5SUM  = md5($Article); # binary digest is fine as a hash key
                next if $md5{$MD5SUM};       # skip duplicates
                $md5{$MD5SUM} = 1;
                print $out $Article;
            }
        }
        close $in;
        close $out;
  • A final note on $md5{$md5}: Perl can manage this, but the chance that you or another developer will sooner or later mix these up is close to 100%. Try to avoid using the same name for scalars, arrays, and hashes. You'll save yourself a lot of trouble and hours of searching for typos (even worse if both $X{1} and $X->{1} are valid).
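    To see why that last point bites, here is a hypothetical snippet (the values are made up): a hash %md5 and a hash *reference* in the scalar $md5 compile cleanly side by side even under strict, and the two lookups differ by two characters:

        use strict;
        use warnings;

        my %md5 = ( abc => 1 );    # the hash itself
        my $md5 = { abc => 2 };    # an unrelated hash reference with the same name

        print $md5{abc},   "\n";   # prints 1 - element of %md5
        print $md5->{abc}, "\n";   # prints 2 - element of the hashref in $md5

    Both lines are perfectly legal Perl, so no warning will ever tell you which one you meant.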

    (You still need to work a bit on the code samples, error handling for example, and actually test them against your data.)