Some suggestions:
  • Split your data into smaller pieces, either by actually splitting the file into smaller files (one for every <ppsarticle> block) or by reading the file block by block:
    open my $in, '<', $filepath or die "Cannot open $filepath: $!";
    my $Block = '';
    while (<$in>) {
        # A new <ppsarticle> starts: hand the previous block to the parser
        if (/<ppsarticle>/ and length $Block) {
            Parse_XML($Block);
            $Block = '';
        }
        $Block .= $_;
    }
    Parse_XML($Block) if length $Block;   # Don't forget the last block
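  • Tie your MD5 hash to a file. This will save you a huge amount of memory because the hash's content lives on disk:
    use GDBM_File;   # my favorite, but there are others
    tie %md5, 'GDBM_File', '/tmp/md5.tmp', GDBM_WRCREAT, 0600;
    # GDBM_File takes its own flags; GDBM_WRCREAT opens read/write and creates
    # the file if it is missing. Other DBM modules such as DB_File take the
    # Fcntl flags instead (use Fcntl; ... O_RDWR|O_CREAT).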
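  • If all this doesn't help, think about parsing the file yourself while reading it. It would take some time, but the structure is not very complex. I'd start with something like this:
    use Digest::MD5 qw(md5);
    open my $in,  '<', $filepath    or die "Cannot open $filepath: $!";
    open my $out, '>', $outfilepath or die "Cannot open $outfilepath: $!";
    my $XML = '';
    while (<$in>) {
        $XML .= $_;
        # Cut every complete article (tags included) out of the buffer
        while ($XML =~ s/(<ppsarticle>.*?<\/ppsarticle>)//s) {
            my $Article = $1;
            my $MD5SUM  = md5($Article);
            $md5{$MD5SUM} and next;   # already seen: skip the duplicate
            $md5{$MD5SUM} = 1;
            print $out $Article, "\n";
        }
    }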
  • A final note on $md5{$md5}: Perl can handle this, but the chances that you or another developer will mix the two up sooner or later are close to 100%. Try to avoid using the same name for scalars, arrays and hashes. You'll save yourself a lot of trouble and hours of searching for typos (it gets even worse when both $X{1} and $X->{1} are valid).
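
    To illustrate the naming note, here is a minimal sketch; the names %X, $X, %seen_digest and $digest are made up for this example and do not come from the original script. It shows that a hash and a scalar with the same name are two separate variables, which is exactly why they are easy to confuse:

```perl
use strict;
use warnings;

# Same name, two unrelated variables: legal Perl, but easy to misread.
my %X = (1 => 'hash element');        # the hash %X
my $X = { 1 => 'hashref element' };   # the scalar $X, holding a hash reference

print $X{1},   "\n";   # element of the hash %X
print $X->{1}, "\n";   # element of the hash that $X points to

# Distinct (hypothetical) names make the intent obvious at a glance.
my %seen_digest;
my $digest = 'd41d8cd98f00b204e9800998ecf8427e';   # placeholder: MD5 hex of the empty string
$seen_digest{$digest} = 1;
```

    A single missing or extra "->" silently switches between the two variables, so distinct names are the cheaper safeguard.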

    (You still need to work a bit on the code samples, closing files for example, and actually test them against your data.)


    In reply to Re: Memory problems parsing XML by Sewi
    in thread Memory problems parsing XML by leighgable
