naum has asked for the wisdom of the Perl Monks concerning the following question:

I have an application I inherited that takes a file FTPed from another internal company server (actually an OS/390 box). The file can be quite large and contains hundreds of separate sub-files, which I split out and mail as gzipped attachments.

The problem is that when the file is very large (all relative, but ~200-800M is what I'm talking about), at some point the "system" calls (used to do "mv" and "gzip"...) fail with -1 and/or memory on the box (HPUX 11) is exhausted. Each split-out file is compressed and emailed (via Net::SMTP::Multipart in the old process; in the new, reengineered process I chose MIME::Lite for the job...), then removed. At the end of the process, the original fileset is moved and compressed as well.

Since the process did not "use strict; use warnings" and was not properly constructed for failover & recovery (critical since the mail relay server is now on an entirely different host...), I've rewritten the process as 2 separate scripts and a library routine. It handled the volume and stress like a charm until I plugged the system calls back in (as opposed to a ksh calling the perl script for each split-out directory entry). That modification brought my machine to its knees; it recovered, but for a stretch of 10 minutes memory was exhausted.

Question to those of you with Perl powers of higher accordance and blessings: how can this "system" call memory leak be abated, or even diagnosed? I've already changed calls like 'system "rm ..."' to unlink, but would it also help performance to use a library like Compress::Zlib instead of 'system "gzip ..."' calls?
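
What I have in mind for the Compress::Zlib route is roughly the sketch below (untested, using the module's gzopen interface; the gzip_file name and 64K block size are just placeholders), reading and compressing a block at a time so the whole file never has to sit in memory:

use strict;
use warnings;
use Compress::Zlib;

# Sketch only: compress $infile to "$infile.gz" in-process, block by block.
sub gzip_file {
    my ($infile) = @_;
    open(my $IN, "< $infile") or die "Can't read $infile: $!";
    binmode $IN;
    my $gz = gzopen("$infile.gz", "wb")
        or die "Can't create $infile.gz: $gzerrno";
    my $buf;
    while (my $len = read($IN, $buf, 64 * 1024)) {
        $gz->gzwrite($buf) == $len or die "gzwrite failed on $infile";
    }
    $gz->gzclose;
    close $IN;
    # mimic gzip's default behaviour of removing the original
    unlink($infile) or warn "Can't remove $infile: $!";
}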


Replies are listed 'Best First'.
Re: How to find a memory leak - appears to be "system" calls that are responsible
by Abigail-II (Bishop) on May 15, 2004 at 02:11 UTC
    Well, if an external program fails due to lack of memory, that doesn't mean the external program is at fault. If there's only a little memory left, the external program will fail, but something else is to blame.

    You don't show code, you don't tell us which commands (or arguments) are being called, and you don't tell us how many are called. So it's just guessing. But when you mention "files of 200-800M" and memory problems in the same sentence, I immediately start to wonder whether you suck the entire file into memory. That will take a lot of memory - which might cause other programs to fail.

    Abigail

      Basically, the file is split on header records that start with '##'; each sub-file is "cut" out and placed into an output directory.

      For initial testing as I built this script up, a simple ksh read through the directory and submitted another perl script to format, mail & compress each entry. That process handled the stress test very well.

      Splitter code

      # Splits the inbound file on '##A' header records; each sub-file is
      # written out to $prepdir and handed to MailPush via system().
      sub endEmailPackage {
          my ($SPLITOUT, $splitoutfilename) = @_;
          print $SPLITOUT endLine();
          close $SPLITOUT;
          my $subrc = system("MailPush $splitoutfilename");
          if ($subrc == 0) {
              logit("MailPush $splitoutfilename submitted successfully");
          }
          else {
              logit("Bad return code on submission of MailPush $splitoutfilename, return code is $?");
          }
          sleep 2;
      }

      sub endLine {
          return '##END' . (' ' x 75) . "\n";
      }

      sub scrubHeaderParm {
          my ($href) = @_;
          foreach my $k (keys %{$href}) {
              $href->{$k} =~ s/^\s+//;
              $href->{$k} =~ s/\s+$//;
          }
      }

      sub splitupFile {
          my ($INFILE) = @_;
          seek $INFILE, 0, 0;
          my $SPLITOUT;
          my $splitoutfilename;
          while (<$INFILE>) {
              if (/^##A/) {
                  my %hopt = /$headerregex/;
                  logit($_);
                  scrubHeaderParm(\%hopt);
                  foreach my $k (keys %hopt) {
                      logit("$k: $hopt{$k}");
                  }
                  # Close out and mail the previous sub-file before starting a new one.
                  endEmailPackage($SPLITOUT, $splitoutfilename) if $addrectot > 0;
                  $addrectot++;
                  $splitoutfilename = "$prepdir/$hopt{ID}.$hopt{BATCHID}.$$";
                  open($SPLITOUT, "> $splitoutfilename") or alert("$!");
                  logit("Writing splitup output to $splitoutfilename");
              }
              print $SPLITOUT $_ unless /^##END/;
          }
          # Flush and mail the final sub-file.
          endEmailPackage($SPLITOUT, $splitoutfilename) if $addrectot > 0;
      }

      Mailer code loop

      open(INFILE, "< $infile") or alert("$!"); logit("Opening $infile for reading"); my $datequal = strftime('%m%d%C%y%H%M%S', localtime()); my $ofilename = "$hopt{TPID}.$hopt{BATCHID}.$datequal.txt"; my $ofilepath = "$outdir/$ofilename"; open(my $AFILE, "> $ofilepath") or alert("$!"); logit("Opening $ofilepath for writing"); while (<INFILE>) { writeAFileOut($AFILE, $_); } close $AFILE; compressAFile(); if (deliverAPackage()) { sleep 2; my $rc; $rc = system("mv $infile $arcdir"); logit("Return code of $rc after move of $infile to $arcdir"); my $bfile = basename $infile; $rc = system("/usr/contrib/bin/gzip $arcdir/$bfile"); logit("Return code of $rc after gzip of $arcdir/$bfile"); unlink($ofilepath); } sub compressAFile { logit("Compressing $ofilepath"); my $gziprc = system("/usr/contrib/bin/gzip -f -n $ofilepath"); logit("Return code $gziprc after gzip of $ofilepath"); alert("Unable to compress $ofilepath") if ($gziprc); $ofilename = $ofilename . ".gz"; $ofilepath = "$outdir/$ofilename"; } sub deliverAPackage { my $templatefile = "$templatedir/$hopt{EDITYPE}"; alert("Failed to load template $templatefile") unless (-e $templat +efile); my $body = `cat $templatefile`; $body .= "\n\n"; $body .= "Effective Date: $hopt{DATE} \n" if ($hopt{DATE} =~ /\S+/ +); $body .= "Admin: $hopt{ADMIN}\n"; $body .= "Email: $hopt{ADMEML}\n\n"; $body .= `cat $defaulttemplate`; $subject = "$subject - $hopt{TPID}"; my $mailrc = sendEmail($hopt{EMAIL}, $subject, $body, $ofilepath, +$hopt{FILENAME}, $hopt{EXT}); return $mailrc; } sub scrubHeaderOpt { my ($href) = @_; foreach (keys %{$href}) { $href->{$_} =~ s/^\s+//; $href->{$_} =~ s/\s+$//; } $href->{EDITYPE} = substr($href->{ID}, 0, 3); $href->{EDITYPE} .= $href->{TYP} if $href->{TYP}; $href->{TPID} = substr($href->{ID}, 3); } sub writeAFileOut { my ($OFILE, $data) = @_; return if ($data =~ /^##ADD/ && $removeaddsw eq 'Y'); return if ($data =~ /^##END/); $data =~ s/\n/\r\n/g; print $OFILE $data; }

      Bottom line: if the system call to MailPush is omitted from endEmailPackage() and a ksh instead simply loops through all the files in the directory, it runs like a charm. With the system call in place, memory is slurped so hard that the "top" command fails with "not enough memory"...

      Also, the existing process had both functions together, and for input files over 10M the system("gzip ...") and system("mv ...") calls would at some point fail with a -1 return code - again, memory exhaustion. The problem was alleviated somewhat when I replaced system("rm ...") with unlink, but it still pops up intermittently, especially on >100M input files.
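
      (For what it's worth, perldoc -f system distinguishes the command never starting at all - system returns -1 and $! holds the reason, which on a memory-starved box is likely a failed fork - from the command running and exiting non-zero. A sketch of that check around the gzip call:)

      # Sketch: tell "couldn't start the command" apart from "command ran and failed".
      my $rc = system("/usr/contrib/bin/gzip", "-f", "-n", $ofilepath);   # list form: no shell
      if ($rc == -1) {
          # fork/exec itself failed; $! says why (often ENOMEM when the box is starved)
          logit("gzip could not be started: $!");
      }
      elsif ($? & 127) {
          logit("gzip killed by signal " . ($? & 127));
      }
      elsif ($? >> 8) {
          logit("gzip exited with status " . ($? >> 8));
      }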

        ... if the system call to MailPush is omitted ... it runs like a charm.

        Sounds like a problem with MailPush, then, not with gzip or mv. My guess is that MailPush is trying to be helpful by daemonizing itself to send the mail and returning immediately. Then it up and loads the whole file into memory before mailing it.

        Try replacing MailPush with cat $splitoutfilename > /dev/null or something similar. If that works, try replacing it with Mail::Mailer or something similar. If that works, you're done! If it doesn't, use top to investigate while parsing a smaller file, one that doesn't completely hose the system.
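
        Since you mention already having MIME::Lite in the reengineered process, sending the attachment in-process would look roughly like this - an untested sketch, with the mail host and From address as placeholders and the variable names borrowed from your mailer code:

        use MIME::Lite;

        # Sketch: mail one gzipped split-out file as an attachment, no MailPush child.
        my $msg = MIME::Lite->new(
            From    => 'edi-admin@example.com',   # placeholder
            To      => $hopt{EMAIL},
            Subject => $subject,
            Type    => 'multipart/mixed',
        );
        $msg->attach(Type => 'TEXT', Data => $body);
        $msg->attach(
            Type        => 'application/x-gzip',
            Path        => $ofilepath,
            Filename    => $ofilename,
            Disposition => 'attachment',
        );
        $msg->send('smtp', 'mailhost', Timeout => 60)   # 'mailhost' is a placeholder relay
            or logit("Mail send failed for $ofilepath");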

Re: How to find a memory leak - appears to be "system" calls that are responsible
by TilRMan (Friar) on May 15, 2004 at 02:10 UTC

    I find it unlikely that gzip and mv would be such memory hogs (even on HP-UX ;-). Perhaps the script is doing something resource-unfriendly, like foreach (<FILE>), or perhaps the system()s are being kicked off with "&" to put them in the background.
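
    For example (process() here is just a stand-in), the difference between these two loops is the whole file held in memory at once versus one line at a time:

    # Builds a list of every line up front -- an 800M file means 800M+ of RAM:
    foreach my $line (<FILE>) {
        process($line);
    }

    # Reads a single line per iteration -- memory use stays flat:
    while (my $line = <FILE>) {
        process($line);
    }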

    Use top(1) to see how many processes are running and how much memory they are using. If indeed the command utilities are causing problems, you might find more recent versions on the HP-UX porting center.

Re: How to find a memory leak - appears to be "system" calls that are responsible
by tachyon (Chancellor) on May 15, 2004 at 04:15 UTC

    You don't post code, so I'll speak in generalities. It will almost inevitably be faster and more memory-efficient to use system tools like gunzip, tar and rm directly than to do the same in Perl. Highly optimised C should beat Perl every time, in both memory and speed terms; there is simply less overhead, and at the end of the day things like Compress::Zlib are just wrappers around the standard system libraries and calls. Tools like top, Devel::DProf, Devel::Leak and Devel::LeakTrace are available to diagnose issues.
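
    For instance, a rough (untested) way to bracket a suspect section with Devel::Leak and see whether the count of live SVs keeps growing across iterations:

    use Devel::Leak;

    # Sketch: a steadily growing SV count points at a Perl-level leak;
    # a flat count points the finger elsewhere (e.g. the spawned commands).
    my $handle;
    my $before = Devel::Leak::NoteSV($handle);

    run_suspect_code();   # placeholder for the splitter/mailer loop

    my $after = Devel::Leak::CheckSV($handle);   # also dumps any new SVs
    print "SVs: $before before, $after after\n";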

    If you post example code I am sure you will get lively interest.

    cheers

    tachyon

      gzip and tar, almost certainly, but I'd bet unlink() is much faster than rm. As a shell command, rm means a fork() -- or two if a shell gets involved! And of course rm must ultimately do the unlink() as well.
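
      That is, roughly (sketch):

      system("rm $file");                              # one fork (two if a shell gets involved); rm then calls unlink() anyway
      unlink($file) or warn "Can't unlink $file: $!";  # straight to the syscall, no child process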

      On the flip side, my first impression on gzip vs Compress::Zlib was dead wrong. ++ for setting me straight.

      Ah, the trauma of learning Perl . . .

        That's a good point about rm. I tend to use rm -rf /dir/path/* for convenience. The first operation is a shell expansion on the files/dirs, which are then fed to rm as a single list, so the worst case is 2 operations. There are real limitations on shell expansion with long lists (yes, I do know the workarounds). Anyway, in this context I am sure (but untested, so I'm not THAT sure ;-) it would be way faster than unlink/rmdir with File::Find. For single files or known lists I do use unlink. Now I have a rationale! Thanks. The pure-Perl contender I'd benchmark it against is sketched below.
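
        (Untested sketch; note that unlike rm -rf /dir/path/*, this removes the directory itself as well:)

        use File::Path qw(rmtree);

        # Recursive delete without a shell or fork; whether it actually
        # beats "rm -rf" on a given box is exactly the untested part.
        rmtree('/dir/path', 0, 0);   # args: root(s), verbose flag, safe flag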

        cheers

        tachyon