Ardemus has asked for the wisdom of the Perl Monks concerning the following question:

UPDATE: I haven't confirmed this yet, but xdg mentioned that versions 5.8.1 up to (but not including) 5.8.3 have a memory leak in RegEx. It seems likely that this is my problem.

I'll update after I've upgraded and tested my code.

--------------------------------------------------------

Hello folks. I'm exhausted, so please forgive me if this isn't well presented:

I have two test files. A data file of about 150,000 lines and a temp file with about 800 lines, each containing 3 values to use in my search expression (I only use two of them).

This code is intended to perform the substitutions and insertions, then save out the file to a new location.

Unfortunately, it consumes all of my physical memory in a few seconds, then chugs along until Windows kills it at about 2GB.

Although the code isn't as pretty or concise as it could be, is there anything here that can be changed to prevent this gross overuse of memory? It's as if the program is making a complete copy of the data for every iteration.

#! perl
open (XAPFILE, 'C:\Documents and Settings\Nick\Desktop\040408 Work Files\stranger.xap') or die;
my $file = join ("",<XAPFILE>);
my ($Cue, $Sound, $Pri);
my $r_file = \$file;
open (PRIFILE, '<C:\temp.txt') or die;
open (OUTFILE, '>C:\stranger.xap') or die;
while (<PRIFILE>){
    {
        ($Cue, $Sound, $Pri) = split ("\t");
        chomp ($Pri);
        if ($$r_file =~ s/(Sound\s*\{\s*
                           Name\ =\ $Sound;\s*
                           Priority\ =\ )([\d\D]*?);/$1$Pri;/xm){
            print "Update $Sound > $Pri \n";
            #print "'$1$Pri;'\n";
        }elsif ($$r_file =~ s/(Sound\s*\{(\s*)
                               Name\ =\ $Sound;\s*$)/$1$2Priority\ =\ $Pri;\n/xm){
            #print "$1$2Priority = $Pri;\n";
            #print "1'$1'\n2'$2'\n3'$3'\n4'$4'\n";
            print "Add $Sound > $Pri \n";
        }else{
            print "ERROR $Cue > $Sound > $Pri \n";
        }
    }
}
print OUTFILE $file;

Any thoughts would be greatly appreciated. I've been trying to teach myself references, OO, modules, etc. to make a tool. That's been my life for the past 4 days. This is just a hack to get the task done on my deadline, and I'm at my wits' end trying to make it finish.

FYI - The data that is found is never more than four lines long, and each line is only a few dozen characters at most. I don't care if it takes 10 minutes to process, I just want it to finish.

Thanks

Nick


Replies are listed 'Best First'.
Re: RegEx on 4MB file consumes of 2GB of ram before windows shuts it down.
by tachyon (Chancellor) on Apr 12, 2004 at 07:25 UTC

    The major issue is probably [\d\D]*? which matches everything (ie digits and non digits == everything) and then backtracks. Anyway your REs are a bit ugly. OK really ugly. There is also absolutely no reason to use $r_file as you never pass $file outside of 'main'. Also you have a useless set of braces.

    If you post what you want to do ie input/change file/output then I am sure you will get plenty of help. We need to see the exact input data, exact change file data, and desired output.

    cheers

    tachyon

      I agree that the RE is ugly -- for example [\d\D] could just be expressed as . with the /s modifier (a plain . will not match a newline) -- but the *? is a lazy quantifier, so it won't match everything and then backtrack.
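
      A small sketch of the two points above, using invented sample strings rather than the poster's data: it checks when . and [\d\D] match a newline, and compares what a lazy and a greedy quantifier capture before a semicolon.

```perl
use strict;
use warnings;

# Hypothetical sample strings, not taken from the poster's files.
my $multi = "a\nb";
my $dot_plain   = $multi =~ /a.b/      ? 1 : 0;   # . does not match "\n" by default
my $dot_single  = $multi =~ /a.b/s     ? 1 : 0;   # /s lets . match "\n"
my $digit_class = $multi =~ /a[\d\D]b/ ? 1 : 0;   # [\d\D] matches any character

my $line = "x = 1; y = 2;";
my ($lazy)   = $line =~ /=\s(.*?);/;   # lazy: stops at the first ";"
my ($greedy) = $line =~ /=\s(.*);/;    # greedy: runs on to the last ";"

print "plain=$dot_plain single=$dot_single class=$digit_class\n";
print "lazy='$lazy' greedy='$greedy'\n";
```

      So swapping [\d\D] for . only keeps the same meaning if /s is added, and the lazy form is the one that stays inside a single property.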

      (To the poster -- I highly recommend picking up the book "Mastering Regular Expressions" if you're going to use regular expressions on a regular basis.)

      update: To the poster, a couple questions. Do you get any output from this or does it just die? (You're printing to STDOUT, not your OUTFILE anyway, so if you get no output at all, then this is dying before it ever hits a print statement.) How big is stranger.xap in bytes, not lines? Have you run this code using the perl debugger and seen where/how it blows up?

      Looking at your code, it looks like you're pulling all of your XAPFILE into $file. Is there a reason you need to do that versus going line by line and saving into a new output file rather than overwriting your old one? Your regular expressions are ending with newlines anyway. (Data spans multiple lines and is semicolon terminated? Is that why you're using [\d\D]*?; in your RE or is your priority just a short number?)

      It would definitely help to post a few examples of data from stranger.xap and temp.txt and explain what you're trying to do to make sense of your code and help you out.

      update 2: Looking closer at your RE's, I think they aren't doing what you think. Look at the two RE's -- just the matching part -- next to each other. (I've replaced the "\ " with "\s" for clarity here:)

      (Sound\s*\{\s*Name\s=\s$Sound;\s*Priority\s=\s)([\d\D]*?);
      (Sound\s*\{(\s*)Name\s=\s$Sound;\s*$)

      These differ in a couple ways. Obviously, the first is looking for the case where a "Priority" exists, but the second also is capturing any potential spaces at a certain point with (\s*). Why? Also, the second one terminates at an end of line (since you use the "/m" modifier and have a "$"), but the first terminates just at the semicolon. Again, why? Do these lines wind up differing by more than just whether or not a Priority exists?

      Then look at your replacement parts, again next to each other.

      $1$Pri;
      $1$2Priority\ =\ $Pri;\n

      First of all, those "\ " in the second line will show up literally in the replacement -- not what you want. You're also only terminating with a "\n" on the second one -- that doesn't seem right, either. And the "$1$2" on the second line is only necessary because you're capturing those spaces. Overall, making some assumptions about what you're trying to do, it seems like these two could be combined pretty easily as:

      s/(Sound\s*\{\s*Name\s=\s$Sound).*?;/$1; Priority = $Pri;/m

      I still think that probably won't do what you want, as I don't understand how your data spans lines (or not) or really what your data format is, but I think the RE above is a cleaner version of what your existing RE's are trying to do.

      -xdg

      Code posted by xdg on PerlMonks is public domain. It has no warranties, express or implied. Posted code may not have been tested. Use at your own risk.

        Thanks for looking into it.

        1) The code does not die. It consumes memory until Windows decides not to allow any more. Then I get a Windows XP dialog box that says Perl has been naughty and will be closed.

        2) Data spans multiple lines, but the priority is just a single 1 byte integer. [\d\D]*? could be replaced with \d*? in this case. However the RegEx was made to be generic, and I just plugged some values into it (some properties contain any character and can span multiple lines).

        3) You're right, the two RE's differ, and they're meant to handle the two cases where a priority line does and does not exist. First, this is the data format:

        Sound
        {
            Name = example;
            Priority = 99;
            etc...
            {
                (possible sub sections)
            }
            etc...

        The whitespace may vary.

        That may make things more clear. In the first RE I can simply replace the number, so I have "$1$Pri;". However, in the second, I need to add a new line, including the correct white space. $1 contains the data up to the point where I want to add the line (that's why I force it to end at the line). $2 contains a carriage return and the leading white space before "Name". So adding it after $1 duplicates the whitespace for my new line. Then I add the new priority line and a carriage return.

        It may be true that the whitespace is off (and the backslashes show up in my resulting data). However, that is something that I can easily debug once I start getting output results from the program.

        My primary concern is the massive memory consumption, which prevents the program from finishing. It seems as if there is some sort of memory leak where, for example, Perl is creating the $`$&$' variables and then not destroying them for the next iteration.

        I could be wrong, and it could be by design, but what I'm hoping for is a way to get Perl to do this task without using more than 2 gigabytes of RAM.

        Thanks for the feedback, do you have any idea why this script is using so much memory (and how to avoid it)?

        Thanks,

        Nick

Re: RegEx on 4MB file consumes of 2GB of ram before windows shuts it down.
by Art_XIV (Hermit) on Apr 12, 2004 at 13:43 UTC

    The fact that you're slurping 'stranger.xap' into memory isn't helping your memory consumption.

    It may be much more efficient (from a memory standpoint) to extract the values that you need from 'temp.txt' and then scan 'stranger.xap' by lines or paragraphs (faq) to find matches.

    Hanlon's Razor - "Never attribute to malice that which can be adequately explained by stupidity"
Re: RegEx on 4MB file consumes of 2GB of ram before windows shuts it down (Memory Leak in 5.8.2)
by tachyon (Chancellor) on Apr 12, 2004 at 22:59 UTC

    Here is how you should probably do it by using a lookup hash for the substitutions and the input record separator.

    #! perl
    # generate a single RE/hash for replacement
    open (PRIFILE, '<C:\temp.txt') or die;
    my %lookup;
    while (<PRIFILE>){
        chomp;
        ($Cue, $Sound, $Pri) = split ("\t");
        $lookup{$Sound} = $Pri;
    }
    close PRIFILE;
    my $re = join '|', keys %lookup;
    $re = qr/Name\s*=\s*($re)\s*;/;
    open (XAPFILE, 'C:\Documents and Settings\Nick\Desktop\040408 Work Files\stranger.xap') or die;
    open (OUTFILE, '>C:\stranger.xap') or die;
    # set input record separator so we read a record at a time
    local $/ = "Sound\n{";
    while (my $record = <XAPFILE>){
        if ( $record =~ m/$re/ ) {
            my $sound = $1;
            my $delta_pri = $lookup{$sound};
            # change existing pri
            unless ( $record =~ s/Priority\s*=\s*\d+/Priority=$delta_pri/ ) {
                # could not change so need to add
                $record =~ s/$sound/$sound\nPriority=$delta_pri;\n/;
            }
        }
        print OUTFILE $record;
    }
    close XAPFILE;
    close OUTFILE;

    cheers

    tachyon

      I'm actually working on a rather complex module to handle this type of thing. I guess I gave the impression that a sound entry was the only type of data object in the file. In fact it is a self referential nested tree with many types of objects...

      :)

      I could, however, slurp a line at a time until the line was ^\s*Sound$ then make sure the next line was correct, check if the name matched (and steal the white space). Finally I'd check the next line and either add a priority line or update it. I could write each line right back out to the output file as I go.
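
      The line-at-a-time idea described above might look roughly like the sketch below. It is only an illustration under assumptions: the helper name update_priorities, the state names, and the sample data are all invented, and it assumes "Sound", "{" and the Name line each sit on their own line with the Priority line (if any) directly after Name.

```perl
use strict;
use warnings;

# Hypothetical helper: copies $in to $out line by line, updating or adding
# a Priority line for every Sound block whose Name is a key of %$lookup.
sub update_priorities {
    my ($in, $out, $lookup) = @_;
    my $state = 'scan';              # scan -> brace -> name -> priority
    my ($indent, $pri);

    while (my $line = <$in>) {
        if ($state eq 'priority') {
            $state = 'scan';
            print {$out} "${indent}Priority = $pri;\n";   # new or replacement line
            next if $line =~ /^\s*Priority\s*=/;          # drop the old value
            # otherwise fall through: the current line is still copied below
        }
        if ($state eq 'brace') {
            $state = $line =~ /^\s*\{\s*$/ ? 'name' : 'scan';
        }
        elsif ($state eq 'name') {
            $state = 'scan';
            if ($line =~ /^(\s*)Name\s*=\s*(.+?)\s*;\s*$/ && exists $lookup->{$2}) {
                ($indent, $pri) = ($1, $lookup->{$2});    # reuse Name's indentation
                $state = 'priority';
            }
        }
        # A bare "Sound" line starts a new candidate block.
        $state = 'brace' if $state eq 'scan' && $line =~ /^\s*Sound\s*$/;
        print {$out} $line;
    }
    # Name was the last line of the input: still add the Priority line.
    print {$out} "${indent}Priority = $pri;\n" if $state eq 'priority';
}

# Demo on an in-memory file; the sample data is invented.
my $sample = "Sound\n{\n    Name = boom;\n    Volume = 3;\n}\n";
open my $demo_in, '<', \$sample or die "open: $!";
update_priorities($demo_in, \*STDOUT, { boom => 7 });
```

      Because only one line plus a little state is held at a time, memory use stays flat regardless of file size, and no substitution ever runs against the whole 4MB string.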

      That's a much better approach and much less prone to bugs (and it would work around the reg-ex memory leak in 5.8.2).

      Thanks

        You don't parse self referential nested trees with simple REs. You parse them with a parser (typically recursive descent) and then work over the nodes.
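
        As a rough illustration of what a recursive descent parser for this kind of data could look like, here is a minimal sketch. The grammar is an assumption, not the real XAP format: every block is taken to be a type word followed by braces, containing "Key = value;" properties and nested blocks. The names tokenize and parse_block are made up for the example.

```perl
use strict;
use warnings;

# Split the text into punctuation tokens ({ } = ;) and bare words.
sub tokenize {
    my ($text) = @_;
    return [ $text =~ /([{}=;]|[^\s{}=;]+)/g ];
}

# Assumed grammar:
#   block    := WORD '{' (property | block)* '}'
#   property := WORD '=' WORD* ';'
sub parse_block {
    my ($tok) = @_;
    my $type = shift @$tok;
    my $open = shift @$tok;
    die "expected '{' after block type\n" unless defined $open && $open eq '{';

    my %node = (type => $type, props => {}, children => []);
    while (@$tok && $tok->[0] ne '}') {
        # One token of lookahead: WORD '=' is a property, anything else a block.
        if (@$tok > 1 && $tok->[1] eq '=') {
            my $key = shift @$tok;
            shift @$tok;                                   # discard '='
            my @value;
            push @value, shift @$tok while @$tok && $tok->[0] ne ';';
            die "unterminated property '$key'\n" unless @$tok;
            shift @$tok;                                   # discard ';'
            $node{props}{$key} = join ' ', @value;
        }
        else {
            push @{ $node{children} }, parse_block($tok);  # recurse into sub-block
        }
    }
    die "missing '}' for '$type'\n" unless @$tok;
    shift @$tok;                                           # discard '}'
    return \%node;
}

# Invented sample input.
my $tree = parse_block(tokenize("Sound { Name = boom; Track { File = a.wav; } }"));
print "$tree->{type}: $tree->{props}{Name}, ",
      scalar @{ $tree->{children} }, " child block(s)\n";
```

        Once the data is a tree, changing a priority is a plain hash assignment such as $tree->{props}{Priority} = 9 on the right node, followed by writing the tree back out.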

        Regardless of that the code I supplied does the same as the regexes you were trying to use in your original post except a lot more efficiently. You seem to have missed what it does. Also you seem happy to blame a memory leak in the RE engine. Besides the fact that your REs are really rather badly written I pointed out you have an extra set of braces:

        while(<>) { { # surplus to requirements

        this may well be creating a closure.

        cheers

        tachyon