in reply to Re: RegEx on 4MB file consumes of 2GB of ram before windows shuts it down.
in thread RegEx on 4MB file consumes of 2GB of ram before windows shuts it down (Memory Leak in 5.8.2)

I agree that the RE is ugly -- for example, [\d\D] could just be expressed as . (with the /s modifier, so that . also matches newlines) -- but the *? is a lazy quantifier, so it won't match everything and then backtrack.
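A minimal sketch of the difference between the lazy and greedy forms, and of why . only stands in for [\d\D] under /s. The sample string is made up for illustration and is not taken from stranger.xap.

```perl
use strict;
use warnings;

# Hypothetical two-line sample; not real data from the poster's file.
my $text = "Priority = 1;\nName = a;\n";

# Lazy: expands one character at a time and stops at the first semicolon.
my ($lazy) = $text =~ /=\s([\d\D]*?);/;

# Greedy: runs to the end of the string, then backtracks to the last semicolon.
my ($greedy) = $text =~ /=\s([\d\D]*);/;

# Without /s, "." will not cross a newline; with /s it behaves like [\d\D].
my ($dot_plain) = $text =~ /=\s(.*);/;
my ($dot_s)     = $text =~ /=\s(.*);/s;

print "lazy:       [$lazy]\n";
print "greedy:     [$greedy]\n";
print "dot:        [$dot_plain]\n";
print "dot with s: [$dot_s]\n";
```

The lazy form still has to test for the semicolon after every character it consumes, but it never has to swallow the rest of a 4MB string first.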

(To the poster -- I highly recommend picking up the book "Mastering Regular Expressions" if you're going to use regular expressions on a regular basis.)

update: To the poster, a couple questions. Do you get any output from this or does it just die? (You're printing to STDOUT, not your OUTFILE anyway, so if you get no output at all, then this is dying before it ever hits a print statement.) How big is stranger.xap in bytes, not lines? Have you run this code using the perl debugger and seen where/how it blows up?

Looking at your code, it looks like you're pulling all of your XAPFILE into $file. Is there a reason you need to do that versus going line by line and saving into a new output file rather than overwriting your old one? Your regular expressions are ending with newlines anyway. (Data spans multiple lines and is semicolon terminated? Is that why you're using [\d\D]*?; in your RE or is your priority just a short number?)

It would definitely help to post a few examples of data from stranger.xap and temp.txt and explain what you're trying to do to make sense of your code and help you out.

update 2: Looking closer at your RE's, I think they aren't doing what you think. Look at the two RE's -- just the matching part -- next to each other. (I've replaced the "\ " with "\s" for clarity here:)

(Sound\s*\{\s*Name\s=\s$Sound;\s*Priority\s=\s)([\d\D]*?);
(Sound\s*\{(\s*)Name\s=\s$Sound;\s*$)

These differ in a couple ways. Obviously, the first is looking for the case where a "Priority" exists, but the second also is capturing any potential spaces at a certain point with (\s*). Why? Also, the second one terminates at an end of line (since you use the "/m" modifier and have a "$"), but the first terminates just at the semicolon. Again, why? Do these lines wind up differing by more than just whether or not a Priority exists?
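To make the two differences concrete, here is a small sketch of what the "$" anchor does with and without /m, and of what the extra (\s*) group captures. The fragment and the Volume property are hypothetical; the real file layout may differ.

```perl
use strict;
use warnings;

# Hypothetical fragment; the real stranger.xap layout may differ.
my $data = "Sound {\n  Name = example;\n  Volume = 5;\n}\n";

# Without /m, "$" can only match at the very end of the string
# (or just before a final newline), so this pattern fails here.
my $without_m = ($data =~ /Name\s=\sexample;\s*$/) ? 1 : 0;

# With /m, "$" can also match just before any embedded newline.
my $with_m = ($data =~ /Name\s=\sexample;\s*$/m) ? 1 : 0;

# (\s*) captures the newline plus indentation between "{" and "Name".
my ($indent) = $data =~ /Sound\s*\{(\s*)Name/;

printf "without /m: %d, with /m: %d, captured %d whitespace chars\n",
    $without_m, $with_m, length $indent;
```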

Then look at your replacement parts, again next to each other.

$1$Pri;
$1$2Priority\ =\ $Pri;\n

First of all, those "\ " in the second line aren't needed: the replacement side is interpolated like a double-quoted string, not parsed as a regex, so as far as I know each "\ " just comes out as a plain space and you can write the space directly. You're also only terminating with a "\n" on the second one -- that doesn't seem right, either. And the "$1$2" on the second line is only necessary because you're capturing those spaces. Overall, making some assumptions about what you're trying to do, it seems like these two could be combined pretty easily as:

s/(Sound\s*\{\s*Name\s=\s$Sound).*?;/$1; Priority = $Pri;/m

I still think that probably won't do what you want, as I don't understand how your data spans lines (or not) or really what your data format is, but I think the RE above is a cleaner version of what your existing RE's are trying to do.
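For what it's worth, here is one way a single substitution could cover both cases, as an untested-against-real-data sketch. It assumes that an existing Priority line, if any, directly follows the Name line and holds a plain integer. The set_priority helper, the sample text and the Volume property are all made up for illustration.

```perl
use strict;
use warnings;

# Set (or add) the Priority line of the named Sound section.
# Assumes an existing Priority line, if any, directly follows the Name line
# and holds a plain integer.
sub set_priority {
    my ($text, $sound, $pri) = @_;
    # Group 1: everything from "Sound" through the Name line.
    # Group 2: the whitespace (newline and indent) in front of "Name".
    # The optional non-capturing group swallows an existing Priority line.
    $text =~ s/
        ( Sound \s* \{ (\s*) Name \s = \s \Q$sound\E ; )
        (?: \s* Priority \s = \s \d+ ; )?
    /${1}${2}Priority = $pri;/x;
    return $text;
}

# Hypothetical sample sections, one with and one without a Priority line.
my $with    = "Sound\n{\n    Name = example;\n    Priority = 99;\n    Volume = 1;\n}\n";
my $without = "Sound\n{\n    Name = example;\n    Volume = 1;\n}\n";

my $new_with    = set_priority($with,    'example', 5);
my $new_without = set_priority($without, 'example', 5);

print $new_with, $new_without;
```

Reusing the captured whitespace in front of "Name" keeps the new line indented like its neighbours, and \Q...\E keeps any regex metacharacters in the sound name from being interpreted.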

-xdg

Code posted by xdg on PerlMonks is public domain. It has no warranties, express or implied. Posted code may not have been tested. Use at your own risk.

Re: Re: Re: RegEx on 4MB file consumes of 2GB of ram before windows shuts it down.
by Ardemus (Beadle) on Apr 12, 2004 at 13:51 UTC
    Thanks for looking into it.

    1) The code does not die. It consumes memory until Windows decides not to allow any more. Then I get a Windows XP dialog box that says Perl has been naughty and will be closed.

    2) Data spans multiple lines, but the priority is just a single 1 byte integer. [\d\D]*? could be replaced with \d*? in this case. However the RegEx was made to be generic, and I just plugged some values into it (some properties contain any character and can span multiple lines).

    3) You're right, the two RE's differ, and they're meant to handle the two cases where a priority line does and does not exist. First, this is the data format:

    Sound
    {
        Name = example;
        Priority = 99;
        etc...
        {
            (possible sub sections)
        }
        etc...
    The whitespace may vary.

    That may make things more clear. In the first RE I can simply replace the number, so I have "$1$Pri;". However, in the second, I need to add a new line, including the correct white space. $1 contains the data up to the point where I want to add the line (that's why I force it to end at the end of the line). $2 contains a carriage return and the leading white space before "Name". So adding it after $1 duplicates the whitespace for my new line. Then I add the new priority line and a carriage return.

    It may be true that the whitespace is off (and the backslashes show up in my resulting data). However, that is something that I can easily debug once I start getting output results from the program.

    My primary concern is the massive memory consumption, which prevents the program from finishing. It seems as if there is some sort of memory leak where, for example, Perl is creating the $`$&$' variables and then not destroying them for the next iteration.

    I could be wrong, and it could be by design, but what I'm hoping for is a way to get perl to do this task without using more than 2 gigabytes of ram.

    Thanks for the feedback. Do you have any idea why this script is using so much memory (and how to avoid it)?

    Thanks,

    Nick

      Ick. What version of perl are you using? On a hunch, some quick googling revealed that some regex memory leaks with the s/// operator were introduced in 5.8.1. Seems like they should be fixed in 5.8.3.

      If you can't upgrade, then your best bet may be to try to read the file section by section rather than having the whole file in memory each time through. Best of luck!
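A rough sketch of the section-by-section idea, assuming every top-level section starts on a line that begins with "Sound". The process_sections helper, the callback and the sample data are hypothetical, and the in-memory filehandles only stand in for the real input and output files.

```perl
use strict;
use warnings;

# Hand one "Sound" section at a time to $edit and write the result to $out,
# so each substitution only ever sees a small string.
# Assumes every top-level section starts on a line beginning with "Sound".
sub process_sections {
    my ($in, $out, $edit) = @_;
    my $section = '';
    while (my $line = <$in>) {
        # A new section begins: flush the previous one first.
        if ($line =~ /^\s*Sound\b/ && length $section) {
            print {$out} $edit->($section);
            $section = '';
        }
        $section .= $line;
    }
    # Flush whatever is left after the last line.
    print {$out} $edit->($section) if length $section;
    return;
}

# Demo on in-memory filehandles with made-up data.
my $input = "Sound\n{\n    Name = a;\n    Priority = 1;\n}\n"
          . "Sound\n{\n    Name = b;\n    Priority = 2;\n}\n";
my $output = '';
open my $in,  '<', \$input  or die "cannot read input: $!";
open my $out, '>', \$output or die "cannot write output: $!";

process_sections($in, $out, sub {
    my ($section) = @_;
    # Change the priority of sound "b" only.
    $section =~ s/(Name\s=\sb;\s*Priority\s=\s)\d+;/${1}7;/;
    return $section;
});

close $out;
print $output;
```

Writing to a separate output file and renaming it afterwards also means a crash halfway through leaves the original file intact.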

      -xdg


        perl -v shows version 5.8.2. Thanks, that's probably it.

        Nick