techtruth has asked for the wisdom of the Perl Monks concerning the following question:

Hi PerlMonks,

I have a question about Perl and the way it handles RAM with the regex binding operator (=~).

Preface: I am running Debian Squeeze with 1024 MB of RAM; my Perl version is 5.10.1; I have chosen not to upgrade my RAM.

Note: I think it is important to mention that the string I wish to run a global match against is in the area of 50 MB. :(

Problem: I am attempting to return an array from a global match on a very large string (my @array = $somebigstring =~ /regex/g) but it eats up (in my understanding) far more memory than it should.

My (failing) Solutions: I have used undef() on both $somebigstring and @array, which frees an amount of memory more or less equal to their size; this still leaves nearly all of my RAM taken up by some unknown data.
I have also tried writing the data to a file, then processing it line by line.

My Thoughts: I believe that the binding operator may set some small variables for each match. These small variables' collective size becomes very large if many matches are made, as in a global match.

Are my assumptions correct, and if so, is there a nice way to tell perl to release, or simply not store, that "extra" data? I am aware that perl has a memory "pool" that it does not release to the OS until the script is over.
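
To make my hypothesis concrete, here is a minimal sketch of the two ways //g can be used; the fake data, the pattern, and the process() handler are placeholders, not my real code:

# List context: every capture is copied into @all at once, so the peak
# memory cost is the big string PLUS a scalar for every single match.
my $somebigstring = "user1:dom1 user2:dom2 " x 1_000_000;   # roughly 20+ MB of fake data
my @all = $somebigstring =~ /(\S+):(\S+)/g;

# Scalar context: one match per loop iteration; only $1 and $2 for the
# current match are live, and pos($somebigstring) remembers where to resume.
while ( $somebigstring =~ /(\S+):(\S+)/g ) {
    process( $1, $2 );    # hypothetical handler
}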

I will provide a code snippet below, and would like any suggestions on how to handle the memory hogging.

print "\tProcessing data...\n"; foreach (@links) { my $link = $_; # Get data from the internet my $httpReply = $browser->get($link); ####### This is my problem, the regex match eats up a lot of RAM. my @data = $httpReply->content =~ /regex/g; undef($httpReply); #Frees memory round the size of the webpage. #undef(@data); #Frees memory around the size of the array. ####### ####### If the RAM is nearly full, I am unable to completely store + the values in a hash. # Add each extracted data to the hash while(scalar @data) { my $line = shift(@data); my ($field1, $field2) = split(':', lc($line)); # Sort the emails into a hash of arrays. key = domain; value = + [username, reference] push(@{$emailHash{$field2}} , [$field1, $link]); } }

Replies are listed 'Best First'.
Re: Binding operator RAM eater
by johngg (Canon) on Apr 17, 2012 at 22:29 UTC

    Not related to your problem but I'm wondering why you do

    foreach (@links) { my $link = $_; ...

    when you could simply write

    foreach my $link (@links) { ...

    The $link scalar in the second snippet still only exists within the scope of the foreach.

    Cheers,

    JohnGG

      It's not an issue with the OP's code, but in some cases you may wish to work with a copy instead of an alias (as you would get for the foreach my $link (@links) variant) so that you can edit the variable without altering the contents of the aliased array element.
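
      A tiny illustration (the array and the values are made up just for this example):

      my @words = ( 'Foo', 'Bar' );

      # The loop variable aliases each element, so editing it edits @words.
      foreach my $w (@words) {
          $w = lc $w;               # @words is now ('foo', 'bar')
      }

      # Working on an explicit copy leaves the array element alone.
      foreach my $w (@words) {
          my $copy = $w;
          $copy =~ s/o/0/g;         # @words is still ('foo', 'bar')
      }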

      True laziness is hard work
Re: Binding operator RAM eater
by RichardK (Parson) on Apr 17, 2012 at 21:48 UTC

    It's not clear to me what you are trying to achieve. It looks like you are trying to count the number of times the string 'regex' appears in your content, but that's not very useful, so ...

    Why store the intermediate data at all? Just use a while loop, something like this

    $test = "test 1 2 test 15 67 test 435 578"; while ( $content =~ /test (\d+) (\d+)/g) { say "test from $1 to $2"; }

    If that's not what you want, show us what you're searching for, and a small amount of example data.

Re: Binding operator RAM eater
by Anonymous Monk on Apr 17, 2012 at 21:52 UTC
    Once perl is allocated memory, it never returns it to the OS. Why not just operate on the string piecemeal:
    my $content = $httpReply->content;       # keep one copy so pos() persists between matches
    while ( $content =~ /(regex)/g ) {       # scalar-context //g advances one match per pass
        my $piece = $1;
        ...
    }
Re: Binding operator RAM eater
by bulk88 (Priest) on Apr 17, 2012 at 23:42 UTC
    Try using index and substr. Advance the starting position given to index every time the index/substr loop iterates. Much faster (I think 10x) than any regex for very simple matches. You could also make a copy of the content and destroy the HTTP object. Remember that you can open a new block, with a new my scope, wherever you want in Perl. You can also try Devel::Leak; I've used it personally, but it was useless for me since it turned out the memory leaks I found were interpreter bugs, not leaks at the Perl level. Also, what is your regex? Regexes can often be optimized to eliminate things like backtracking.
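
    A rough sketch of that index/substr loop; the marker string and the slice length here are invented purely for illustration:

    my $content = $httpReply->content;
    my $marker  = 'mailto:';
    my $pos     = 0;

    # index() takes a starting position as its third argument; bump it
    # past each hit so the next search resumes where the last one ended.
    while ( ( my $hit = index( $content, $marker, $pos ) ) != -1 ) {
        my $chunk = substr( $content, $hit + length($marker), 80 );
        # ... pull whatever you need out of $chunk with cheap string ops ...
        $pos = $hit + length($marker);
    }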

      index only helps if the pattern you are looking for is a plain substring, not a real regular expression.

        Hi PerlMonks,

        After some reading and thinking, I came up with this. (see below)

        Thank you RichardK for suggesting a method for piecewise matching. I had to tweak the regex a bit by prefixing it with 'm', which I thought was assumed to be there but I guess not. :)

        Also, thank you johngg for helping me keep even one more small variable in check.

        print "\tProcessing data...\n"; foreach my $link (@links) { # Get data from the internet my $httpReply = $browser->get($link); ####### This is the solution: piecewise matching, as suggested. while($html =~ m/(value)(value2)/g) { ## This is also my (a) solution, it turns out pushing two ## variables joined by a string is _way_ less expensive than ## an array, or anonymous array. ## ##The array has more things to keep track of than a simple scalar. # Correct? ## (I kinda went "doh!" when realized this...) push(@{$data{$2}}, join("::", $1, $link) ); } }
        Also, some fun things I saw on the way:

        http://perl.find-info.ru/perl/028/perlbp-chp-12-sect-17.html
        http://perldoc.perl.org/functions/pos.html
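
        For anyone following the pos() link, a quick illustration of how //g and pos() interact (the string is made up):

        my $str = 'aa bb cc';
        while ( $str =~ /(\w+)/g ) {
            # pos() reports where the next //g match on $str will resume.
            printf "matched '%s', next search starts at offset %d\n", $1, pos($str);
        }
        # prints offsets 2, 5 and 8 for this particular string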