PerlMonks  

Memory use/leak with large number of (?{}) patterns in regex

by vr (Curate)
on Nov 24, 2019 at 13:09 UTC

vr has asked for the wisdom of the Perl Monks concerning the following question:

I wanted to match strings containing many fixed-point, low-precision numerals, with "text"/whitespace in between. Because of the nature of the input source, their comparison should be numerically tolerant/fuzzy. I ended up with a solution which involves programmatically generated regular expressions with (?(?{...})(*F)) per number, i.e. there are many of these in a regex.
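To make the construct concrete, here is a minimal sketch of how such a regex might be generated. The helper name `fuzzy_regex`, the numeral pattern, and the tolerance argument are all assumptions for illustration, not the code from the original application. Each numeral in a template string becomes a capture followed by a conditional that forces failure, via (*F), when the captured value is too far from the expected one.

```perl
use strict;
use warnings;
use re 'eval';    # needed: code blocks arrive via an interpolated string

# Hypothetical helper: turn a template such as 'a 1.50 b 2.25' into a
# regex that accepts the same text with numerals within $tol of the
# template's values.
sub fuzzy_regex {
    my ($template, $tol) = @_;

    # Assumed numeral syntax; atomic so a failed check cannot backtrack
    # into a shorter numeral.
    my $num = qr/(?>-?\d+(?:\.\d+)?)/;

    my $src = '';
    # split with a capturing group keeps the numerals in the result list
    for my $piece (split /($num)/, $template) {
        if ($piece =~ /\A$num\z/) {
            # $^N is the text of the most recently closed capture group;
            # a true condition selects (*F), i.e. forces a failure here.
            $src .= "($num)(?(?{ abs(\$^N - ($piece)) > $tol })(*F))";
        }
        else {
            $src .= quotemeta $piece;    # literal text in between
        }
    }
    return qr/\A$src\z/;
}

my $re = fuzzy_regex('a 1.50 b 2.25', 0.015);
for my $input ('a 1.51 b 2.24', 'a 1.60 b 2.25') {
    printf "%-15s %s\n", $input, $input =~ $re ? 'match' : 'no match';
}
```

Note that this produces one capture group and one code block per numeral, which is exactly the shape that becomes expensive when a single string contains thousands of numerals.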

With some input, Perl started to segfault or die with "Out of memory!". Upon investigation, this input turned out to contain a degenerate case: a single sub-string with many thousands of numerals, for which a regex was created. This sub-string should probably have been excluded from processing in the first place, but I was curious about what was going on. Here is an SSCCE, reduced to a quite useless no-op:

use strict;
use warnings;
use re 'eval';
STDOUT->autoflush( 1 );

use constant LEN => 10_000;

my $n = 0;
my $s = '1' x LEN;
my $r = '(1)(?{
    # $^N ? 1 : 1;                # (*) does use of $^N exacerbate?
    print "*" and select          # to better watch
        undef, undef, undef, 0.25 # with htop
            unless ++ $n % 1000   #
})' x LEN;

print "\nMatch\n" if $s =~ /^$r$/;

print "1) Hit Enter"; <>;

( $s = '' ) =~ //;                # reset everything
                                  # about $s and re-engine (?)
$s = ( int rand 10 ) x 1e9;       # allocate another Gb

print "2) Hit Enter"; <>;

I'm testing with 64-bit Perl on Linux with 8 GB of RAM. With LEN => 10_000, Perl eats ~1 GB of memory and apparently sits on it, i.e. doesn't free it when it needs more. With 20_000, it's already ~4.5 GB, plus another 1 GB upon scalar creation (memory is not freed even after the regex engine was reset?). With 20_000 and the (*) line un-commented, Perl segfaults after 13 stars; it doesn't appear to have consumed all available RAM. With 30_000 and the (*) line commented out again, it's "panic: memory wrap at (eval 6) line 155489. Attempt to free unreferenced scalar: SV 0x56258a3af7b0, Perl interpreter: 0x562584d66260 at (eval 6) line 155489." after 22 stars.

Arguably, a regex with 10_000 (?{}) blocks is stupid, but I wonder whether it indicates a slow leak with a "normal" number of such blocks in a long-running process.

Replies are listed 'Best First'.
Re: Memory use/leak with large number of (?{}) patterns in regex
by dave_the_m (Monsignor) on Nov 24, 2019 at 15:10 UTC
    Well, the lack of freeing is unremarkable. Running the regex is likely to malloc() and finally free() lots of small chunks of memory. These will be reused if you run a similar regex again, but a subsequent malloc() of a single 1 GB string is unlikely to be able to make use of all those little blocks recently freed.

    However, what *is* worrying is that memory usage goes quadratic in the number of code blocks in the pattern. I'll try to have a look at it when I have the time.

    Dave.

      It's the combination of captures and code blocks. Each time the regex engine is about to execute a code block, it saves the indices of all the captures done so far, so they can be restored at the end. It does this on the pessimistic assumption that code within the block can do anything, including recursively executing the same regex again, overwriting the existing capture indices.

      This is why quadratic memory behaviour is being seen.
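The arithmetic behind "quadratic" can be sketched as follows. This is a simplified count under the assumption that, in the test pattern above, the k-th code block saves the k captures closed before it; it counts saved entries only and says nothing about bytes per entry. The function name is hypothetical.

```perl
use strict;
use warnings;

# Simplified model: block k saves k capture entries, so N blocks save
# 1 + 2 + ... + N = N * (N + 1) / 2 entries in total.
sub saved_capture_entries {
    my ($n) = @_;
    return $n * ($n + 1) / 2;
}

for my $len (10_000, 20_000, 30_000) {
    printf "%6d blocks -> %d saved entries\n",
        $len, saved_capture_entries($len);
}
```

Doubling the number of blocks roughly quadruples the count, which is at least in the same ballpark as the reported jump from ~1 GB at 10_000 to ~4.5 GB at 20_000.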

      Not ideal, but it can be avoided if you use non-capturing groups, i.e. (?:...) instead of (...).

      Dave.
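As a sketch of that workaround applied to the test case from the question, the only change is `(?:1)` in place of `(1)`. LEN is kept small here because the point is the pattern shape, not a memory measurement, and the code block is reduced to a counter.

```perl
use strict;
use warnings;
use re 'eval';

use constant LEN => 1_000;    # small on purpose; illustration only

my $n = 0;
my $s = '1' x LEN;

# Non-capturing group: there are no capture indices for the engine to
# save before each code block runs.
my $r = '(?:1)(?{ ++$n })' x LEN;

print "Match, $n code blocks ran\n" if $s =~ /^$r$/;
```

The cost is that `$^N` and `$1`-style variables are no longer available inside the block. Per perlre, `$_` inside a `(?{...})` block refers to the string being matched and `pos()` gives the current match position, so the matched text can in principle be recovered that way, though whether that fits the fuzzy-number use case is untested here.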

        Could saving the capture indices be lazily done, with some kind of "regex in use" flag set on the regex, such that recursively executing the same regex causes the capture indices to be preserved, but only if really needed?

        This would slightly add to the general regex overhead, from needing to check the "regex in use" flag on every pattern match, but perhaps that could be folded into the existing logic that handles compiling patterns when needed?

Node Type: perlquestion [id://11109143]
Approved by johngg
Front-paged by Corion