I believe you are being bitten by regex engine leaks.
Here's what I discovered.
- If I replace _iso8601_rx() with the bare minimum needed to parse the date/time in the test, the memory leaks disappear completely:
my %cache;
sub _iso8601_rx {
    my($self,$rx) = @_;
    my $dmt = $$self{'tz'};
    my $dmb = $$dmt{'base'};
    return $cache{ $rx } if exists $cache{ $rx };
}
$cache{cdate} = '(?<y>\d\d\d\d)-(?<m>\d\d)-(?<d>\d\d)';
$cache{ctime} = '(?<h>\d\d):(?<mn>\d\d):(?<s>\d\d)';
$cache{fulldate} = "$cache{cdate}\\s+$cache{ctime}";
1;
- However, if I change that to use the fully expanded regexes, it goes back to leaking like a sieve:
my %cache;
sub _iso8601_rx {
    my($self,$rx) = @_;
    my $dmt = $$self{'tz'};
    my $dmb = $$dmt{'base'};
    return $cache{ $rx } if exists $cache{ $rx };
}
$cache{cdate} = <<'ERX';
(?i-xsm:(?:(?<y>\d\d\d\d)(?<m>\d\d)(?<d>\d\d)|(?<y>\d\d\d\d)\-(?<m>\d\d)\-(?<d>\d\d)|\-(?<y>\d\d)(?<m>\d\d)(?<d>\d\d)|\-(?<y>\d\d)\-(?<m>\d\d)\-(?<d>\d\d)|\-?(?<y>\d\d)(?<m>\d\d)(?<d>\d\d)|\-?(?<y>\d\d)\-(?<m>\d\d)\-(?<d>\d\d)|\-\-(?<m>\d\d)\-?(?<d>\d\d)|\-\-\-(?<d>\d\d)|(?<y>\d\d\d\d)\-?(?<doy>\d\d\d)|\-?(?<y>\d\d)\-?(?<doy>\d\d\d)|\-(?<doy>\d\d\d)|(?<y>\d\d\d\d)W(?<w>\d\d)(?<dow>\d)|(?<y>\d\d\d\d)\-W(?<w>\d\d)\-(?<dow>\d)|\-?(?<y>\d\d)W(?<w>\d\d)(?<dow>\d)|\-?(?<y>\d\d)\-W(?<w>\d\d)\-(?<dow>\d)|\-?(?<yod>\d)W(?<w>\d\d)(?<dow>\d)|\-?(?<yod>\d)\-W(?<w>\d\d)\-(?<dow>\d)|\-W(?<w>\d\d)\-?(?<dow>\d)|\-W\-(?<dow>\d)|\-\-\-(?<dow>\d)))
ERX
$cache{ctime} = <<'ERX';
(?-xism:(?:(?<h>[0-1][0-9]|2[0-3])(?<mn>[0-5][0-9])(?<s>[0-5][0-9])(?:[\.,]\d*)?|(?<h>[0-1][0-9]|2[0-3]):(?<mn>[0-5][0-9]):(?<s>
... bulk of the regex elided because PM won't let me post that much! ...
azt|ret|mot|gyt|lrt|ut|e|a|u|k|o|d|z|t|n|p|y|g|w|s|c|i|m|b|q|v|r|x|h|f|l)) \))? ))))?)
ERX
$cache{fulldate} = <<'ERX';
(?x-ism:^\s*(?: (?i-xsm:(?:(?<y>\d\d\d\d)(?<m>\d\d)(?<d>\d\d)|(?<y>\d\d\d\d)\-(?<m>\d\d)\-(?<d>\d\d)|\-(?<y>\d\d)(?<m>\d\d)(?<d>\d\d)|\-(?<y>\d\d)\-(?<m>\d\d)\-(?<d>\d\d)|\-?(?<y>\d\d)(?<m>\d\d)(?<d>\d\d)|\-?(?<y>\d\d)\-(?<m>\d\d)\-(?<d>\d\d)|\-\-(?<m>\d
... bulk of the regex elided because PM won't let me post that much in a single post! ...
nmt|lkt|gst|vet|tjt|eat|ept|cat|pht|pwt|nft|set|gft|hst|nut|qmt|mpt|trt|ywt|cdt|emt|met|ast|net|kst|ect|brt|bdt|mvt|cst|cvt|fmt|azt|ret|mot|gyt|lrt|ut|e|a|u|k|o|d|z|t|n|p|y|g|w|s|c|i|m|b|q|v|r|x|h|f|l)) \))? ))))?) |
(?-xism:(?:(?<h>[0-1][0-9]|2[0-3])|\-(?<mn>[0-5][0-9])))
)\s*$)
ERX
1;
I thought that it might be the use of (so many) named captures, so I tried very hard to make them leak: a single regex with 175,000 named captures, matched with /g against a string containing 10,000 matches for them, in a (very slow) loop. The process grew very large, but once it maxed out, it didn't leak at all.
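For anyone who wants to repeat that kind of test, here is a scaled-down sketch of a named-capture stress loop. It is not the script I ran: the sizes, the capture names (n0, n1, ...) and the subject string are all made up for illustration, and you would need to raise the counts considerably to approach the numbers quoted above.

```perl
#! perl
use strict;
use warnings;

# Hypothetical, scaled-down sizes; the test described above used far
# larger numbers of named captures and matches.
my $captures = 200;
my $passes   = 5;

# Build one alternation in which every branch has its own named capture:
# (?<n0>w0;)|(?<n1>w1;)|...
my $source = join '|', map { "(?<n$_>w$_;)" } 0 .. $captures - 1;
my $rx     = qr/$source/;

# A subject string containing exactly one match for every branch.
my $subject = join '', map { "w$_;" } 0 .. $captures - 1;

my $matches = 0;
for my $pass ( 1 .. $passes ) {
    while ( $subject =~ /$rx/g ) {
        # Touch %+ so the named-capture machinery is actually exercised.
        my( $name ) = keys %+;
        ++$matches if defined $+{ $name };
    }
}

print "matched $matches times\n";
```

Watch the process size in top or Task Manager while it runs; growth that levels off is just the working set, whereas growth that never stops is a leak.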
So then I remembered that I'd seen the regex trie optimisation cause problems with large alternations, but disabling it didn't change things.
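One way to switch the trie optimisation off, as a minimal sketch: perlvar documents that a negative ${^RE_TRIE_MAXBUF} prevents it. The short zone list and the capture name below are stand-ins for illustration only; the real alternations are far larger.

```perl
#! perl
use strict;
use warnings;

# Stand-in alternation; the real regexes contain a much larger list of
# time zone abbreviations.
my @zones = qw( azt ret mot gyt lrt ut );

my $rx;
{
    # A negative value prevents the trie optimisation (see perlvar).
    # It has to be in effect when the pattern is compiled.
    local ${^RE_TRIE_MAXBUF} = -1;
    my $alt = join '|', @zones;
    $rx = qr/\b(?<zone>$alt)\b/i;
}

# The pattern still matches as usual; only the optimisation is skipped.
my $zone = '2010-02-01 01:02:03 GYT' =~ $rx ? $+{zone} : undef;
print "zone: $zone\n";
```

The setting only matters at compile time, so it needs to be in place before the qr// (or the first use of the pattern) is reached.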
Then I thought to try your monster regexes in a standalone script and run them directly on the sample date in a loop:
#! perl
use strict;
my %cache = ( ctime => <<'RXA', cdate => <<'RXB', fulldate => <<'RXC' );
##... monster regex initialisation elided;
my $refull = qr[$cache{ fulldate }]x;
my $rectime = qr[$cache{ ctime }]x;
my $recdate = qr[$cache{ cdate }]x;
for (1..100e6) {
    "2010-02-01 01:02:03" =~ $refull;
    "2010-02-01 01:02:03" =~ $rectime;
    "2010-02-01 01:02:03" =~ $recdate;
}
Run that way, it doesn't leak at all. Not a jot.
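If you would rather have numbers than eyeball the process size, something along these lines can sample memory from inside the loop. It is only a sketch under assumptions: rss_pages() is a made-up helper that reads /proc/self/statm, so it only reports anything on Linux, and the short regex is the minimal date/time pattern from above standing in for the monster ones.

```perl
#! perl
use strict;
use warnings;

# Hypothetical helper: resident set size in pages, read from
# /proc/self/statm (Linux only). Returns undef where that file is missing.
sub rss_pages {
    open my $fh, '<', '/proc/self/statm' or return undef;
    my $line = <$fh>;
    return ( split ' ', $line )[1];
}

# Stand-in for the monster regexes, which are not reproduced here.
my $re = qr/(?<y>\d\d\d\d)-(?<m>\d\d)-(?<d>\d\d)\s+(?<h>\d\d):(?<mn>\d\d):(?<s>\d\d)/;

my $hits = 0;
my @samples;
for my $i ( 1 .. 200_000 ) {
    ++$hits if "2010-02-01 01:02:03" =~ $re;

    # Take a memory sample every 50,000 iterations.
    push @samples, scalar rss_pages() unless $i % 50_000;
}

print "hits: $hits\n";
print "rss (pages): ", join( ' ', map { $_ // 'n/a' } @samples ), "\n";
```

Samples that stay flat after the first one or two point to no leak; samples that climb steadily for as long as the loop runs point to one.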
So, it's not just the monster regexes; it's also how they are being used, or how their results are being used, that triggers the leak.
I'm kinda stuck for a direction in which to go now, but I hope that this will help you zero in on the cause. I'll keep looking.