Can I (with XS) invoke the regex engine without making copies of the buffer?

NERDVANA has asked for the wisdom of the Perl Monks concerning the following question:

This is related to my module Crypt::SecretBuffer. The goal of the module is to keep secrets from leaking into the freed heap memory of the perl process, where something could later come along and scan the address space looking for leftovers. The SecretBuffer tries to isolate its data from all normal Perl ops, so that it can be viewed only by XS/C functions and is wiped clean when it goes out of scope.

So, in that context, is there any way to use "pregexec" to apply the perl regex engine to my buffer while preventing it from making copies into perl-owned buffers? pregexec seems a bit under-documented. It says "described in perlreguts", but while that has pages upon pages on the inner workings of the regex engine, it doesn't even give the meaning of the return value of pregexec or explain exactly what the final "nosave" parameter means. Ideally it would be a flag that does exactly what I want and avoids copying any buffers into any global variables, but that doesn't seem to be the case from looking at the C code (which I admit I haven't taken the time to fully understand yet).

I'd also be OK with it making copies, as long as someone could tell me a reliable way to zero out the buffers of those SVs afterward, so that all the captures magically appear to be full of NUL characters.

Basically I'd like it to behave like the standard C library regexec, which just records the positions of the capture groups in an array. I'm also debating whether I should just use libc's regexes and declare that as a limitation of the SecretBuffer API: you have to restrict yourself to POSIX extended regex notation.
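
For reference, here is a minimal sketch of the libc behavior described above, assuming a POSIX system with <regex.h>. The helper name find_captures is hypothetical (it is not part of Crypt::SecretBuffer or the perl API); it only illustrates that the caller gets back byte offsets in a regmatch_t array rather than copies of the captured text. It says nothing about what a given libc does internally while matching.

```c
#include <regex.h>   /* POSIX regcomp, regexec, regfree, regmatch_t */
#include <stddef.h>  /* size_t */

/* Hypothetical helper for illustration only.
 * Compiles `pattern` as a POSIX extended regex and runs it against the
 * NUL-terminated `text`. On a match, groups[0] holds the byte offsets of
 * the whole match and groups[1..] those of the capture groups; a group
 * that did not participate has rm_so == -1.
 * Returns 0 on a match, REG_NOMATCH when nothing matched, or another
 * nonzero regcomp error code when the pattern does not compile. */
int find_captures(const char *pattern, const char *text,
                  regmatch_t *groups, size_t ngroups)
{
    regex_t re;
    int rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc != 0)
        return rc;

    /* The results handed back to the caller are offsets into `text`,
     * not substrings. */
    rc = regexec(&re, text, ngroups, groups, 0);

    regfree(&re);
    return rc;
}
```

Called with the pattern ([a-z])([0-9]) and the text abc123 and room for three groups, this should report the whole match at bytes 2 to 4, the first group at 2 to 3, and the second at 3 to 4 (rm_eo is the offset just past the match). One caveat for this use case: plain regexec expects a NUL-terminated string, so the buffer would need a terminator and could not contain embedded NUL bytes.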

Update:

So actually the "nosave" parameter does appear to do some of what I want. Setting that flag prevents any of the magic variables from getting updated.

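perl -e 'use v5.40;
         use Inline C => q{
           int call_pregexec(SV *regex, SV *sv) {
             REGEXP *rx= SvRX(regex);
             STRLEN len;
             char *buf= SvPV(sv, len);
             /* args: regex, start, end, string base, minend, sv, nosave */
             return pregexec(rx, buf, buf+len, buf, 0, sv, 1);
           }
         };
         # ordinary match, which sets the magic variables
         say "098mnb" =~ /([0-9])([a-z])/;
         # XS match with nosave=1
         say call_pregexec(qr/([a-z])([0-9])/, "abc123");
         # these still show the results of the ordinary match above
         say $&;
         say $1; say $2;
         say $+[0]; say $+[1];'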

But then the question becomes how to find out *where* the regex matched, since none of the output variables were updated.


Replies are listed 'Best First'.
Re: Can I (with XS) invoke the regex engine without making copies of the buffer?
by dave_the_m (Monsignor) on Nov 10, 2025 at 13:03 UTC

The regex engine works on the assumption that for a successful match it will save a copy of the string and record the character offsets where $1 etc start and end. If you then try to access $1 et al, it behaves a bit like a tied variable and sets its value to that substring of the saved string.

If there is no saved string, then the regex engine isn't going to allow captures, because extracting a substring of the original string which the regex was run against could return random garbage or even SEGV if the original string had been modified or freed in the meantime.

The regex engine in newer perls tries to do a copy-on-write of the original string, which means that the copy and the original share the same string buffer unless/until the original string is modified or freed. Then the copy would take full ownership of the buffer.

But doing COW in a guaranteed-secure manner would be hard.

In short, Perl's regex engine isn't designed to handle this scenario, and it would be hard to be confident that the string is never leaked.

Dave.
