in reply to Re: Leaking Regex Captures
in thread Leaking Regex Captures

Note that this is closely related to: Re: Regex - Matching prefixes of a word

The original goal of the regex is to match a command string similar to:
beam 15 crew 5 wounded 2 critical to S.S.Kevorkian
where the number-type pairs are optional and may appear in any order, provided that at least one pair is present. (No point in beaming nobody over.)

Hence the (\d+)\s*literals form of each piece,
and the (?: (capture)X | (capture)Y | (capture)Z )+ overall structure.
Wrapped around that structure is: /^(?:$regexSubstringOf{beam}|$regexSubstringOf{transport}\s* )\s*(?:$structure)\s+(?:to\s+)?$regexObjectName\s*$/i

It all ends up as: addCommand('transport', {crew=>$1,wound=>$2,crit=>$3},$4) if $cmd =~ /regex/i; ($4 is the ship name, captured by $regexObjectName.)


My workaround is to capture the whole pair and then, inside the addCommand() function, run s/\D//g on each hash value that is defined.
I also have to add a negative lookahead to the captures to keep '5 crit' from matching as an abbreviation of 'crew' (as "5 cr") and stomping on the $1 value before backtracking kicks in.
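
The abbreviation problem described above can be reproduced in isolation. This is a minimal sketch only: $crew is a hand-written, hypothetical stand-in for $regexSubstringOf{crew}, whose real definition is not shown in this post, and the lookahead shown is just one plausible form of the guard.

```perl
use strict;
use warnings;

# Hypothetical stand-in for $regexSubstringOf{crew}: matches c, cr, cre or crew.
my $crew = qr/c(?:r(?:e(?:w)?)?)?/i;

# Without a guard, "5 crit" matches as "5 cr" and the number lands in $1.
my $loose = qr/(\d+)\s*$crew/;

# With a negative lookahead, the abbreviation may not be followed by a letter,
# so "5 crit" can no longer be mistaken for an abbreviated "5 crew".
my $strict = qr/(\d+)\s*$crew(?![a-z])/i;

for my $input ('5 crit', '5 cr', '5 crew') {
    my $l = $input =~ $loose  ? "matched, \$1=$1" : 'no match';
    my $s = $input =~ $strict ? "matched, \$1=$1" : 'no match';
    printf "%-7s loose: %-15s strict: %s\n", $input, $l, $s;
}
```

With the guard in place, '5 crit' is rejected by the crew branch, while '5 cr' and '5 crew' still match.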



To sum up: I want the numbers out of those pairs, with $1 = number of healthy crew, $2 = number of wounded, $3 = number of critically injured.
How I get them is not important. If a category appears more than once in the command string, I don't much care which copy gets picked, although consistency is desirable and the last one is better than the first: that way a user who makes a mistake can just keep typing instead of backspacing to change the number.
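
One way to get those properties from a single match is to give each category its own lookahead, so that no capture group is shared between alternatives. The following is a sketch under stated assumptions, not the code in use here: parse_transport is a hypothetical helper, the category words are spelled out literally instead of going through $regexSubstringOf, only the 'beam' verb is handled, and \S+ stands in for $regexObjectName.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original code). Returns
# (crew, wounded, critical, ship) on success, or an empty list on failure.
sub parse_transport {
    my ($cmd) = @_;
    return unless $cmd =~ /
        ^ \s* beam \s+
        # One optional lookahead per category. The greedy dot-star makes the
        # LAST occurrence win; the word boundary before the digits keeps the
        # dot-star from swallowing the leading digits of a number.
        (?= (?: .* \b (\d+) \s* crew     \b )? )    # capture 1
        (?= (?: .* \b (\d+) \s* wounded  \b )? )    # capture 2
        (?= (?: .* \b (\d+) \s* critical \b )? )    # capture 3
        # The pairs themselves: any order, at least one, nothing captured.
        (?: \d+ \s* (?: crew | wounded | critical ) \s+ )+
        (?: to \s+ )?
        (\S+) \s* \z                                # capture 4: ship name
    /xi;
    return ($1, $2, $3, $4);
}

my @got = parse_transport('beam 15 crew 5 wounded 2 critical to S.S.Kevorkian');
print join(', ', map { defined $_ ? $_ : 'undef' } @got), "\n";
# prints: 15, 5, 2, S.S.Kevorkian
```

A category that is absent leaves its capture undefined, and a repeated category yields its last number. One caveat of this sketch: the lookaheads scan the whole remainder of the command, ship name included.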

Re^3: Leaking Regex Captures
by Marshall (Canon) on Aug 05, 2009 at 16:07 UTC
    Well, how about this....?
    #!/usr/bin/perl -w
    use strict;

    while (<DATA>) {
        print "testing: $_";
        chomp;
        my @pairs = m/(\d+)\s+(\w+)/g;
        print "@pairs\n\n";
    }

    #Prints:
    #testing: beam 15 crew 5 wounded 2 critical to S.S.Kevorkian
    #15 crew 5 wounded 2 critical
    #
    #testing: oh, my gosh, darn 5 killed 2 want_sex_change 10 drunk
    #5 killed 2 want_sex_change 10 drunk
    #
    #testing: what a day:5 wounded 2 critical 20 crew
    #5 wounded 2 critical 20 crew
    #
    #testing: 20 crew and 6 killed and 14 MIA
    #20 crew 6 killed 14 MIA

    __DATA__
    beam 15 crew 5 wounded 2 critical to S.S.Kevorkian
    oh, my gosh, darn 5 killed 2 want_sex_change 10 drunk
    what a day:5 wounded 2 critical 20 crew
    20 crew and 6 killed and 14 MIA

      That would involve a lot of post-processing to match up the numbers with the categories and filter the categories to just the valid ones ('crew', 'wounded' and 'crit'). And it can't be inserted into a larger regex match.

      (A lot of work, compared to just: "passing $1, $2, ... $N and some constants into the addCommand() function if and only if the regex matches")


      At the moment I have around 20-25 lines, each with a single regex guarding one call to addCommand(). I thus have a strong aversion to any post-processing of the matches that would make the code balloon.

      As noted earlier in the thread, I do have a workaround which is suboptimal but adequate. Optimal would be if no post-processing were required, because the captures did not get stomped on.

        That would involve a lot of post-processing to match up the numbers with the categories and filter the categories to just the valid ones ('crew', 'wounded' and 'crit'). And it can't be inserted into a larger regex match.

        Of course I'm just seeing one part of the overall picture, but with just a very minor modification to the code, I generate a hash table with the noun as the key and the number as the value. Aside from the print statements, this is just a few lines of code. I would expect this to be a sub that you call and re-use many times. To check whether enough pairs are present, look at the number of keys. Checking whether one of the nouns is invalid takes just 2 lines of code (see below).

        Basically I would advocate some kind of data table driven approach with some rules being applied by some subs to that tabular data description. I mean if you have a validate sub that uses a table of valid nouns, then you can call that sub with other tables of valid nouns as the situation requires.

        Validating user input is often harder than it first appears, and I wouldn't be overly concerned about 25 lines versus a whole page of code IF that page is clear. Clarity should be a higher priority than the number of lines, because it leads to less buggy code that is easier to maintain.

        #!/usr/bin/perl -w
        use strict;

        my %valid = qw (crew 1 critical 1 wounded 1 killed 1);

        while (<DATA>) {
            print "testing: $_";
            chomp;
            my %hash = reverse(m/(\d+)\s+(\w+)/g);
            foreach my $key (keys %hash) {
                print "$key $hash{$key}\n";
            }
            my @invalid = grep {!$valid{$_}} keys %hash;
            print "invalid nouns: @invalid\n" if @invalid;
            print "\n";
        }

        # testing: beam 15 crew 5 wounded 2 critical to S.S.Kevorkian
        # crew 15
        # critical 2
        # wounded 5
        #
        # testing: what a day:5 wounded 2 critical 20 crew
        # critical 2
        # crew 20
        # wounded 5
        # testing: 20 crew and 6 killed and 14 MIA
        # crew 20
        # killed 6
        # MIA 14
        # invalid nouns: MIA

        __DATA__
        beam 15 crew 5 wounded 2 critical to S.S.Kevorkian
        what a day:5 wounded 2 critical 20 crew
        20 crew and 6 killed and 14 MIA
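
The table-driven idea could be packaged as a reusable sub along the following lines. This is only a sketch: parse_pairs and %transport_nouns are hypothetical names, and the table of valid nouns is passed in as a hash reference so that different commands can supply different tables. Note that building the hash with reverse() keeps the first occurrence of a repeated noun, whereas the loop below keeps the last one, which is what was asked for earlier in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: collect "<number> <noun>" pairs from $cmd and check
# each noun against the table passed in as a hash reference.
sub parse_pairs {
    my ($cmd, $valid) = @_;
    my (%count, @invalid);
    while ($cmd =~ /(\d+)\s+(\w+)/g) {
        my ($number, $noun) = ($1, $2);
        if ($valid->{ lc $noun }) {
            $count{ lc $noun } = $number;   # a later pair overwrites an earlier one
        }
        else {
            push @invalid, $noun;
        }
    }
    return (\%count, \@invalid);
}

# Example table for one command; other commands would pass their own.
my %transport_nouns = map { $_ => 1 } qw(crew wounded critical);

my ($count, $invalid) =
    parse_pairs('20 crew and 6 killed and 14 MIA', \%transport_nouns);
print "$_ $count->{$_}\n" for sort keys %$count;
print "invalid nouns: @$invalid\n" if @$invalid;
```

The caller can then decide what to do with an empty %$count (nothing to beam over) or a non-empty @$invalid (reject the command or warn the user).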