perlboy_emeritus has asked for the wisdom of the Perl Monks concerning the following question:

Greetings Monks

Recently I upgraded my Perl from 5.18 to 5.36, and especially because of the changes at 5.20 (no longer a performance bias against general use of $&) I've been revisiting a host of my extensive experience with regex, to see whether there are better ways to do what I'd been doing. I even contacted O'Reilly to see whether Friedl was working on a 4th edition of Mastering Regular Expressions (sadly, he's not). O'Reilly's answer was a long list of books re regex none of which are more advanced than Friedl. I replied I want to move beyond Friedl, not start over from the beginning :-( So, the only solution I see is to pose questions to you Monks in the hope of stimulating rigorous discussion, to wit, my next:

Background: Friedl poses a number of interesting numeric problems that he addresses with alternation sub-expressions (detecting and validating passwords (enforcing minimums of certain char types), dates and floats come to mind). Interestingly, most only work when candidates embedded in longer strings are first extracted, so that ^ and $ anchors can bound the characters to be matched. The result is a typical two-regex pattern: without the first regex to isolate the bits to be checked, the second consistently fails (see my included example). Instead of (...|...|...) I've been successful using positive and negative look-ahead patterns to eliminate outliers, but that first extraction pattern seems to be essential; without it my otherwise working regex fails. My question is: can this be reliably and efficiently achieved without first isolating the sub-expression to be checked? In other words, one efficient regex rather than two?

#!/usr/bin/env -S perl -w
##!/usr/bin/env -S perl -wd
use v5.30.0;
use strict;
#use re 'debug';
#use Regexp::Debugger;
use Test::More tests => 16;

my @strings = (
    "today's date 10.13.2023",
    "The frequency is 10346.87Hz",
);
#
# This is such an improvement over Friedl pg 194-195...
#
my $lookAhead = qr/ (?! (?: .*\.){2,}) /x;
my $regex = qr/ ^ $lookAhead [+-]? [\d.]+ $/x;
#
# See: https://stackoverflow.com/questions/22652393/regex-1-variable-reset
# From: Brian Foy
#
# Never, never, never use $& without first testing for TRUE...
# See PP pg 781, the use of the word 'successful' :-(
# Also see my REinsights.pl for lengthy discussion from Brian Foy re
# when and whether $n, $`, $& and $' will be reset after unsuccessful
# matches.
#
for my $str (@strings) {
    say "\$str => $str";
    if ($str =~ / [+-]?[\d.]+ /x) { # Pattern fails without this step; why???
        if ($& =~ $regex) {
            say "matched";
            say "\$& => $&" if $&;
        } else {
            say "unmatched";
        }
    } else {
        say "unmatched";
    }
}

sub isFloat {
    my $case = $_[0];
    # say $case->[0], " ", $case->[1];
    if ($case->[0] =~ / [+-]?[\d.]+ /x) { # MUST test here to use $&
        my $try = $&;
        if ($case->[1] eq 'valid') {
            like $try, $regex,
                "trying ${\(sprintf(\"%-27s: \$& => %-13s\", \"\'$case->[0]\'\", $try))} \'$case->[1]\'";
        } else {
            unlike $try, $regex,
                "trying ${\(sprintf(\"%-27s: \$& => %-13s\", \"\'$case->[0]\'\", $try))} \'$case->[1]\'";
        }
    } else {
        say "$case->[0] unmatched";
    }
}

my @floats = (
    [ '0.', 'valid' ],
    [ '0.007', 'valid' ],
    [ '.757', 'valid' ],
    [ '125.89', 'valid' ],
    [ '+10789.24', 'valid' ],
    [ '+107894', 'valid' ],
    [ '-0.0008', 'valid' ],
    [ 'The temperature is 28.79C', 'valid' ],
    [ 'Frequency: 10877.45Hz', 'valid' ],
    [ '255.0.0.0', 'invalid' ],
    [ '255.aa', 'valid' ],
    [ "10.13.2023, today's date", 'invalid' ],
    [ '0.119.255.255' , 'invalid' ],
    [ 'Date: 10.13.2023 BC', 'invalid' ],
    [ '-42', 'valid' ],
    [ '2004.04.12 Friedl nomatch','invalid, Friedl pg 195' ],
);

say "\nRunning tests...";
say " with: $regex\n";
say "Testing mixed (valid and invalid) naked and embedded float strings...";
for my $case (@floats) {
    isFloat($case);
};
exit(0);
__END__

O U T P U T

1..16
$str => today's date 10.13.2023
unmatched
$str => The frequency is 10346.87Hz
matched
$& => 10346.87

Running tests...
  with: (?^ux: ^ (?^ux: (?! (?: .*\.){2,}) ) [+-]? [\d.]+ $)

Testing mixed (valid and invalid) naked and embedded float strings...
ok 1 - trying '0.'                       : $& => 0.            'valid'
ok 2 - trying '0.007'                    : $& => 0.007         'valid'
ok 3 - trying '.757'                     : $& => .757          'valid'
ok 4 - trying '125.89'                   : $& => 125.89        'valid'
ok 5 - trying '+10789.24'                : $& => +10789.24     'valid'
ok 6 - trying '+107894'                  : $& => +107894       'valid'
ok 7 - trying '-0.0008'                  : $& => -0.0008       'valid'
ok 8 - trying 'The temperature is 28.79C': $& => 28.79         'valid'
ok 9 - trying 'Frequency: 10877.45Hz'    : $& => 10877.45      'valid'
ok 10 - trying '255.0.0.0'                : $& => 255.0.0.0     'invalid'
ok 11 - trying '255.aa'                   : $& => 255.          'valid'
ok 12 - trying '10.13.2023, today's date' : $& => 10.13.2023    'invalid'
ok 13 - trying '0.119.255.255'            : $& => 0.119.255.255 'invalid'
ok 14 - trying 'Date: 10.13.2023 BC'      : $& => 10.13.2023    'invalid'
ok 15 - trying '-42'                      : $& => -42           'valid'
ok 16 - trying '2004.04.12 Friedl nomatch': $& => 2004.04.12    'invalid, Friedl pg 195'

Thanks in advance to all who choose to comment. ( I can pretty much predict those Monks who will respond since they always do if I ask a good question :-) ) Also, has anyone published a better book than Friedl, 3rd?

Replies are listed 'Best First'.
Re: Best practice validating numerics with regex?
by tybalt89 (Monsignor) on Oct 17, 2023 at 06:25 UTC

    "I'm looking for an interesting discussion of ways and means"

    #!/usr/bin/perl
    use strict; # https://www.perlmonks.org/?node_id=11154990
    use warnings;
    use feature 'bitwise';
    use List::AllUtils qw( reduce );

    my @floats = (
        [ '0.', 'valid' ],
        [ '0.007', 'valid' ],
        [ '.757', 'valid' ],
        [ '125.89', 'valid' ],
        [ '+10789.24', 'valid' ],
        [ '+107894', 'valid' ],
        [ '-0.0008', 'valid' ],
        [ 'The temperature is 28.79C', 'valid' ],
        [ 'Frequency: 10877.45Hz', 'valid' ],
        [ '255.0.0.0', 'invalid' ],
        [ '255.aa', 'valid' ],
        [ "10.13.2023, today's date", 'invalid' ],
        [ '0.119.255.255' , 'invalid' ],
        [ 'Date: 10.13.2023 BC', 'invalid' ],
        [ '-42', 'valid' ],
        [ '2004.04.12 Friedl nomatch','invalid, Friedl pg 195' ],
        [ 'all of 12.34 and 3.4.5.6 and 37' ],
        [ '300 10.16.2023 -42 255.0.0.0 lucky 7' ],
    );

    my $leftside = length reduce {$a |. $b->[0]} @floats; # auto-adjust
    for my $str ( map $_->[0], @floats ) {
        my @numbers = grep /./, $str =~ /(?|
            (?:(?:\d+\.){2,}\d+)()
            |
            ([+-]?(?:\d+(?:\.\d*)?|\.\d+))
            )/gx;
        printf "%*s %s\n", $leftside, $str, "@numbers";
    }

    Outputs:

                                      0. 0.
                                   0.007 0.007
                                    .757 .757
                                  125.89 125.89
                               +10789.24 +10789.24
                                 +107894 +107894
                                 -0.0008 -0.0008
               The temperature is 28.79C 28.79
                   Frequency: 10877.45Hz 10877.45
                               255.0.0.0
                                  255.aa 255.
                10.13.2023, today's date
                           0.119.255.255
                     Date: 10.13.2023 BC
                                     -42 -42
               2004.04.12 Friedl nomatch
         all of 12.34 and 3.4.5.6 and 37 12.34 37
    300 10.16.2023 -42 255.0.0.0 lucky 7 300 -42 7

      Extraordinary! Even multiple float candidates in a string, the point made by hv. (/g). You've sent me off on a new tangent :-) However, I hate to admit this but I have never used bitwise operations on strings and so far, the net is not a good source of examples, and perldoc List::AllUtils is a source of frustration. I've been stepping through a subset of your code with debug to get a handle on ' |. ', which I think ORs two test strings together to get the longest for the print "%*s" width expression; 36 is definitely the longest in that array. If you would, please explain for someone naive with respect to bitwise string operations this expression?

      my $leftside = length reduce {$a |. $b->[0]} @floats; # auto-adjust
      

      I think, if the next string is longer than the previous, reduce pads the length for the next test, so the char values themselves, which are Unicode, aren't themselves relevant. When the iteration finishes $leftside contains the length of the longest Unicode string. I finally gave up noodling re what the significance of ORing two chars might be, other than non-ASCII Unicode characters can be multi-byte, and settled on the notion of building up the longest 'dingus' from the set of 'dinguses'. Is that the idea?

      I would not have thought to use a bitwise operation to calculate the longest length of a set of strings though I guess that is one of the reasons why List::Util exists. My first thought was to use:

      my $leftside = length reduce { length($a) >= length($b->[0]) ? $a : $b->[0] } @floats;
      

      Or, if I got the purpose of that expression wrong, please point me to a reading assignment, other than perldoc List::AllUtils?

      Thanks tybalt89 for a very interesting example.

      Will

      U P D A T E 10/18/2023

      Thank you dasgar for insights. I used the term 'Extraordinary'; tybalt89 is brilliant, and of course, Perl is the eighth wonder of the known world. "use feature 'bitwise'" introduces |. which is useful for ORing strings, and 'bitwise' assures us that strings are treated as codepoints rather than graphemes. Why is this useful? Because length, sprintf and printf determine length attributes in codepoints, so just counting graphemes (the user visible notion of a character) can yield the wrong answer. Both length and reduce |. yield the same answer but working with bits is much faster. So, tybalt89's method of determining the longest length of the test strings in the array is the most 'efficient'. I have not benchmarked his versus mine but I have no doubt his will be an order of magnitude faster.
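Since the speed claim above is untested, one quick way to check it is the core Benchmark module. What follows is only a sketch: the sample strings and the sub names are invented for illustration, core List::Util stands in for List::AllUtils, and the numbers will vary with string count, string length, and Perl version.

```perl
use v5.28;               # enables strict, 'say', and the 'bitwise' feature
use warnings;
use List::Util qw(reduce max);
use Benchmark qw(cmpthese);

# Invented sample data; substitute the real test strings to measure your case.
my @strings = map { 'x' x $_ } 1 .. 36;

# Candidate 1: OR all strings together and take the length of the result.
sub longest_by_or     { length( reduce { $a |. $b } @strings ) }

# Candidate 2: take the maximum of the individual lengths.
sub longest_by_length { max map { length($_) } @strings }

# Run each candidate for about one CPU second and print a comparison table.
cmpthese( -1, {
    bitwise_or => \&longest_by_or,
    max_length => \&longest_by_length,
} );
```

Both subs should return the same number; only the comparison table printed by cmpthese says which is faster on a given machine.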

      The really interesting example of brilliance is that regular expression. It uses a branch reset, and as you all probably know, a branch reset ensures that any alternative defined within it that matches is captured to the same $n variable. There are two sets of naked parens in that regex within the alternation, and whichever matches will be saved to $1. Now here is a piece of brilliant regex coding that blows my mind. This alternation fragment:

        (?:(?:\d+\.){2,}\d+)()
      

      is looking for invalid decimal expressions, such as IPs, as in 113.35.120.255, which are not floats but look like floats. These are to be excluded if present and because of that branch reset, the empty () saves nothing to $1. That effectively is the logical equivalent of a negative look-ahead, but certainly is faster and more efficient than using (?!...
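To see the branch reset at work in isolation, here is a minimal sketch that reuses tybalt89's pattern on two of his test strings. The helper name extract_floats is made up for illustration; the regex itself is copied from his reply.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns every float-like token
# in $str, skipping dotted sequences such as IP addresses and dates.
sub extract_floats {
    my ($str) = @_;
    my @captures = $str =~ /(?|
          (?:(?:\d+\.){2,}\d+)()              # dotted run: consumed, capture stays empty
        | ([+-]?(?:\d+(?:\.\d*)?|\.\d+))      # float candidate: captured
        )/gx;
    return grep { length } @captures;         # drop the empty captures
}

print join( ', ', extract_floats('all of 12.34 and 3.4.5.6 and 37') ), "\n";
print join( ', ', extract_floats('300 10.16.2023 -42 255.0.0.0 lucky 7') ), "\n";
```

Because both alternatives number their group as $1, a list-context /g match returns one value per match, and the dotted runs show up as empty strings that the grep removes.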

      I posted this question in hopes of generating discussion and I got more than I bargained for; an elegant lesson in Perl magic from a master. Thank you tybalt89 for a welcome dose of enlightenment.

      Will

        I probably shouldn't be responding because I don't fully understand tybalt89's code. I think I understand part of it, but not sure I understand it well enough to try to explain to anyone.

        In looking at the my $leftside line, I tried working my way from inside out.

        For the |., I found the documentation for Bitwise String Operators. It looks like the combination of use feature 'bitwise' and |. means that bitwise string OR operation was used. I don't fully understand bitwise string OR, but from that documentation it looks like the result is a string that has the same length as the longer of the two strings used in the operation. And in tybalt89's code, the resulting string is not important - only the length of it is important.
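A tiny experiment makes the length behaviour concrete. This is just a sketch with made-up strings; it relies on the documented rule that for | the shorter operand is treated as if padded with zero bits.

```perl
use v5.28;        # the feature bundle for 5.28 and later includes 'bitwise'
use warnings;

my $short  = 'ab';
my $long   = 'wxyz';

# Character-by-character OR; the shorter operand acts as if zero-padded,
# so the tail of the longer string passes through unchanged.
my $merged = $short |. $long;

say "merged: $merged";
say 'length: ', length $merged;
```

This prints "merged: wzyz" and "length: 4": the first two characters are ORed pairwise ('a' with 'w', 'b' with 'x'), and 'yz' is carried over as-is, which is why the result is always as long as the longer operand.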

        The next level is the reduce function. I think I get the gist of what's happening, but not sure that I can explain it well. In the code, @floats is an AoA structure. I think that the reduce function here is being used with the bitwise string OR operator to find the longest length string of the first element of the second level array. (By second level array, I am referring to the level that has 'valid' and 'invalid' strings as the second element.)

        After the reduce function does its work, then the length of the final resulting string is assigned to the $leftside variable. In the printf statement, the $leftside variable is used to create a right-justified 'field' where the $str variable (the first element of the second level of the @floats AoA data) is printed.
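One subtlety worth checking, offered as an observation rather than a correction: reduce seeds $a with the first list element, which in the original expression is an array reference, not a string. It stringifies to something like ARRAY(0x...), so the computed width is right only when some later string is at least that long (true for the 36-character entry in the thread's data). A sketch with invented data, using core List::Util in place of List::AllUtils:

```perl
use v5.28;
use warnings;
use List::Util qw(reduce);

my @rows = ( [ 'a', 'valid' ], [ 'bb', 'valid' ] );   # invented sample data

# As in the thread: on the first call $a is $rows[0] itself, an array reference,
# so its stringified address takes part in the OR.
my $from_refs = length( reduce { $a |. $b->[0] } @rows );

# Extracting the strings first means reduce only ever sees strings.
my $from_strings = length( reduce { $a |. $b } map { $_->[0] } @rows );

say "from refs:    $from_refs";
say "from strings: $from_strings";
```

With data this short the first result is inflated by the stringified reference, while the second is 2, the length of 'bb'.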

        I admit that I'm getting lost with the regex due to my low level skill/knowledge with regexes. Treating that as a black box and looking at the inputs and outputs, it seems like the regex is pulling out valid float number values from the $str variable to put into the @numbers array, which in turn is used in the printf statement.

        I probably didn't accurately describe things, but I tried to explain what I think I understand about tybalt89's code. Not sure if it helps you to gain a better understanding of the code or not.

        In the spirit of TIMTOWTDI here's a couple of ways to get the length of the longest string.

        #!/usr/bin/perl
        use strict; # https://www.perlmonks.org/?node_id=11155013
        use warnings;
        use feature 'bitwise';
        use List::AllUtils qw( reduce max );
        $SIG{__WARN__} = sub { die @_ };

        my $longest;
        my @strings = split ' ', <<END;
        one two three four five six seven eight nine ten
        END

        $longest = max map length, @strings; # maybe simplest
        print "longest: $longest\n";

        $longest = max map y///c, @strings; # golf trick, one char shorter :)
        print "longest: $longest\n";

        $longest = length reduce { $a |. $b } @strings; # or'ing strings
        print "longest: $longest\n";

        $longest = reduce { max $a, length $b } 0, @strings; # internal max()
        print "longest: $longest\n";

        # takes a lot of length()'s
        $longest = length reduce { length($a) >= length($b) ? $a : $b } @strings;
        print "longest: $longest\n";
Re: Best practice validating numerics with regex?
by hv (Prior) on Oct 16, 2023 at 22:30 UTC

    I note that each of your test strings has either a single valid number, or a single invalid sequence of digits and dots. What should be the result of, eg, '12.34 and 56.78', or '12.34 on 2004.04.12' or 'on 2004.04.12 found 12.34'?

    Your question is also a little unclear in that you ask that the answer be "efficiently" achieved; but "efficient" is a relative term - what counts as "efficient enough"? (Is there a maximum length of strings that must be parsed within whatever limits you set?)

    As far as I remember the CUT operator /(?>...)/ is not implemented in a hugely efficient manner, but it may well be sufficient for your needs. You might use it something like this:

    m{ ^ [^-+\d.]* (?> ( (?# cut and capture) [-+]? (?: \.\d+ | \d+ (?: \. \d* )? ) ) ) (?!\.) }x

    This (giving the result in $1) appears to pass your existing tests, and finds "12.34" for my first two additional cases and no match for the third.

    Update: on second thoughts, the cut should not be necessary, just need to expand the trailing negative lookahead:

    m{ ^ [^-+\d.]* ( [-+]? (?: \.\d+ | \d+ (?: \. \d* )? ) ) (?![.\d]) }x
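For anyone who wants to try the updated pattern, here is a small sketch that wraps it in a function and runs it on a few strings from the thread. The pattern is hv's, only spread over several lines; the wrapper name first_float is hypothetical.

```perl
use strict;
use warnings;

# hv's updated pattern, copied from the reply above and laid out with comments.
my $first_float = qr{
    ^ [^-+\d.]*                                   # skip leading non-numeric text
    ( [-+]? (?: \.\d+ | \d+ (?: \. \d* )? ) )     # capture the candidate
    (?![.\d])                                     # reject if more dots or digits follow
}x;

# Hypothetical wrapper: the first float, or undef if the pattern does not match.
sub first_float {
    my ($str) = @_;
    return $str =~ $first_float ? $1 : undef;
}

for my $str ( '12.34 and 56.78', 'on 2004.04.12 found 12.34', '255.aa' ) {
    my $found = first_float($str);
    printf "%-27s => %s\n", "'$str'", defined $found ? $found : '(no match)';
}
```

On these inputs it should report 12.34, no match, and 255. respectively. The second case fails by design: the leading ^ [^-+\d.]* can only skip text that contains no digits, dots, or signs, so a date that appears first blocks any float that follows it.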

    Note also that the efficiency of CUT is suboptimal mainly when it is being hit repeatedly (eg as part of an alternation in a larger pattern), so it should be fine in this case anyway.

    Hugo

      'Efficient' is whether an alternative regular expression benchmarks faster than an existing solution. I'm looking for an approach that validates floats, or any other complex 'thing' embedded in a string, with a single regex rather than the two-step approach in the example, and no, I did not (yet) attempt a solution that extracts multiple float candidates from a single string (/g is likely). An example is just that, an example, that one can build on once one understands the limitations of one approach and the additional capabilities of an alternative approach. I tried to make it clear in my write-up that I am trying to build on lots of experience and knowledge gained from studying Friedl, without access to anything later (he used 5.8.8) or more advanced than Friedl. Cookbook, 2nd takes the regex technology only up to 5.14, so it misses the mark too on illuminating the regex state-of-the-art. More to do...

        I'm looking for an approach that validates floats, or any other complex 'thing' embedded in a string, with a single regex rather than the two-step approach in the example

        Generally matching "x but not y" is much harder than matching "x" on its own. The "float but not date" example is a fairly simple case: you can express it as / (?<! [-+.\d]) $re_float (?! [.\d]) /x, but there's quite a bundle of knowledge about the logic of a float getting distilled into that preamble and postamble. Automating that distillation for a generic "this complex thing (but not this other complex thing)" is likely to be somewhere between impossible and unprofitable.
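The preamble and postamble can be tried out as follows. This is a sketch under stated assumptions: the thread never defines $re_float, so the definition below is borrowed from the float alternation in the earlier patterns, and the helper name floats_in is made up.

```perl
use strict;
use warnings;

# Assumed definition: $re_float is not shown in the thread, so this reuses
# the float alternation from the earlier anchored patterns.
my $re_float = qr/ [-+]? (?: \.\d+ | \d+ (?: \.\d* )? ) /x;

# Hypothetical helper: every float that is not glued to other digits, dots,
# or signs on either side. Intended to be called in list context.
sub floats_in {
    my ($str) = @_;
    return $str =~ / (?<! [-+.\d]) ($re_float) (?! [.\d]) /gx;
}

print join( ', ', floats_in('on 2004.04.12 found 12.34 and -42') ), "\n";
```

Unlike the anchored version, this form is not tied to the start of the string, so it skips the date and still finds 12.34 and -42 after it.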

        I haven't looked at Friedl since shortly after the first edition was published; I'd certainly recommend having a look through all of perlre and having a play with any construct that is new to you.

        More generally: context is everything. What is faster in one context is often slower in a different context. So if you have a problem you're trying to solve for which your existing solution isn't as fast as you want, you should provide it (or something like it) as the benchmark. If you're looking for something that is always better regardless of context, I don't think you'll find it.

        For more complex parsing tasks I would also recommend looking at Regexp::Grammars. Making such a grammar fast can take some fiddling, but they make complexity a lot easier to deal with.

        It's fairly easy to create a single high-performance regex that will capture the first (or every) valid float in a string. I would think that the main reason to use two regexes (one to cast a broad net, and one to validate it) would be to helpfully report syntax errors instead of skipping over them and reporting a more generic error. Is that why you're trying to do this?

        I'm also not clear on your question, really. (But I also don't have the book you are referencing.)

        my $lookAhead = qr/ (?! (?: .*\.){2,}) /x;
        my $regex = qr/ ^ $lookAhead [+-]? [\d.]+ $/x;
        ...
        for my $str (@strings) {
            say "\$str => $str";
            if ($str =~ / [+-]?[\d.]+ /x) { # Pattern fails without this step; why???
                if ($& =~ $regex) {

        Your $regex uses '^' and '$', so of course you would need to load the digits into an isolated string first, so I'm guessing I don't understand the question. Could you show an example of the code construct that fails that you think should succeed?
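The effect of the anchors is easy to reproduce. The sketch below uses the two patterns from the original post and one of its sample strings; the variable names $whole_ok, $candidate, and $part_ok are introduced here for illustration.

```perl
use strict;
use warnings;

# The two patterns from the original post.
my $lookAhead = qr/ (?! (?: .*\.){2,}) /x;
my $regex     = qr/ ^ $lookAhead [+-]? [\d.]+ $/x;

my $str = 'The frequency is 10346.87Hz';

# Applied to the whole string, ^ and $ require every character to be a sign,
# digit, or dot, so the surrounding words make the match fail.
my $whole_ok = $str =~ $regex ? 1 : 0;

# Applied to the isolated candidate, the same pattern succeeds.
my ($candidate) = $str =~ / ( [+-]? [\d.]+ ) /x;
my $part_ok = ( defined $candidate && $candidate =~ $regex ) ? 1 : 0;

print "whole string: ", ( $whole_ok ? 'matched' : 'unmatched' ), "\n";
print "candidate '$candidate': ", ( $part_ok ? 'matched' : 'unmatched' ), "\n";
```

The whole string is reported as unmatched and the extracted 10346.87 as matched, which is the two-step behaviour described in the question.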

        Oops, my bad. I wrote this comment re my definition of 'efficient' without logging in, so it is cataloged under anonymous rather than me, perlboy_emeritus. Perhaps some kind soul with admin rights can attach my real ID to that post. And perhaps I'm overstepping the purpose of perlmonks.org? I'm looking for an interesting discussion of ways and means rather than a single solution to a pending problem. Perhaps that is not what perlmonks.org is for, and if I am out of line, I will stop posting these questions.

        Will

Re: Best practice validating numerics with regex?
by haukex (Archbishop) on Oct 18, 2023 at 08:27 UTC