in reply to Best practice validating numerics with regex?

I note that each of your test strings has either a single valid number, or a single invalid sequence of digits and dots. What should be the result of, eg, '12.34 and 56.78', or '12.34 on 2004.04.12' or 'on 2004.04.12 found 12.34'?

Your question is also a little unclear in that you ask that the answer be "efficiently" achieved; but "efficient" is a relative term - what counts as "efficient enough"? (Is there a maximum length of strings that must be parsed within whatever limits you set?)

As far as I remember, the CUT operator /(?>...)/ is not implemented in a hugely efficient manner, but it may well be sufficient for your needs. You might use it something like this:

m{ ^ [^-+\d.]* (?> ( (?# cut and capture) [-+]? (?: \.\d+ | \d+ (?: \. \d* )? ) ) ) (?!\.) }x

This (giving the result in $1) appears to pass your existing tests, and finds "12.34" for my first two additional cases and no match for the third.

Update: on second thoughts, the cut should not be necessary; I just need to expand the trailing negative lookahead:

m{ ^ [^-+\d.]* ( [-+]? (?: \.\d+ | \d+ (?: \. \d* )? ) ) (?![.\d]) }x

Note also that the efficiency of CUT is suboptimal mainly when it is being hit repeatedly (eg as part of an alternation in a larger pattern), so it should be fine in this case anyway.
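For anyone who wants to try it, here is a minimal sketch that runs the updated pattern against my three additional strings. The pattern is the one above, just spread out with comments; the names $first_float and first_float() are only for illustration.

```perl
use strict;
use warnings;

# The updated pattern from above, compiled once; the number lands in $1.
my $first_float = qr{
    ^ [^-+\d.]*                 # skip leading text that cannot start a number
    ( [-+]?                     # optional sign
      (?: \.\d+                 # fraction only, such as .5
        | \d+ (?: \. \d* )?     # integer part with optional fraction
      )
    )
    (?![.\d])                   # must not run on into another dot or digit
}x;

# Illustrative helper: return the first number, or undef if there is none.
sub first_float {
    my ($str) = @_;
    return $str =~ $first_float ? $1 : undef;
}

for my $str ('12.34 and 56.78',
             '12.34 on 2004.04.12',
             'on 2004.04.12 found 12.34') {
    my $found = first_float($str);
    printf "%-28s => %s\n", "'$str'", defined $found ? $found : 'no match';
}
```

As with the first version, this should report 12.34 for the first two strings and no match for the third, since the pattern is anchored at the start and will not skip past the date to reach the later number.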

Hugo

Re^2: Best practice validating numerics with regex?
by Anonymous Monk on Oct 16, 2023 at 23:52 UTC
    'Efficient' is whether an alternative regular expression benchmarks faster than an existing solution. I'm looking for an approach that validates floats, or any other complex 'thing' embedded in a string, with a single regex rather than the two-step approach in the example, and no, I did not (yet) attempt a solution that extracts multiple float candidates from a single string (/g is likely). An example is just that, an example, that one can build on once one understands the limitations of one approach and the additional capabilities of an alternative approach. I tried to make it clear in my write-up that I am trying to build on lots of experience and knowledge gained from studying Friedl, without access to anything later (he used 5.8.8) or more advanced than Friedl. Cookbook, 2nd takes the regex technology only up to 5.14, so it misses the mark too on illuminating the regex state-of-the-art. More to do...

      I'm looking for an approach that validates floats, or any other complex 'thing' embedded in a string, with a single regex rather than the two-step approach in the example

      Generally matching "x but not y" is much harder than matching "x" on its own. The "float but not date" example is a fairly simple case: you can express it as / (?<! [-+.\d]) $re_float (?! [.\d]) /x, but there's quite a bundle of knowledge about the logic of a float getting distilled into that preamble and postamble. Automating that distillation for a generic "this complex thing (but not this other complex thing)" is likely to be somewhere between impossible and unprofitable.
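      A minimal sketch of that lookaround wrapper, for illustration only. The definition of $re_float below is an assumption (it reuses the float alternation from earlier in the thread), and $re_lone_float and lone_floats() are made-up names.

```perl
use strict;
use warnings;

# Assumed definition of $re_float; the post above does not spell it out.
my $re_float = qr/ [-+]? (?: \.\d+ | \d+ (?: \.\d* )? ) /x;

# Preamble: not preceded by a sign, dot or digit.
# Postamble: not followed by a dot or digit.
my $re_lone_float = qr/ (?<! [-+.\d] ) ( $re_float ) (?! [.\d] ) /x;

# Illustrative helper: in list context, /g returns every captured float.
sub lone_floats {
    my ($str) = @_;
    return $str =~ /$re_lone_float/g;
}

for my $str ('12.34 and 56.78',
             '12.34 on 2004.04.12',
             'on 2004.04.12 found 12.34') {
    my @found = lone_floats($str);
    print "'$str' => ", (@found ? join(', ', @found) : 'none'), "\n";
}
```

      Because nothing anchors this to the start of the string, it should pick up 12.34 in the third string while still rejecting every piece of the date.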

      I haven't looked at Friedl since shortly after the first edition was published; I'd certainly recommend having a look through all of perlre and having a play with any construct that is new to you.

      More generally: context is everything. What is faster in one context is often slower in a different context. So if you have a problem you're trying to solve for which your existing solution isn't as fast as you want, you should provide it (or something like it) as the benchmark. If you're looking for something that is always better regardless of context, I don't think you'll find it.

      For more complex parsing tasks I would also recommend looking at Regexp::Grammars. Making such a grammar fast can take some fiddling, but they make complexity a lot easier to deal with.

      It's fairly easy to create a single high-performance regex that will capture the first (or every) valid float in a string. I would think that the main reason to use two regexes (one to cast a broad net, and one to validate it) would be to helpfully report syntax errors instead of skipping over them and reporting a more generic error. Is that why you're trying to do this?
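      To make the two-regex idea concrete, here is a rough sketch of "broad net, then validate" used for error reporting. Both patterns and the name classify_candidates() are my own assumptions, not code from the original question.

```perl
use strict;
use warnings;

# Illustrative helper: split number-like runs into valid and malformed ones.
sub classify_candidates {
    my ($str) = @_;
    my (@ok, @bad);

    # Broad net: an optional sign followed by any run of digits and dots.
    while ($str =~ / ( [-+]? [\d.]+ ) /xg) {
        my $candidate = $1;    # copy before the next match clobbers $1

        # Strict check: the whole candidate must be a well-formed float.
        if ($candidate =~ / \A [-+]? (?: \.\d+ | \d+ (?: \.\d* )? ) \z /x) {
            push @ok, $candidate;
        }
        else {
            push @bad, $candidate;
        }
    }
    return (\@ok, \@bad);
}

my ($ok, $bad) = classify_candidates('on 2004.04.12 found 12.34');
print "valid:     @$ok\n";
print "malformed: @$bad\n";
```

      The second list is what a single skip-over-the-junk regex would not give you: the malformed candidates themselves, ready to be quoted in an error message.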

      I'm also not really clear on your question (but I also don't have the book you are referencing).

      my $lookAhead = qr/ (?! (?: .*\.){2,}) /x;
      my $regex = qr/ ^ $lookAhead [+-]? [\d.]+ $/x;
      ...
      for my $str (@strings) {
          say "\$str => $str";
          if ($str =~ / [+-]?[\d.]+ /x) { # Pattern fails without this step; why???
              if ($& =~ $regex) {

      Your $regex uses '^' and '$', so of course you would need to load the digits into an isolated string first, so I'm guessing I don't understand the question. Could you show an example of the code construct that fails that you think should succeed?
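      Here is a small sketch of what I mean, using the two patterns quoted above. The sample string and the helper name anchored_ok() are assumptions of mine, since I don't know what your real test data looks like.

```perl
use strict;
use warnings;

# The two patterns as quoted above.
my $lookAhead = qr/ (?! (?: .*\.){2,}) /x;
my $regex     = qr/ ^ $lookAhead [+-]? [\d.]+ $/x;

# Illustrative helper: does the anchored pattern accept the whole string?
sub anchored_ok {
    my ($s) = @_;
    return $s =~ $regex ? 1 : 0;
}

my $str = 'value 12.34';    # assumed sample input

# Against the whole string, '^' forces the number to start at position 0.
print "whole string: ", anchored_ok($str), "\n";

# Isolate the digits and dots first, then the anchors have a chance.
my ($candidate) = $str =~ / ( [+-]? [\d.]+ ) /x;
print "isolated '$candidate': ", anchored_ok($candidate), "\n";

# A date-like run is isolated fine but rejected by the two-dot lookahead.
print "isolated '2004.04.12': ", anchored_ok('2004.04.12'), "\n";
```

      The first and third checks should fail and the second should succeed, which is why the extraction step looks necessary as long as $regex keeps its '^' and '$'.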

      Oops, my bad. I wrote this comment re my definition of 'efficient' without logging in, so it is cataloged under anonymous rather than me, perlboy_emeritus. Perhaps some kind soul with admin rights can attach my real ID to that post. And perhaps I'm overstepping the purpose of perlmonks.org? I'm looking for an interesting discussion of ways and means rather than a single solution to a pending problem. Perhaps that is not what perlmonks.org is for, and if I am out of line, I will stop posting these questions.

      Will

        And perhaps I'm overstepping the purpose of perlmonks.org? I'm looking for an interesting discussion of ways and means rather than a single solution to a pending problem. Perhaps that is not what perlmonks.org is for, and if I am out of line, I will stop posting these questions.

        You are not overstepping the purpose of perlmonks.org. That Perl Monks is a very different place to Stack Overflow is indicated by this classic quote from Perl Monks pioneer tye:

        Most languages are like stackoverflow: I have a question, I want the best answer. Perl is like PerlMonks: I have a doubt, I want to read an interesting discussion about it that is likely to go on a tangent. q-:

        To improve your regex, as noted by Discipulus here, I suggest you check out every node written by tybalt89 ... oh, and given it provides a "single regular expression that defines a set of independent subpatterns suitable for matching entire Perl documents", you might also enjoy studying PPR, written by TheDamian.

        👁️🍾👍🦟