jehuni has asked for the wisdom of the Perl Monks concerning the following question:

So, let's say I need a single regular expression to both check for a pattern match within a string and to verify that the total length of the string does not exceed a given number of characters. Yes, I know that it would be better to check the length with a separate call to length, but in this case I need one regex and nothing more.

Here's what I came up with, but I was curious as to whether there was a better way to do this. This example assumes that you need a string with the word "foo" somewhere in it, but the total length of the string itself can't be more than 50 characters.

/^(?=.{0,50}$).*foo/i
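A minimal sketch for trying the pattern out. The test strings are made up for illustration and are not from the original post; it only exercises the length cap and the case-insensitive "foo" check on single-line input.

```perl
use strict;
use warnings;

# The pattern from the question: the lookahead caps the total length at 50,
# then the rest of the pattern looks for "foo" anywhere (case-insensitive).
my $re = qr/^(?=.{0,50}$).*foo/i;

# Hypothetical test inputs, chosen to sit on either side of the limit.
my @cases = (
    [ 'short with foo',   'a string with Foo in it' ],
    [ 'exactly 50 chars', 'foo' . 'x' x 47          ],
    [ '51 chars',         'foo' . 'x' x 48          ],
    [ 'short, no foo',    'nothing to see here'     ],
);

for my $case (@cases) {
    my ($label, $str) = @$case;
    printf "%-18s len=%2d  %s\n", $label, length($str),
        ($str =~ $re ? 'match' : 'no match');
}
```

Only the first two inputs should be reported as matches.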

-jehuni

Re: validating string length with a regular expression
by tachyon (Chancellor) on Mar 18, 2002 at 12:24 UTC

    Test your code with this string (has a foo and is less than 50 chars so should match):

    $_ = "foo\n Oops";
    print /^(?=.{0,50}$).*foo/i ? "Matches!" : $_;

    This fixes your regex:

    /(?=^.{0,50}\z).*foo/si

    \z matches only at the very end of the string (unlike $, which also matches just before a trailing \n), and /s lets . match everything, including \n. With $ in place of \z, a 51-character string whose last character is \n would still get through, so \z is the stricter choice; either way it is good to know the difference. I originally suggested moving the ^ ... \z into the lookahead so that the .* could be dropped (death to dot star!), but the .* is needed after all. Oops, updated per jehuni's comment below.
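A small sketch comparing the two patterns on newline edge cases. The second test string (50 characters plus a trailing newline) is a made-up case, not one from the thread.

```perl
use strict;
use warnings;

my $original = qr/^(?=.{0,50}$).*foo/i;     # pattern from the question
my $fixed    = qr/(?=^.{0,50}\z).*foo/si;   # pattern suggested above

# Embedded newline, well under 50 characters: should be accepted.
my $embedded = "foo\n Oops";

# Hypothetical edge case: 50 characters plus a trailing newline (51 total),
# which should be rejected.
my $trailing = 'foo' . 'x' x 47 . "\n";

for my $str ($embedded, $trailing) {
    (my $shown = $str) =~ s/\n/\\n/g;   # make newlines visible in the output
    printf "len=%2d original=%d fixed=%d  %s\n", length($str),
        ($str =~ $original ? 1 : 0), ($str =~ $fixed ? 1 : 0), $shown;
}
```

The original pattern rejects the first string (. stops at the embedded newline) and accepts the second ($ matches before the final newline); the version with \z and /s gets both right.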

    This is a really silly way to do it that fulfils your criteria of being a single regex :-)

    $str =~ s/(foo)/&do_stuff($str) if length($str) <= 50; $1/e;

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

      Even though I don't expect to come across newline characters, your suggestion of using \z and /s is a good one. I definitely only want strings of 50 characters or less -- no matter what characters they are -- to match.

      Unfortunately, I don't see a way around using the dreaded .* before the actual pattern I'm looking for. Otherwise it will only match when the pattern occurs at the beginning of the string, since (?=^.{0,50}\z) is a zero-width assertion that's anchored to the start of the string. My original version had the ^ anchor inside the lookahead, like yours, but since it seemed that I had to use .* in either case, I decided to move it outside of the lookahead. However, it's probably clearer to leave it inside.
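To illustrate the anchoring point, here is a short sketch with two made-up strings. Because the lookahead contains ^, it can only succeed at position 0, so without the .* the literal foo must also start at position 0.

```perl
use strict;
use warnings;

my $without_dotstar = qr/(?=^.{0,50}\z)foo/si;     # foo must start the string
my $with_dotstar    = qr/(?=^.{0,50}\z).*foo/si;   # foo may appear anywhere

# Hypothetical inputs: foo at the start versus foo at the end.
for my $str ('foo at the start', 'ends with foo') {
    printf "%-18s without=%d with=%d\n", $str,
        ($str =~ $without_dotstar ? 1 : 0),
        ($str =~ $with_dotstar    ? 1 : 0);
}
```

Only the version with .* matches 'ends with foo'.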

      Also, to clarify my question further, I actually need a matching regex and not a substitution regex. Maybe this is more like golf than I originally realized ...

      -jehuni

        Point taken about the anchoring. Insufficient testing with unusual edge case acknowledged :o)

        cheers

        tachyon


Re: validating string length with a regular expression
by broquaint (Abbot) on Mar 18, 2002 at 11:56 UTC
    Hmmm, that looks for 0-50 characters after the beginning of the string up to the end of the string, then it matches any character as many times as possible up to the last 'foo' in the string. So I guess that'll match your requirements (and testing seems to prove so).
    I can't think of a way to nicely match both length *and* a string with one regex (but then again, I'm no regex ninja ;-), so perhaps this would suffice.
    &do_stuff($str) if length($str) <= 50 and $str =~ /foo/;
    For a far better explanation of the regex check out japhy's superb YAPE::Regex::Explain.
    HTH

    broquaint

•Homework alert! Re: validating string length with a regular expression
by merlyn (Sage) on Mar 18, 2002 at 17:06 UTC
    but in this case I need one regex and nothing more.
    I always question design goals such as these. I distrust question posers who give such artificial requirements. It's a bit like watching an episode of MacGyver, in how the responses come out.

    Trouble is, programming isn't MacGyver. We do have the ability to include a call to length somewhere, so why the artificial requirement?

    Two reasons come to mind:

    1. The design that forced this decision is bad, in which case we must know more about the context to help redesign that part of the program, or
    2. It's Homework, in which case we should not answer the question at all.

    I smell homework. Please prove otherwise, by stating the bad design decision more fully.

    -- Randal L. Schwartz, Perl hacker

      Here's the reason: I'm using Data::FormValidator to validate HTML form input. For those not familiar with this module, basically it allows you to pass in a validation profile which contains various "input specifications" that tell it how to validate your data. There is a "constraints" input specification which allows you to specify validation constraints, including coderefs, so no problem using length there (see the example below). However, there is also an input specification "constraint_regexp_map" which allows you to apply a constraint to any fields whose names match a supplied regex (also in the example below). Unfortunately, in this case, you cannot pass it a coderef -- only a regex or the name of a built-in (built into Data::FormValidator, that is) validation function.

      Here's an example profile:

      my $profile = {
          index => {
              required => [ qw(firstname surname address1 postcode email) ],
              optional => [ qw(middlename address2 address3 address4) ],
              constraints => {
                  postcode => '/^[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}$/i',
                  email    => {
                      constraint => sub {
                          return valid_email($_[0]) && length($_[0]) <= 100;
                      },
                      params => [ 'email' ],
                  },
              },
              constraint_regexp_map => {
                  '/name$/'    => '/(?=^.{0,25}$)[[:print:]]*$/i',
                  '/^address/' => '/(?=^.{0,50}$)[[:print:]]*$/i',
              },
          },
      };

      So, the answer is probably 1) it's a poor design on the part of Data::FormValidator. I looked at the internals of Data::FormValidator, and I could patch it to accept coderefs, but at the time it was more work than I was willing to do. As in the example above, the "constraints" input specification allows you to supply both a coderef and a list of parameters, which have to be names of form fields. It then calls your coderef with the values of those fields as the parameters. In most cases, you would obviously want to pass it the name of the field to which the constraint applies, although it's not required.

      I pondered adding support for backreferences within the names of the form fields, so you could have it match /^address(.*)$/ and then pass it "address$1" as a param. However, due to issues with scoping and eval and etc. and etc., I decided not to mess with a patch for now and just see if I could come up with a single matching regex. Hence the question.

      -jehuni

        Rather than a callback coderef, you could simply add a "max length" parameter. That'd be a little more specific, and in line with the other parameters.

        So, it was "bad design" rather than homework. Yup. Was equally likely in my book, hence the question.

        -- Randal L. Schwartz, Perl hacker

      <wild speculation>

      Although it's not my question, I can imagine a situation where someone is validating a set of variables which contain user-supplied data. A simple design for validating them might involve a hash table which correlates each data type to an "allowable" regex pattern.

      Not that there aren't ways around this (perhaps using subroutines instead of regexes). But I'll admit that I've done it before, cramming a lot of data validation into a single regex for this purpose.

      </wild speculation>
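A rough sketch of that kind of table-driven validation in plain Perl. It does not use Data::FormValidator; the rule table, the invalid_fields helper, and the sample form data are all hypothetical names made up for illustration. Each value pattern combines a length cap (the lookahead) with a character check.

```perl
use strict;
use warnings;

# Hypothetical table: [ field-name pattern, pattern the value must match ].
my @rules = (
    [ qr/name$/,    qr/(?=^.{0,25}\z)[[:print:]]*\z/s ],
    [ qr/^address/, qr/(?=^.{0,50}\z)[[:print:]]*\z/s ],
);

# Returns the names of fields whose values fail a rule that applies to them.
sub invalid_fields {
    my ($input) = @_;
    my @bad;
    for my $field (sort keys %$input) {
        for my $rule (@rules) {
            my ($name_re, $value_re) = @$rule;
            next unless $field =~ $name_re;
            push @bad, $field unless $input->{$field} =~ $value_re;
        }
    }
    return @bad;
}

# Made-up form input for the demonstration.
my %form = (
    firstname => 'Jane',
    surname   => 'x' x 26,            # one character over the 25-character cap
    address1  => '1 Example Street',
);
print join(', ', invalid_fields(\%form)), "\n";
```

With the sample data, only surname should be reported.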

      Update: I'm good. But I'm slow. Ah, well...

      buckaduck

        Bull's eye!

        Anyway, this is my first experience with Data::FormValidator (formerly HTML::FormValidator, I believe). It seems to do what I want, but if anyone has any other recommendations for a generic sort of input validation module, please share. My use of Data::FormValidator is based partly on a Super Search of perlmonks, so I'd love to hear of any other possibilities that I may have overlooked.

        -jehuni

      It could just be a badly-worded golf.
      #234567890#234567890#234567890
      $_=$foo;
      $_&&length<=50&&/foo/;
      /(?=.{0,50}\z)foo/s;
      Take your pick. :-)

      ------
      We are the carpenters and bricklayers of the Information Age.

      Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.