raymorris has asked for the wisdom of the Perl Monks concerning the following question:

Is there any way in Perl to match the leading part of a regex? I see that in C it can be done with PCRE, but I'd like to do it in Perl. I am attempting to determine whether a string matches the BEGINNING of an arbitrary regular expression. In other words, is it true that "it matches so far, we'll have to look at the rest of the string to see if the entire regex matches". A few examples might make it more clear, using the fake operator ^~ to mean "leading match":

    'bob' ^~ /^bobby/      = true
    'bob' ^~ /^fred/       = false
    'bob' ^~ /^bo*[a-z]./  = true
    'bob' ^~ /^ch*/        = false

It only needs to work for regular expressions that are anchored at the beginning. Obviously, if the regex isn't anchored at the beginning, ANY string could match, because:

    /foo/ = /.*foo.*/
    therefore: 'bob' ^~ /foo/ = true

Any ideas on how to do this? The practical application is that I have many regexes such as this:

/^c:\\Windows\\Program Files\\blah[0-9]\\setup.exe/

Walking the drive, if I come to c:\Windows\Program Files\, that is a leading match, so I SHOULD recurse into the directory to see if it contains blah[0-9]\. On the other hand, I should not recurse into c:\Temp, because nothing starting with "c:\Temp\" can ever match ^c:\\windows\\Program ...

The naive/wrong (and current) implementation is to split the regex on \\ and match directory names. That obviously fails in many cases, such as the regex /windows(\\system32)?\\bob/ .

It looks like PCRE can do it in C, but I'd like to do it in Perl: http://www.pcre.org/current/doc/html/pcre2partial.html

Any ideas?

Replies are listed 'Best First'.
Re: Regex partial/leading match
by choroba (Cardinal) on Dec 31, 2015 at 22:18 UTC
    I'm not sure this works for you, but it might: Just try all the shorter regexes if they're valid.
    #!/usr/bin/perl
    use warnings;
    use strict;
    use feature qw{ say };

    my $bob = 'bob';

    sub match {
        my ($string, $regex) = @_;
        say $string, "\t", $regex;
        my @chars = split //, $regex;
        for my $pos (1 .. $#chars) {
            my $re_part = join q(), @chars[0 .. $pos];
            return 1 if eval { $string =~ /^($re_part)$/ and length $1 };
        }
        return 0
    }

    for my $regex (
        'bobly',
        'bo*[a-z].',
        'bob',
        'bo(x)?bcd',
        'fred',
        'o*',
    ) {
        say match('bob', $regex);
    }
    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
      Thank you, unfortunately that works only for simple strings, not regexes. Consider:
      'bated' =~ /^bat{3}/;   false
      'bated' =~ /^bat/;      true
      I'm trying to look at the beginning of a string to know if it can potentially match a regex, without knowing the full string. Which is of course different from making a completely different regex with substr($regex.

      (Edited for bad examples)

Re: Regex partial/leading match
by stevieb (Canon) on Dec 31, 2015 at 21:21 UTC

    Hi raymorris, welcome to the Monastery, and Happy New Year!

    I could be wrong here, but this looks like an XY Problem, and I suspect you may be looking at the issue backwards. Could you show us some code so we can see what you're trying to do in a real case?

    Cheers,

    -stevieb

      It is possible it's an XY problem, which is why I gave the example usage. From an external source, we have a large number of regular expressions. Two examples are:
      /^C:\\Windows\\Program Files\\blah[0-9]\\setup.exe/
      /^C:\\Windows(\\SYSTEM)?\\Foo\\Bar.dll/
      The second example is probably most interesting. It matches either C:\Windows\Foo\Bar.dll or C:\Windows\SYSTEM\Foo\Bar.dll.

      The regular expressions are from an external source, so we can't change the fact that we get them in that format. We must recurse through a drive to find files matching the expression.

      Suppose we come across the directory C:\Temp\ . Intuitively, we know we don't need to recurse into C:\Temp\ because nothing in that directory can match either regex. Because we don't have a leading match, we should return false immediately. On the other hand, when we come to C:\Windows\, we SHOULD recurse, because it matches the leading part of the regex and we may find a full match if we keep going. This is obvious to a human; the trick is how to tell Perl to skip anything that can't match (even as more characters are added to the END of the string).

      I'm trying to think of any way to go "up another level", to see the problem from a higher view, but there really isn't anything I can think of. The regular expressions are externally supplied and we must find files on a drive which match them. For efficiency, we wish to avoid looking for files in directories that can't possibly match. The regex engine does this internally, I believe, but I don't know if it exposes the "matched length" of an unmatched regex to Perl.
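The pruning walk described above can be sketched with the core File::Find module. This is only a sketch under assumptions: `find_matching` and the `$may_match` callback are hypothetical names (not from the thread), and the callback stands in for whatever "leading match" test is eventually used. Note that File::Find reports paths with forward slashes, so backslash-style patterns would need the paths normalized first.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: walk $root, return files whose path matches $full_re,
# and skip every directory for which $may_match->($dir) returns false.
sub find_matching {
    my ($root, $full_re, $may_match) = @_;
    my @hits;
    find({
        no_chdir => 1,                      # $File::Find::name stays usable as a path
        wanted   => sub {
            my $path = $File::Find::name;
            if (-d $path) {
                # Setting prune stops File::Find from descending into this directory.
                $File::Find::prune = 1 unless $may_match->($path);
                return;
            }
            push @hits, $path if $path =~ $full_re;
        },
    }, $root);
    return @hits;
}
```

The callback is also invoked for `$root` itself, so it has to accept the starting directory or nothing will be searched.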

      PS - I wish I could still log into my account from 2003. :(

        Part of your problem here is that you have more information than you're giving the computer, i.e. they're not just regexes but file system path expressions.

        You might take advantage of that knowledge and decompose the regex into a set of File::Find::Rule rules. Obviously you'll have to write the parser yourself, but hopefully all the regexes will be quite similar and there will be common patterns that you can spot and translate into rules.

        So you might end up with something like :-

        use File::Find::Rule;

        my @dirs  = File::Find::Rule->directory()
                                    ->name(qr/blah[0-9]/)
                                    ->in('C:\\Windows\\Program Files');
        my @files = File::Find::Rule->file()
                                    ->name('setup.exe')
                                    ->in(@dirs);

        It's an interesting problem and well worth spending some time on.

Re: Regex partial/leading match
by Laurent_R (Canon) on Jan 02, 2016 at 09:42 UTC
    PCRE stands for Perl Compatible Regular Expressions. It is a library that has been written to enable users of other languages to use the power of Perl's regular expressions. So, PCRE is just mimicking the Perl regular expressions, and, basically, anything you can do with PCRE can be done with Perl. In other words, if you really know how to do it in C with PCRE, then you should be able to do it very easily in Perl.

    The issue we have here is that your problem is not very well defined, so that it is difficult to suggest a solution.

    If I understand your issue correctly, you do not control the regexes that you are going to use; they are provided to you by an external source. If that is the case, then you can't do very much about it: it is the author of these regexes who ought to provide you with the proper set of patterns.

    Well, a general solution to this problem is probably out of reach, but I am also not saying that it can't be done. In theory at least, you might be able to parse the patterns supplied to you to build beginning-of-string regex sub-patterns provided those patterns supplied to you are relatively simple and very well defined, but that is not very easy and probably not a very robust solution, because you're probably bound to fail if the author of the patterns provided to you decides to start to be a bit too clever. OTOH, if you can define a very limited subset of authorized patterns, then it can probably be done. The fact that you seem to be willing to analyze directory paths is certainly essential to a possible definition of such simple patterns provided to you and therefore to a possible solution to your problem.

    But you don't give us enough information for us to really be able to help you much further.

      No. And no.

      The OP did provide enough information. He may be "guilty" of abusing the regex for a task typically solved otherwise, but the problem is understood and so is the desire for it. Variations on the theme have been posted before: "can I determine the point of matching failure? The longest tentative match?"

      I'm not intimately familiar with the guts of rx matching, but here's how I reckon this: the rx engine does its best to try to avoid any sort of delay or inefficiency in backtracking. It does not remember where it fails. So this feature is something that PCRE does support and perl does not.

      One "solution" that might work for the OP, is to rewrite the pattern and inject

      (?(?{pos==length})(*ACCEPT))
      before every \\, but that's neither generic nor tidy.
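A rough sketch of that injection idea follows, under several assumptions: `leading_match` is a hypothetical helper name, the pattern arrives as a plain string, directory paths carry no trailing backslash, and every `\\` in the pattern source really is a path separator (a `\\` inside a character class, for instance, would break it). Because the pattern is assembled at run time and contains a `(?{ ... })` block, `use re 'eval'` is needed, which is risky with externally supplied regexes.

```perl
use strict;
use warnings;
use re 'eval';   # needed: a run-time pattern containing (?{ ... })

# Hypothetical helper: true when $path is a full match of $pattern, or when
# the whole of $path was consumed just before a separator in the pattern.
sub leading_match {
    my ($path, $pattern) = @_;
    # Conditional: if the match position has reached the end of the input,
    # accept immediately instead of demanding the rest of the pattern.
    my $accept = '(?(?{ pos($_) == length($_) })(*ACCEPT))';
    # Naive rewrite: inject the conditional before every escaped backslash
    # (two literal backslash characters in the pattern source).
    (my $partial = $pattern) =~ s/(\\\\)/$accept$1/g;
    return $path =~ /$partial/ ? 1 : 0;
}

# Pattern source as it would arrive from outside: ^C:\\Windows(\\SYSTEM)?\\Foo\\Bar.dll
my $pattern = '^C:\\\\Windows(\\\\SYSTEM)?\\\\Foo\\\\Bar.dll';
for my $dir ('C:\\Windows', 'C:\\Windows\\SYSTEM', 'C:\\Temp') {
    printf "%-20s %s\n", $dir, leading_match($dir, $pattern) ? 'recurse' : 'skip';
}
```

The intent is that the two Windows directories are reported as worth recursing into and C:\Temp is skipped. Only whole directory names can qualify, since the early exit is placed at separators alone.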

Re: Regex partial/leading match
by Mr. Muskrat (Canon) on Dec 31, 2015 at 21:25 UTC
      Do you know of an answer that's actually in one of those links? I've read each of those pages several times over the last 18 years and I can't think of any obvious answer covered there. Certainly there may be something, but nothing obvious, I'm pretty sure.

        Yes. I do know that the answer is there if you look hard enough. You might have to read between the lines though.

        So you really did read those docs? If so, you found that regexes don't work the way you want them to. That means that you have to find another way to achieve the end result.

        Updated: s/they way/the way/

Re: Regex partial/leading match
by Anonymous Monk on Jan 02, 2016 at 12:37 UTC

    In one of your replies you wrote:

    I'm trying to think of any way to go "up another level", to see the problem from a higher view

    Here's a try: You're being given regexes to match against all the filenames in the system, are now looking for a way to optimize that, have discovered some caveats, etc. - maybe you're stuck in the world of string matching? The actual task is searching the filesystem, a task with plenty of existing solutions.

    How complex are the regexes you're getting? Do they only make use of simpler regex features like . [] (|) ? * +, or do they use the more powerful features like look-arounds, backreferences, and such?

    If the former, then my suggestion would be to parse the regexes and transform them to rules appropriate for a filesystem search. Just one example, if this were some kind of backup tool, you could transform the rules into include/exclude patterns appropriate for rsync.

Re: Regex partial/leading match
by Anonymous Monk on Jan 02, 2016 at 13:28 UTC

    And a thought to add to the anon above: did you consider globbing instead of regexen? It is far more typical to give path specifications in a simpler form: shell-style globbing, rsync patterns, etc.