moritz has asked for the wisdom of the Perl Monks concerning the following question:

Is there a way to know if a regex match failed because it reached the end of the string?

For example, if I match the regex /a\d+b/ against the string a123, the match fails, and the regex engine inspected the end of the string. Is there a way to tell that this has happened?
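
To make the problem concrete, here is a minimal sketch (the second sample string, a123c, is only an illustration). A plain match returns false both for a string that ended too early and for one that can never match, so the return value alone does not tell the two cases apart:

```perl
use strict;
use warnings;

my $re = qr/a\d+b/;

# Both strings fail to match, but only the first one could still match
# if more input were appended (for example "a123" . "b").
my @results;
for my $str ('a123', 'a123c') {
    push @results, ($str =~ $re ? 1 : 0);
    print "$str: ", ($results[-1] ? "match" : "no match"), "\n";
}
```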

I want to check this because I want to write a tokenizer that reads a file line by line, and matches user supplied regexes against the string.

Now, I don't know in advance how many lines to read before trying to match, so the idea is to check whether the match failed because the string ended in the middle of a potential match.

If there's a better solution to that I'm happy as well ;-)

Replies are listed 'Best First'.
Re: Did regex match fail because of "end of string"?
by ikegami (Patriarch) on Oct 16, 2007 at 20:59 UTC

    There's no easy way to do this. You could modify the regex engine, or you could modify your regex to check for the appropriate conditions. Even with a regex parser, it might be very tricky to do the latter automatically.

    Here's the version of /a\d+b/ with the checks added:

    # /a\d+b/
    while (<DATA>) {
        local our $incomplete;
        my $match = /
            a
            (?:$(?{$incomplete=1})(?!)|(?(?{$incomplete})(?!))
                \d+
                (?:$(?{$incomplete=1})(?!)|(?(?{$incomplete})(?!))
                    b
                )
            )
        /x;
        my $rv = $match ? "match" : $incomplete ? "incomplete" : "no match";
        chomp;
        printf("%-10s %s\n", $_, $rv);
    }
    __DATA__
    a123b
    a
    a1
    a123
    a123c
    a123ca123b
    a123ca123
    a123ca123c

    a123b      match
    a          incomplete
    a1         incomplete
    a123       incomplete
    a123c      no match
    a123ca123b match
    a123ca123  incomplete
    a123ca123c no match

    I recommend that you write a tokenizer and parser. If your language doesn't allow line breaks to happen in the middle of a token, the only time you need to read more data is when you're at the end of the buffer when the parser requests a new token.

    my $ws = qr/\s+/;

    sub get_token {
        my ($self) = @_;
        for ($self->{buf}) {
            s/^$ws//;
            if (length() == 0) {
                my $fh = $self->{fh};
                return [ TOK_EOF ] if eof($fh);
                $_ .= <$fh>;
                redo;
            }
            s/^([a-zA-Z][a-zA-Z0-9_]*)// && return [ TOK_IDENT, $1 ];
            ...
        }
    }

    If some tokens can contain line breaks, handle those cases specially.
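
    As a rough sketch of what "specially" could look like, assume a double-quoted string is the only token allowed to span lines. The helper name get_string_token and its calling convention are made up for this example, and escaped quotes are not handled:

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: read one double-quoted string token that may span
    # several lines. $buf_ref refers to the buffer, $fh is the input handle.
    # Escaped quotes inside the string are deliberately not handled.
    sub get_string_token {
        my ($buf_ref, $fh) = @_;
        return unless $$buf_ref =~ /^"/;      # buffer does not start a string
        # No closing quote yet: append lines until one shows up or input ends.
        until ($$buf_ref =~ /^"[^"]*"/) {
            my $line = <$fh>;
            return unless defined $line;      # EOF: unterminated string
            $$buf_ref .= $line;
        }
        $$buf_ref =~ s/^"([^"]*)"//;
        my $token = $1;
        return $token;
    }

    # Usage, with an in-memory filehandle standing in for a real file.
    my $input = "\"first line\nsecond line\" rest\n";
    open my $fh, '<', \$input or die "cannot open in-memory handle: $!";
    my $buf   = <$fh>;
    my $token = get_string_token(\$buf, $fh);
    print "token: [$token]\n";
    ```

    The point is that only this one token type ever asks for more input in the middle of a token; every other token can still be matched against the current line alone.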

      I can't rely on the fact that a token won't contain a newline, because the user of my (not yet existing) module will decide what a "token" looks like.

      But since the regexes will always be anchored, I can always find out automatically whether a match has started by using $match = m/\G(?{ $started = 1 })$re/. Now a way to find the longest submatch that was found (but discarded) would be enough.

      Or is there any other way to match against a stream?

        The construct you are showing is not 'anchored'. The only anchor expressions are '^' (beginning of string) and '$' (end of string). If I am understanding correctly, all you really care about are partial matches at the end of the current available string. Partial matches in the middle are already discarded as non-matches.

        Is there a reason that you cannot simply keep starting from the same location until you receive an end-of-string, or find a match? Can this be more data than you want to hold?
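
        A minimal sketch of that idea, assuming it is acceptable to hold the accumulated lines in memory. The helper name match_growing and the sample pattern are invented for the example:

        ```perl
        use strict;
        use warnings;

        # Hypothetical helper: try $re at the start of the buffer; on failure
        # append another line and retry from the same location, until a match
        # is found or the input is exhausted.
        sub match_growing {
            my ($re, $fh) = @_;
            my $buf = '';
            while (defined(my $line = <$fh>)) {
                $buf .= $line;
                if ($buf =~ /\A($re)/) {
                    my $token = $1;
                    return ($token, $buf);
                }
            }
            return;    # EOF without a match
        }

        # Usage, with an in-memory filehandle standing in for a real file.
        my $input = "a12\n34\n5b tail\n";
        open my $fh, '<', \$input or die "cannot open in-memory handle: $!";
        my ($tok) = match_growing(qr/a[\d\n]+b/, $fh);
        ```

        Two caveats: a pattern that never matches makes this read the whole input, and the first successful match is not necessarily the longest one that more input would have allowed.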

        If you can't do this, I can think of one (very ugly) option. Something like this:

        sub example {
            $foo = "[&#\$]";
            $regex = "a\\d+[ars]{2,4}(aa|ab|ac)";
            $string = "wle;fnaekf;fla;lkcnovnifa ";
            $min = $regex."\$"."foo";
            if ($min !~ /\$$/) {
                $min .= '$';
            }
            $match = 0;
            $tot = length($string);
            $index = $tot;
            print "index is $index\n";
            while (1) {
                print "min is $min\n";
                eval {
                    if ($string =~ m/$min/g) {
                        $index = pos $string;
                        $match = 1;
                    }
                };
                # print "err is $@\n";
                last if $match;
                $min =~ s/..$//;
                last if $min eq "";
                if ($min !~ /\$$/) {
                    $min .= '$';
                }
            }
            return $index;
        }
        $ind = example();
        You will also have to special-case lines terminated with '\'.
        $started is always set to 1 in your example.
Re: Did regex match fail because of "end of string"?
by johngg (Canon) on Oct 16, 2007 at 20:50 UTC
    I'm not sure if this is going to be of any use but you might be able to detect that the digits are at the end of the string rather than the 'b'. You could do this by making the match for 'b' conditional on whether you've got to the end so that the match doesn't actually fail if the 'b' isn't there. The code may do a better job of explaining what I mean.

    #!/usr/bin/perl -l
    #
    use strict;
    use warnings;

    my @strings = qw{
        a123ga123b
        a123bdfda123
        a123effa123
        a123
        d663h
    };

    my $len;
    my $rsLen = \$len;
    my $rxCondMatch = qr
        {(?x)
            a
            (\d+)
            (?{ print q{digits at end of string} if pos() == $$rsLen })
            (??{ if ( pos() != $$rsLen ) { q{b} } })
        };

    foreach my $string ( @strings )
    {
        $len = length $string;
        print $string;
        print q{Match} if $string =~ $rxCondMatch;
        print q{-} x 20;
    }

    Here's the output.

    a123ga123b
    Match
    --------------------
    a123bdfda123
    Match
    --------------------
    a123effa123
    digits at end of string
    Match
    --------------------
    a123
    digits at end of string
    Match
    --------------------
    d663h
    --------------------

    I hope this can be of use to you.

    Cheers,

    JohnGG

    Update: Corrected error in code, testing against $len instead of $$rsLen in (?{ ... }) block

      Thank you for your input, but I'll try to avoid modifying the regexes because they are user input, and I don't want to deparse them.

      And I don't only want to detect end-of-string between \d+ and 'b', but also between 'a' and \d+, which means that I'd have to add a closure between any two atoms in the regex. That's not a feasible option :(

      An input of "a" isn't flagged as an incomplete match.
Re: Did regex match fail because of "end of string"?
by roboticus (Chancellor) on Oct 16, 2007 at 19:45 UTC
    Hmmmm....

    I would've thought that if a regular expression match failed, it was because it hit the end of the string without finding a match. Other than program crashes, what else would cause the match to fail?

    ...roboticus

      I believe the OP wants to know if there was a time when the engine reached the end of the string after starting a match.

      For /a\d+b/,

      "a123b\n" -> match
      "a123\n"  -> incomplete match
      "a123c\n" -> no match
        ikegami:

        Ah! That interpretation certainly makes sense. (I found it hard to reconcile my interpretation of the question with moritz' experience.)

        ...roboticus

      If the regex is anchored to the beginning of the string, it could fail without reaching the end. Otherwise, as far as I can imagine it would have to match every available substring* up to the end of the string.

      * if the match is at least N characters long the match will fail if any of the (last - N) .. last characters of the string don't match and the subsequent characters don't have to be tested. Anyway this is equivalent to running into the end of string.

        Right, all matches in the tokenizer will be anchored to pos with \G.
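
        A minimal sketch of what such \G-anchored matching could look like. The rule table and the tokenize function are illustrative only; in the real module the regexes would be user-supplied:

        ```perl
        use strict;
        use warnings;

        # Illustrative token rules, tried in order.
        my @rules = (
            [ IDENT  => qr/[a-zA-Z][a-zA-Z0-9_]*/ ],
            [ NUMBER => qr/\d+/ ],
        );

        sub tokenize {
            my ($str) = @_;
            my @tokens;
            pos($str) = 0;
            TOKEN: while (pos($str) < length $str) {
                next TOKEN if $str =~ /\G\s+/gc;    # skip whitespace
                for my $rule (@rules) {
                    my ($name, $re) = @$rule;
                    # /g in scalar context starts matching at pos();
                    # /c keeps pos() unchanged when the match fails.
                    if ($str =~ /\G($re)/gc) {
                        push @tokens, [ $name, $1 ];
                        next TOKEN;
                    }
                }
                die "no rule matches at offset " . pos($str) . "\n";
            }
            return @tokens;
        }

        my @tokens = tokenize("foo 42 bar_1");
        print join(' ', map { "$_->[0]($_->[1])" } @tokens), "\n";
        ```

        Because every attempt starts at pos() and cannot slide forward, a failure here means "no token of this kind starts at this offset", which is the situation where the incomplete-match question arises.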
Re: Did regex match fail because of "end of string"?
by Illuminatus (Curate) on Oct 16, 2007 at 20:33 UTC
    I don't know what you mean by "end of string." Unless you anchor the expression using the ^, it is going to check until the end of the string. Now, it might not literally read the last char, depending on the regex. For 3 chars (min in your example), it probably will not, unless the last three chars are 'a\d^\d'. I don't think that there is a way, in general, to figure out if the end of the string is a partial match (say, 'a\d\d'), other than to do subsequent subset matches, using $ at the end. Illuminatus
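
    A rough sketch of the subset-match idea for /a\d+b/. The prefix patterns are written by hand here, which is an assumption of the example; deriving them automatically from an arbitrary user regex is the hard part. \z is used as a stricter form of $ that does not tolerate a trailing newline:

    ```perl
    use strict;
    use warnings;

    my $full = qr/a\d+b/;

    # Hand-written prefixes of the full pattern, each anchored at the very
    # end of the string.
    my @prefixes = (qr/a\d+\z/, qr/a\z/);

    # Hypothetical helper: classify a string as "match", "incomplete"
    # (its tail is a prefix of a possible match), or "no match".
    sub classify {
        my ($str) = @_;
        return 'match' if $str =~ $full;
        for my $prefix (@prefixes) {
            return 'incomplete' if $str =~ $prefix;
        }
        return 'no match';
    }

    for my $str (qw( a123b a a1 a123 a123c a123ca123b a123ca123 a123ca123c )) {
        printf "%-10s %s\n", $str, classify($str);
    }
    ```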