in reply to Did regex match fail because of "end of string"?

There's no easy way to do this. You could modify the regex engine, or you could modify your regex to check for the appropriate conditions. Even with a regex parser, it might be very tricky to do the latter automatically.

Here's the version of /a\d+b/ with the checks added:

    # /a\d+b/

    while (<DATA>) {
        local our $incomplete;

        my $match = /
            a
            (?:$(?{$incomplete=1})(?!)|(?(?{$incomplete})(?!))
                \d+
                (?:$(?{$incomplete=1})(?!)|(?(?{$incomplete})(?!))
                    b
                )
            )
        /x;

        my $rv = $match ? "match" : $incomplete ? "incomplete" : "no match";
        chomp;
        printf("%-10s %s\n", $_, $rv);
    }

    __DATA__
    a123b
    a
    a1
    a123
    a123c
    a123ca123b
    a123ca123
    a123ca123c

Output:

    a123b      match
    a          incomplete
    a1         incomplete
    a123       incomplete
    a123c      no match
    a123ca123b match
    a123ca123  incomplete
    a123ca123c no match

I recommend that you write a tokenizer and parser. If your language doesn't allow line breaks in the middle of a token, the only time you need to read more data is when the parser requests a new token and the buffer is empty.

    my $ws = qr/\s+/;

    sub get_token {
        my ($self) = @_;
        for ($self->{buf}) {
            s/^$ws//;

            if (length() == 0) {
                my $fh = $self->{fh};
                return [ TOK_EOF ] if eof($fh);
                $_ .= <$fh>;
                redo;
            }

            s/^([a-zA-Z][a-zA-Z0-9_]*)//
                && return [ TOK_IDENT, $1 ];

            ...
        }
    }

If some tokens can contain line breaks, handle those cases specially.
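
As a rough sketch of one such special case, assume a double-quoted string token that may span lines. When the buffer starts with an opening quote but the closing quote has not arrived yet, keep appending lines until the token is complete or the input ends. The names `read_string_token`, `TOK_STRING` and `TOK_ERROR` are hypothetical and only serve the illustration; the `buf` and `fh` fields follow the snippet above.

```perl
use strict;
use warnings;

# Hypothetical token types, chosen for this sketch only.
use constant { TOK_STRING => 'STRING', TOK_ERROR => 'ERROR' };

# Hypothetical helper: extract a double-quoted string that may span lines.
# $self->{buf} holds unconsumed input; $self->{fh} is the input handle.
sub read_string_token {
    my ($self) = @_;
    for ($self->{buf}) {
        return undef unless /^"/;    # buffer does not start a string token

        while (1) {
            # A complete string is buffered: remove it and return its body.
            # /s lets an escaped newline (backslash, newline) match too.
            if (s/^"((?:[^"\\]|\\.)*)"//s) {
                return [ TOK_STRING, $1 ];
            }

            # Incomplete token: read another line, or give up at EOF.
            my $fh = $self->{fh};
            return [ TOK_ERROR, 'unterminated string' ] if eof($fh);
            $_ .= <$fh>;
        }
    }
}

# Usage: the string starts on one line and ends on the next.
my $input = qq{"two\nlines" rest\n};
open(my $fh, '<', \$input) or die $!;
my $self = { buf => '', fh => $fh };
$self->{buf} .= <$fh>;               # buffer now holds only the first line
my $tok = read_string_token($self);
print "$tok->[0]: $tok->[1]\n";
```

Only this one token type needs to know about refilling in the middle of a token; every other token can still assume it fits in a single line.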

Re^2: Did regex match fail because of "end of string"?
by moritz (Cardinal) on Oct 16, 2007 at 21:23 UTC
    I can't rely on the fact that a token won't contain a newline, because the user of my (not yet existing) module will decide what a "token" looks like.

    But since the regexes will always be anchored, I can always find out automatically whether a match has started by using $match = m/\G(?{ $started = 1 })$re/. Now a way to find the longest submatch that was found (but discarded) would be enough.

    Or is there any other way to match against a stream?

      The construct you are showing is not 'anchored'. The only anchor expressions are '^' (beginning of string) and '$' (end of string). If I am understanding correctly, all you really care about are partial matches at the end of the current available string. Partial matches in the middle are already discarded as non-matches.

      Is there a reason that you cannot simply keep starting from the same location until you receive an end-of-string, or find a match? Can this be more data than you want to hold?
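
A minimal sketch of that idea, under the assumption that the input arrives line by line from a filehandle: retry the match from the start of the buffered data, appending one more line after each failure, until it matches or the input ends. The helper name `match_with_refill` and its arguments are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: try $re at the start of the buffered input, pulling in
# one more line after each failure, until it matches or the input runs out.
sub match_with_refill {
    my ($fh, $buf_ref, $re) = @_;
    while (1) {
        if ($$buf_ref =~ /\A$re/) {
            # $+[0] is the end offset of the match; 4-arg substr removes
            # the matched text from the buffer and returns it.
            return substr($$buf_ref, 0, $+[0], '');
        }
        return undef if eof($fh);    # no more data: a genuine failure
        $$buf_ref .= <$fh>;          # more data: retry from the same location
    }
}

# Usage: the token is split across two lines of input.
my $input = "a12\n34b tail\n";
open(my $fh, '<', \$input) or die $!;
my $buf = '';
my $m = match_with_refill($fh, \$buf, qr/a[\d\n]+b/);
print defined $m ? "matched\n" : "no match\n";
```

This sketch has two costs. A failed match buffers everything up to end of input, and a greedy pattern can succeed on a prefix that the next line would have extended.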

      If you can't do this, I can think of one (very ugly) option. Something like this:

          sub example {
              $foo = "[&#\$]";
              $regex = "a\\d+[ars]{2,4}(aa|ab|ac)";
              $string="wle;fnaekf;fla;lkcnovnifa ";
              $min = $regex."\$"."foo";
              if ($min !~ /\$$/) { $min .= '$'; }
              $match = 0;
              $tot = length($string);
              $index = $tot;
              print "index is $index\n";
              while (1) {
                  print "min is $min\n";
                  eval {
                      if ($string =~ m/$min/g) {
                          $index = pos $string;
                          $match = 1;
                      }
                  };
                  # print "err is $@\n";
                  last if $match;
                  $min =~ s/..$//;
                  last if $min eq "";
                  if ($min !~ /\$$/) { $min .= '$'; }
              }
              return $index;
          }
          $ind = example();

      You will also have to special-case lines terminated with '\'.
        Is there a reason that you cannot simply keep starting from the same location until you receive an end-of-string, or find a match?

        Yes, I don't know if the regex reached the end of the string and failed, in which case I'd have to load more data.

        Your method seems a bit blunt: blindly removing characters from the regex leads to many invalid regexes and a big performance penalty. The idea is quite interesting, though ;-)

      $started is always set to 1 in your example.
        Right. I didn't think enough about that one :(