Did regex match fail because of "end of string"?

moritz has asked for the wisdom of the Perl Monks concerning the following question:

Re: Did regex match fail because of "end of string"? by ikegami (Patriarch) on Oct 16, 2007 at 20:59 UTC
There's no easy way to do this. You could modify the regex engine, or you could modify your regex to check for the appropriate conditions. Even with a regex parser, it might be very tricky to do the latter automatically.

Here's the version of `/a\d+b/` with the checks added:

```perl
# /a\d+b/

while (<DATA>) {
   local our $incomplete;
   my $match = /
      a
      (?:$(?{$incomplete=1})(?!)|(?(?{$incomplete})(?!))
         \d+
         (?:$(?{$incomplete=1})(?!)|(?(?{$incomplete})(?!))
            b
         )
      )
   /x;
   my $rv = $match ? "match" : $incomplete ? "incomplete" : "no match";
   chomp;
   printf("%-10s %s\n", $_, $rv);
}

__DATA__
a123b
a
a1
a123
a123c
a123ca123b
a123ca123
a123ca123c
```

```
a123b      match
a          incomplete
a1         incomplete
a123       incomplete
a123c      no match
a123ca123b match
a123ca123  incomplete
a123ca123c no match
```

I recommend that you write a tokenizer and parser. If your language doesn't allow line breaks to happen in the middle of a token, the only time you need to read more data is when you're at the end of the buffer when the parser requests a new token.

```perl
my $ws = qr/\s+/;

sub get_token {
   my ($self) = @_;
   for ($self->{buf}) {
      s/^$ws//;
      if (length() == 0) {
         my $fh = $self->{fh};
         return [ TOK_EOF ] if eof($fh);
         $_ .= <$fh>;
         redo;
      }

      s/^([a-zA-Z][a-zA-Z0-9_]*)//
         && return [ TOK_IDENT, $1 ];

      ...
   }
}
```

If some tokens can contain line breaks, handle those cases specially.
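To make the buffer-refill idea concrete, here is a minimal runnable sketch built on the `get_token` outline above. Several things in it are assumptions added for illustration and are not from the post: the numeric values of `TOK_EOF` and `TOK_IDENT`, the extra `TOK_INT` rule, the `die` on unrecognised input, and the in-memory filehandle standing in for a real stream.

```perl
use strict;
use warnings;

# Assumed token codes; the original post does not define them.
use constant { TOK_EOF => 0, TOK_IDENT => 1, TOK_INT => 2 };

# Return the next token from $self->{buf}. Another line is read from
# $self->{fh} only when the buffer has been used up.
sub get_token {
    my ($self) = @_;
    for ($self->{buf}) {
        s/^\s+//;                          # drop leading whitespace
        if (length() == 0) {
            my $fh = $self->{fh};
            return [ TOK_EOF ] if eof($fh);
            $_ .= <$fh>;                   # refill and try again
            redo;
        }
        s/^([a-zA-Z][a-zA-Z0-9_]*)// and return [ TOK_IDENT, $1 ];
        s/^(\d+)//                   and return [ TOK_INT,   $1 ];
        die "unexpected input: $_";        # illustrative error handling
    }
}

# Drive it from an in-memory filehandle so the sketch is self-contained.
my $data = "foo 12\n\n  bar\n";
open my $fh, '<', \$data or die $!;
my $lexer = { buf => '', fh => $fh };
while ((my $tok = get_token($lexer))->[0] != TOK_EOF) {
    print "@$tok\n";
}
```

Because no token here can span a line break, an empty buffer is the only situation in which the lexer has to ask for more input.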
Re^2: Did regex match fail because of "end of string"? by moritz (Cardinal) on Oct 16, 2007 at 21:23 UTC
I can't rely on a token not containing a newline, because the user of my (not yet existing) module will decide what a "token" looks like. But since the regexes will always be anchored, I can always find out automatically whether a match has started by using `$match = m/\G(?{ $started = 1 })$re/`. Now a way to find the longest submatch that was found (but discarded) would be enough. Or is there any other way to match against a stream?
Re^3: Did regex match fail because of "end of string"? by Illuminatus (Curate) on Oct 16, 2007 at 23:27 UTC
The construct you are showing is not 'anchored'. The only anchor expressions are '^' (beginning of string) and '$' (end of string). If I am understanding correctly, all you really care about are partial matches at the end of the current available string. Partial matches in the middle are already discarded as non-matches. Is there a reason that you cannot simply keep starting from the same location until you receive an end-of-string, or find a match? Can this be more data than you want to hold? If you can't do this, I can think of one (very ugly) option. Something like this:

```perl
sub example {
    $foo = "[&#\$]";
    $regex = "a\\d+[ars]{2,4}(aa|ab|ac)";
    $string="wle;fnaekf;fla;lkcnovnifa ";
    $min = $regex."\$"."foo";
    if ($min !~ /\$$/) { $min .= '$'; }
    $match = 0;
    $tot = length($string);
    $index = $tot;
    print "index is $index\n";
    while (1) {
        print "min is $min\n";
        eval {
            if ($string =~ m/$min/g) {
                $index = pos $string;
                $match = 1;
            }
        };
        # print "err is $@\n";
        last if $match;
        $min =~ s/..$//;
        last if $min eq "";
        if ($min !~ /\$$/) { $min .= '$'; }
    }
    return $index;
}
$ind = example();
```

You will also have to special-case lines terminated with '\'.
Re^4: Did regex match fail because of "end of string"? by moritz (Cardinal) on Oct 17, 2007 at 05:45 UTC
Re^5: Did regex match fail because of "end of string"? by Illuminatus (Curate) on Oct 17, 2007 at 08:08 UTC
Re^3: Did regex match fail because of "end of string"? by ikegami (Patriarch) on Oct 16, 2007 at 23:03 UTC
`$started` is always set to `1` in your example.
Re^4: Did regex match fail because of "end of string"? by moritz (Cardinal) on Oct 17, 2007 at 05:39 UTC
Re: Did regex match fail because of "end of string"? by johngg (Canon) on Oct 16, 2007 at 20:50 UTC
I'm not sure if this is going to be of any use but you might be able to detect that the digits are at the end of the string rather than the 'b'. You could do this by making the match for 'b' conditional on whether you've got to the end so that the match doesn't actually fail if the 'b' isn't there. The code may do a better job of explaining what I mean.

```perl
#!/usr/bin/perl -l
#
use strict;
use warnings;

my @strings = qw{
    a123ga123b
    a123bdfda123
    a123effa123
    a123
    d663h
    };

my $len;
my $rsLen = \$len;
my $rxCondMatch = qr
    {(?x)
        a
        (\d+)
        (?{ print q{digits at end of string}
            if pos() == $$rsLen })
        (??{ if ( pos() != $$rsLen ) { q{b} } })
    };

foreach my $string ( @strings )
{
    $len = length $string;
    print $string;
    print q{Match} if $string =~ $rxCondMatch;
    print q{-} x 20;
}
```

Here's the output.

```
a123ga123b
Match
--------------------
a123bdfda123
Match
--------------------
a123effa123
digits at end of string
Match
--------------------
a123
digits at end of string
Match
--------------------
d663h
--------------------
```

I hope this can be of use to you.

Cheers,

JohnGG

Update: Corrected error in code, testing against `$len` instead of `$$rsLen` in `(?{ ... })` block
Re^2: Did regex match fail because of "end of string"? by moritz (Cardinal) on Oct 16, 2007 at 21:11 UTC
Thank you for your input, but I'll try to avoid modifying the regexes because they are user input, and I don't want to deparse them. And I don't only want to detect end-of-string between \d+ and 'b', but also between 'a' and \d+, which means that I'd have to add a closure between any two atoms in the regex. That's not a feasible option :(
Re^2: Did regex match fail because of "end of string"? by ikegami (Patriarch) on Oct 16, 2007 at 21:06 UTC
An input of "`a`" isn't flagged as an incomplete match.
Re: Did regex match fail because of "end of string"? by roboticus (Chancellor) on Oct 16, 2007 at 19:45 UTC
Hmmmm.... I would've thought that if a regular expression match failed, it was because it hit the end of the string without finding a match. Other than program crashes, what else would cause the match to fail?

...roboticus
Re^2: Did regex match fail because of "end of string"? by ikegami (Patriarch) on Oct 16, 2007 at 20:03 UTC
I believe the OP wants to know if there was a time when the engine reached the end of the string after starting a match. For `/a\d+b/`:

```
"a123b\n" -> match
"a123\n"  -> incomplete match
"a123c\n" -> no match
```
Re^3: Did regex match fail because of "end of string"? by roboticus (Chancellor) on Oct 16, 2007 at 20:21 UTC
ikegami: Ah! That interpretation certainly makes sense. (I found it hard to reconcile my interpretation of the question with moritz' experience.)

...roboticus
Re^2: Did regex match fail because of "end of string"? by Joost (Canon) on Oct 16, 2007 at 20:02 UTC
If the regex is anchored to the beginning of the string, it could fail without reaching the end. Otherwise, as far as I can imagine it would have to match every available substring* up to the end of the string.

* if the match is at least N characters long the match will fail if any of the (last - N) .. last characters of the string don't match and the subsequent characters don't have to be tested. Anyway this is equivalent to running into the end of string.

"What should it profit a man, if he should win a flame war, yet lose his cool?"
Re^3: Did regex match fail because of "end of string"? by moritz (Cardinal) on Oct 16, 2007 at 20:09 UTC
Right, all matches in the tokenizer will be anchored to pos with `\G`.
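For readers unfamiliar with this style, here is a minimal sketch of a tokenizer loop in which every attempt is anchored at `pos` with `\G`. The rule table, the token names and the sub name `tokenize` are hypothetical and only serve the illustration. The `/c` modifier matters: without it, a failed match would reset `pos`.

```perl
use strict;
use warnings;

# Hypothetical token table; names and patterns are illustrative only.
my @rules = (
    [ IDENT => qr/[a-zA-Z][a-zA-Z0-9_]*/ ],
    [ INT   => qr/\d+/ ],
    [ OP    => qr/[-+*\/=]/ ],
);

# Split $input into [type, text] pairs. Every attempt starts at pos($input).
sub tokenize {
    my ($input) = @_;
    my @tokens;
    pos($input) = 0;
    TOKEN: while (pos($input) < length $input) {
        next TOKEN if $input =~ /\G\s+/gc;        # skip whitespace
        for my $rule (@rules) {
            my ($type, $re) = @$rule;
            if ($input =~ /\G($re)/gc) {           # /c keeps pos() on failure
                push @tokens, [ $type, $1 ];
                next TOKEN;
            }
        }
        die sprintf "no token matches at offset %d\n", pos($input);
    }
    return \@tokens;
}

print join(' ', map { "$_->[0]($_->[1])" } @{ tokenize('x1 = 42 + y') }), "\n";
```

Since a `\G`-anchored attempt can only succeed at the current position, a failure says nothing by itself about whether the data simply ran out, which is the case the original question is about.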
Re^4: Did regex match fail because of "end of string"? by Joost (Canon) on Oct 16, 2007 at 20:33 UTC
Re: Did regex match fail because of "end of string"? by Illuminatus (Curate) on Oct 16, 2007 at 20:33 UTC
I don't know what you mean by "end of string." Unless you anchor the expression using the ^, it is going to check until the end of the string. Now, it might not literally read the last char, depending on the regex. For 3 chars (min in your example), it probably will not, unless the last three chars are 'a\d^\d'. I don't think that there is a way, in general, to figure out if the end of the string is a partial match (say, 'a\d\d'), other than to do subsequent subset matches, using $ at the end.

Illuminatus
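The "subset matches, using $ at the end" idea can be sketched by hand for the thread's sample pattern `/a\d+b/`. This is only an illustration under assumptions: the prefix pattern `a\d*\z` is written manually for this one regex (nothing here derives it from an arbitrary user-supplied pattern), the sub name `classify` is made up, and the input is assumed to carry no trailing newline, hence `\z` rather than `$`.

```perl
use strict;
use warnings;

# The full pattern, plus a hand-written pattern for its proper prefixes
# ("a" alone, or "a" followed by digits) running into the end of the data.
my $full   = qr/a\d+b/;
my $prefix = qr/a\d*\z/;

# Three-way classification of one chunk of input (no trailing newline).
sub classify {
    my ($input) = @_;
    return 'match'      if $input =~ $full;
    return 'incomplete' if $input =~ $prefix;
    return 'no match';
}

for my $input (qw( a123b a a123 a123c a123ca123 )) {
    printf "%-10s %s\n", $input, classify($input);
}
```

On the eight sample strings from ikegami's reply this gives the same three-way classification as his instrumented regex. The hard part of the question remains untouched, though: producing the prefix pattern automatically from a regex that the user supplies.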