Re: Did regex match fail because of "end of string"?
by ikegami (Patriarch) on Oct 16, 2007 at 20:59 UTC
There's no easy way to do this. You could modify the regex engine, or you could modify your regex to check for the appropriate conditions. Even with a regex parser, it might be very tricky to do the latter automatically.
Here's the version of /a\d+b/ with the checks added:
# /a\d+b/
while (<DATA>) {
    local our $incomplete;
    my $match = /
        a
        (?:$(?{$incomplete=1})(?!)|(?(?{$incomplete})(?!))
            \d+
            (?:$(?{$incomplete=1})(?!)|(?(?{$incomplete})(?!))
                b
            )
        )
    /x;
    my $rv = $match      ? "match"
           : $incomplete ? "incomplete"
           :               "no match";
    chomp;
    printf("%-10s %s\n", $_, $rv);
}
__DATA__
a123b
a
a1
a123
a123c
a123ca123b
a123ca123
a123ca123c
Output:
a123b      match
a          incomplete
a1         incomplete
a123       incomplete
a123c      no match
a123ca123b match
a123ca123  incomplete
a123ca123c no match
I recommend that you write a tokenizer and parser. If your language doesn't allow line breaks in the middle of a token, the only time you need to read more data is when the buffer is empty at the moment the parser requests a new token.
my $ws = qr/\s+/;
sub get_token {
    my ($self) = @_;
    for ($self->{buf}) {
        s/^$ws//;
        if (length() == 0) {
            my $fh = $self->{fh};
            return [ TOK_EOF ] if eof($fh);
            $_ .= <$fh>;
            redo;
        }
        s/^([a-zA-Z][a-zA-Z0-9_]*)// && return [ TOK_IDENT, $1 ];
        ...
    }
}
If some tokens can contain line breaks, handle those cases specially.
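To make the refill logic above concrete, here is a minimal runnable sketch of the same tokenizer. The token constants, the `TOK_NUM` rule, and the helpers `new_lexer` and `tokenize_string` are assumptions added for illustration; the original post leaves them undefined or elided. It also assumes no token spans a line break.

```perl
use strict;
use warnings;

# Hypothetical token types (not defined in the original post).
use constant { TOK_EOF => 0, TOK_IDENT => 1, TOK_NUM => 2 };

my $ws = qr/\s+/;

# Hypothetical constructor: a lexer is a filehandle plus a text buffer.
sub new_lexer {
    my ($fh) = @_;
    return { fh => $fh, buf => '' };
}

sub get_token {
    my ($self) = @_;
    for ($self->{buf}) {            # alias $_ to the buffer
        s/^$ws//;                   # skip leading whitespace
        if (length($_) == 0) {
            my $fh = $self->{fh};
            return [ TOK_EOF ] if eof($fh);
            $_ .= <$fh>;            # refill: append the next line
            redo;                   # retry with the larger buffer
        }
        s/^([a-zA-Z][a-zA-Z0-9_]*)// && return [ TOK_IDENT, $1 ];
        s/^(\d+)//                   && return [ TOK_NUM,   $1 ];
        die "Unexpected input: $_";
    }
}

# Hypothetical driver: tokenize a string through an in-memory filehandle.
sub tokenize_string {
    my ($text) = @_;
    open(my $fh, '<', \$text) or die "Cannot open in-memory handle: $!";
    my $lexer = new_lexer($fh);
    my @tokens;
    while (1) {
        my $tok = get_token($lexer);
        last if $tok->[0] == TOK_EOF;
        push @tokens, $tok;
    }
    return @tokens;
}
```

Calling `tokenize_string("foo 42\n  bar_1\n")` should yield three tokens. The buffer is only refilled when it is empty after whitespace has been stripped, which is why this approach needs no "incomplete match" detection.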
| [reply] [d/l] [select] |
I can't rely on a token never containing a newline, because the user of my (not yet existing) module will decide what a "token" looks like.
But since the regexes will always be anchored, I can always find out automatically whether a match has started by using
$match = m/\G(?{ $started = 1 })$re/.
Now a way to find the longest submatch that was found (but discarded) would be enough.
Or is there any other way to match against a stream?
| [reply] [d/l] |
The construct you are showing is not 'anchored'. The only anchor expressions are '^' (beginning of string) and '$' (end of string). If I am understanding correctly, all you really care about are partial matches at the end of the current available string. Partial matches in the middle are already discarded as non-matches.
Is there a reason that you cannot simply keep starting from the same location until you receive an end-of-string, or find a match? Can this be more data than you want to hold?
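A minimal sketch of that "keep retrying from the same location" idea, assuming the input arrives through a filehandle and that holding all unmatched data in memory is acceptable. The helper name `match_stream` and its `$chunk_size` parameter are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: grow a buffer chunk by chunk and retry the same
# regex against it until it matches or the input is exhausted.
sub match_stream {
    my ($fh, $re, $chunk_size) = @_;
    my $buf = '';
    while (1) {
        return $1 if $buf =~ /($re)/;        # matched: return the matched text
        my $n = read($fh, my $chunk, $chunk_size);
        die "read failed: $!" unless defined $n;
        return if $n == 0;                   # end of input, still no match
        $buf .= $chunk;                      # more data: retry from the start
    }
}
```

For example, with the data "xxa123bxx" read three characters at a time, `match_stream($fh, qr/a\d+b/, 3)` should return "a123b" once the third chunk arrives. Note the limitation: a pattern that can succeed early, such as /a\d+/, may match before all of its digits have been read, so this only works when a successful match cannot be extended by later input.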
If you can't do this, I can think of one (very ugly) option. Something like this:
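sub example {
    $foo = "[&#\$]";
    $regex = "a\\d+[ars]{2,4}(aa|ab|ac)";
    $string = "wle;fnaekf;fla;lkcnovnifa ";
    $min = $regex."\$"."foo";
    # Make sure the pattern is anchored at the end of the string.
    if ($min !~ /\$$/) {
        $min .= '$';
    }
    $match = 0;
    $tot = length($string);
    $index = $tot;
    print "index is $index\n";
    # Try ever-shorter truncations of the pattern until one matches
    # at the end of the string.
    while (1) {
        print "min is $min\n";
        # A truncated pattern may not compile; eval traps that error.
        eval {
            if ($string =~ m/$min/g) {
                $index = pos $string;
                $match = 1;
            }
        };
        # print "err is $@\n";
        last if $match;
        # Drop the last two characters (normally the trailing '$' plus
        # one pattern character), then re-anchor.
        $min =~ s/..$//;
        last if $min eq "";
        if ($min !~ /\$$/) {
            $min .= '$';
        }
    }
    return $index;
}
$ind = example();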
You will also have to special-case lines terminated with '\'.
| [reply] [d/l] |
$started is always set to 1 in your example.
| [reply] [d/l] [select] |
Re: Did regex match fail because of "end of string"?
by johngg (Canon) on Oct 16, 2007 at 20:50 UTC
I'm not sure if this is going to be of any use but you might be able to detect that the digits are at the end of the string rather than the 'b'. You could do this by making the match for 'b' conditional on whether you've got to the end so that the match doesn't actually fail if the 'b' isn't there. The code may do a better job of explaining what I mean.
#!/usr/bin/perl -l
#
use strict;
use warnings;
my @strings = qw{
    a123ga123b
    a123bdfda123
    a123effa123
    a123
    d663h
};
my $len;
my $rsLen = \$len;
my $rxCondMatch = qr
    {(?x)
        a
        (\d+)
        (?{ print q{digits at end of string} if pos() == $$rsLen })
        (??{ if ( pos() != $$rsLen ) { q{b} } })
    };
foreach my $string ( @strings )
{
    $len = length $string;
    print $string;
    print q{Match} if $string =~ $rxCondMatch;
    print q{-} x 20;
}
Here's the output.
a123ga123b
Match
--------------------
a123bdfda123
Match
--------------------
a123effa123
digits at end of string
Match
--------------------
a123
digits at end of string
Match
--------------------
d663h
--------------------
I hope this can be of use to you. Cheers, JohnGG
Update: Corrected error in code, testing against $len instead of $$rsLen in (?{ ... }) block | [reply] [d/l] [select] |
An input of "a" isn't flagged as an incomplete match.
| [reply] [d/l] |
Re: Did regex match fail because of "end of string"?
by roboticus (Chancellor) on Oct 16, 2007 at 19:45 UTC
Hmmmm....
I would've thought that if a regular expression match failed, it was because it hit the end of the string without finding a match. Other than program crashes, what else would cause the match to fail?
...roboticus | [reply] |
"a123b\n" -> match
"a123\n" -> incomplete match
"a123c\n" -> no match
| [reply] [d/l] [select] |
Right, all matches in the tokenizer will be anchored to pos with \G.
| [reply] [d/l] |
Re: Did regex match fail because of "end of string"?
by Illuminatus (Curate) on Oct 16, 2007 at 20:33 UTC
I don't know what you mean by "end of string." Unless you anchor the expression using the ^, it is going to check until the end of the string. Now, it might not literally read the last char, depending on the regex. For 3 chars (min in your example), it probably will not, unless the last three chars are 'a\d^\d'. I don't think that there is a way, in general, to figure out if the end of the string is a partial match (say, 'a\d\d'), other than to do subsequent subset matches, using $ at the end.
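One way the "subsequent subset matches" idea could look, as a sketch only: the pattern is split by hand into ordered pieces, and each proper prefix of those pieces is tried against the end of the string. The helper `classify` and the piece list are hypothetical, and the input is assumed to be chomped already.

```perl
use strict;
use warnings;

# Hypothetical helper: @pieces are the pattern's components in order,
# e.g. ('a', '\d+', 'b') for /a\d+b/.
sub classify {
    my ($string, @pieces) = @_;
    my $full = join '', @pieces;
    return 'match' if $string =~ /$full/;

    # Try each proper prefix of the pattern, longest first,
    # anchored at the very end of the string with \z.
    for my $n (reverse 1 .. $#pieces) {
        my $prefix = join '', @pieces[0 .. $n - 1];
        return 'incomplete' if $string =~ /(?:$prefix)\z/;
    }
    return 'no match';
}
```

With the pieces 'a', '\d+', 'b', this classifies "a123b" as a match, "a" and "a123" as incomplete, and "a123c" as no match. It only detects a partial match at piece boundaries, so a multi-character piece that is itself cut off (say, "ab" of a literal "abc") would be reported as no match unless the pieces are split further.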
Illuminatus | [reply] |