in reply to Re: Did regex match fail because of "end of string"?
in thread Did regex match fail because of "end of string"?

I can't rely on the fact that a token won't contain a newline, because the user of my (not yet existing) module will decide what a "token" looks like.

But since the regexes will always be anchored, I can always find out automatically whether a match has started by using $match = m/\G(?{ $started = 1 })$re/. Now all I need is a way to find the longest submatch that was found (but discarded).

Or is there any other way to match against a stream?

Re^3: Did regex match fail because of "end of string"?
by Illuminatus (Curate) on Oct 16, 2007 at 23:27 UTC
    The construct you are showing is not 'anchored'. The only anchor expressions are '^' (beginning of string) and '$' (end of string). If I am understanding correctly, all you really care about are partial matches at the end of the current available string. Partial matches in the middle are already discarded as non-matches.

    Is there a reason that you cannot simply keep starting from the same location until you receive an end-of-string, or find a match? Can this be more data than you want to hold?

    If you can't do this, I can think of one (very ugly) option. Something like this:

    sub example {
        $foo    = "[&#\$]";
        $regex  = "a\\d+[ars]{2,4}(aa|ab|ac)";
        $string = "wle;fnaekf;fla;lkcnovnifa ";
        $min    = $regex."\$"."foo";
        if ($min !~ /\$$/) { $min .= '$'; }
        $match = 0;
        $tot   = length($string);
        $index = $tot;
        print "index is $index\n";
        while (1) {
            print "min is $min\n";
            eval {
                if ($string =~ m/$min/g) {
                    $index = pos $string;
                    $match = 1;
                }
            };
            # print "err is $@\n";
            last if $match;
            $min =~ s/..$//;
            last if $min eq "";
            if ($min !~ /\$$/) { $min .= '$'; }
        }
        return $index;
    }
    $ind = example();
    You will also have to special-case lines terminated with '\'.
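For readers who want to try the truncation idea in isolation, here is a minimal sketch of a tidier variant, not the code above verbatim. It chops one character at a time, anchors each candidate with \z, and relies on eval to skip truncations that do not compile. The name partial_tail_start and the choice to return the start offset of the partial match are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: find where a (possibly partial) match of $pattern
# begins at the very end of $string. $pattern is the regex source as a string.
sub partial_tail_start {
    my ($pattern, $string) = @_;
    my $p = $pattern;
    while (length $p) {
        # Truncated patterns are often invalid; qr// then dies and eval
        # leaves $re undefined, so that candidate is simply skipped.
        my $re = eval { qr/(?:$p)\z/ };
        return $-[0] if defined $re && $string =~ $re;
        chop $p;    # drop the last character and try a shorter prefix
    }
    return;         # nothing at the end of $string looks like a prefix
}

my $start = partial_tail_start('a\d+b', "xx a12");
print "possible partial match starts at offset $start\n";
```

This keeps the bluntness discussed in the thread: a truncated pattern can compile and still mean something other than "prefix of the original", and a pattern that can match the empty string will always report the end of the buffer.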
      Is there a reason that you cannot simply keep starting from the same location until you receive an end-of-string, or find a match?

      Yes, I don't know if the regex reached the end of the string and failed, in which case I'd have to load more data.

      Your method seems to be a bit blunt, removing a char blindly from the regex - which leads to many non-valid regexes and big performance penalties. The idea is quite interesting, though ;-)

        Yes, I don't know if the regex reached the end of the string and failed, in which case I'd have to load more data.

        If the match test fails then you have reached the end of the string without a match, unless the regex begins with a '^'. If you disallow this, you should be fine:

        $str = "";
        while (<>) {
            $str .= $_;
            # test the accumulated buffer, not just the line that was read last
            last if $str =~ m/a\d+b/;
        }
        The buffer may end in a partial match, but you don't care, because the next concatenation will either complete the match or rule it out (depending on what the new data turns out to be).
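The accumulate-and-retry loop can be sketched without a real input handle. In the sketch below the chunk list stands in for successive reads, and the helper name match_across_chunks is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: @chunks stands in for successive reads from a stream.
# Returns the matched text and the number of chunks consumed, or an empty list.
sub match_across_chunks {
    my ($re, @chunks) = @_;
    my $buf   = '';
    my $count = 0;
    for my $chunk (@chunks) {
        $buf .= $chunk;      # grow the buffer, then retry from the start
        $count++;
        return ($1, $count) if $buf =~ /($re)/;
    }
    return;                  # input exhausted without a match
}

my ($text, $reads) = match_across_chunks(qr/a\d+b/, "xxa1", "23", "b yy");
print "matched '$text' after $reads reads\n";
```

The token a123b is split over three chunks and is only found once all three are in the buffer. Note that this works because a\d+b cannot succeed early. A pattern such as a\d+ would already match "a1" after the first chunk, which is the situation the original question is about.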

        The previous ugly example is most likely the only other solution. It may not be as CPU-expensive as you think. Since each iteration is anchored at the end of the string, it will generally not be matched against the whole string. The invalid regexes will bail out immediately, without matching anything.

Re^3: Did regex match fail because of "end of string"?
by ikegami (Patriarch) on Oct 16, 2007 at 23:03 UTC
    $started is always set to 1 in your example.
      Right. I didn't think enough about that one :(