jdoege has asked for the wisdom of the Perl Monks concerning the following question:

Howdy Monks,

How can I tell if a regular expression is matching but fails merely because it ran out of data?

For instance, a lexer might have tokens that can cross line boundaries such as strings and comments. Not wanting to slurp the file in one go so as to minimize memory footprint, this lexer has a line buffer which it keeps tidy, getting more lines as needed and dereferencing lines when no longer needed. In such a case, it is possible that a regex match might start but not finish before needing more data in the line buffer. In such a case, I'd want to read in another line and try the match again or, somehow, continue the match. In either case I need to differentiate between the match failing due to bad content versus failing due to lack of content.

Any ideas?

Update...

Thank you for the good ideas. I am going to look into using the (?=\Z(?{ somecode }))? pattern for this. I agree that false positive, incomplete matches are also a problem. I think a bit of code to modify the pattern, automatically inserting the above snippet after every '{}', '+' or '*' quantifiers and having that code cause reading more data and re-trying the match every time when triggered will satisfy both the incomplete match and the fail due to end-of-buffer problem.

Update...2008-02-28 08:10 CST: Noted the all important '?' quantifier was missing from the code snippet. Not now!

Replies are listed 'Best First'.
Re: How regexes fail.
by ikegami (Patriarch) on Feb 25, 2008 at 21:37 UTC

    A lexer returns tokens, and tokens are rather simple. So simple that parsing them requires no backtracking. This makes it very easy to not use regexps.

    But it also your task possible (but not simple) with regexps too. Take Perl single-quoted strings:

    sub squote { my $s_pos = pos; my $match = / \G ' (?> (?: \\. | [^\\'] )* ) ' /xsgc; my $e_pos = pos; if ($match) { $_[0] = substr($_, $s_pos, $e_pos-$s_pos); return 1; } else { pos = $s_pos; return 0; } }
    becomes
    sub squote { my $s_pos = pos; RETRY: { local our $at_eof = ...; local our $need_more = 0; my $match = / \G (?: \z (?{ $need_more = !$at_eof }) )? ' (?> (?: \\ (?: \z (?{ $need_more = !$at_eof }) )? . | [^\\'] )* ) (?: \z (?{ $need_more = !$at_eof }) )? ' /xsgc; if ($need_more) { ... pos = $s_pos; redo RETRY; } my $e_pos = pos; if ($match) { $_[0] = substr($_, $s_pos, $e_pos-$s_pos); return 1; } else { pos = $s_pos; return 0; } } }

    By the way, you asked
    "How can I tell if a regular expression is matching but fails merely because it ran out of data?"
    but you should also be asking
    "How can I tell if a regular expression succeeded merely because it ran out of data?"

    Take an identifier, for example. If $_ contains 'abc', it'll return a match, but it'll return the wrong match if the next character is 'd'.

    sub ident { my $s_pos = pos; my $match = / \G [a-zA-Z_] [a-zA-Z0-9_]* /xsgc; my $e_pos = pos; if ($match) { $_[0] = substr($_, $s_pos, $e_pos-$s_pos); return 1; } else { pos = $s_pos; return 0; } }

    becomes

    sub ident { my $s_pos = pos; RETRY: { local our $at_eof = ...; local our $need_more = 0; my $match = / \G (?: \z (?{ $need_more = !$at_eof }) )? [a-zA-Z_] (?> [a-zA-Z0-9_]* ) (?: (?= . ) | (?{ $need_more = !$at_eof }) ) /xsgc; if ($need_more) { ... pos = $s_pos; redo RETRY; } my $e_pos = pos; if ($match) { $_[0] = substr($_, $s_pos, $e_pos-$s_pos); return 1; } else { pos = $s_pos; return 0; } } }

    Update: Simple optimization.

Re: How regexes fail.
by tilly (Archbishop) on Feb 25, 2008 at 23:25 UTC
Re: How regexes fail.
by kyle (Abbot) on Feb 25, 2008 at 20:55 UTC

    You may want to look at Matching in huge files, which uses a sliding window technique to match in large files.

      A sliding window is useless, since a lexer needs to match at the end of the last found token. However, the linked snippet could be modified to *expand* it's window instead of sliding it.

      I look at that and do not see that it differentiates between regex failure modes. That is, I think I see in the code that if a regex fails, it simply gets more data for the window and trys again. I would only want this to happen if the regex failed only by reaching the end of the window. Otherwise, a lexer should try a different token pattern to match.

      Or am I missing something?
Re: How regexes fail.
by TGI (Parson) on Feb 25, 2008 at 21:30 UTC

    The first thing that comes to mind is to have a bunch of optional captures.

    while (1) { $foo .= <HANDLE> $foo =~ /(bar(baz(buz)?)?)/; print "bar found\n" if defined $1; print "baz found\n" if defined $2; print "buz found\n" if defined $3; last if defined $3; }

    Of course this could get really ugly if you have big, complex regexes to work with. I'm willing to bet that there are other caveats as well, but I'm not sure what they might be.


    TGI says moo

Re: How regexes fail.
by johngg (Canon) on Feb 25, 2008 at 23:34 UTC
    Something similar was asked in this thread. Perhaps it will help you.

    Cheers,

    JohnGG