in reply to Re: Possible to have regexes act on file directly (not in memory)

... I am forbidden to assume anything regarding their contents or structure ...

This means you may not assume that, in a file/string 1 TB in length, a match is shorter than the string itself. In code, you would need something that could do this:

c:\@Work\Perl\monks>perl -wMstrict -le "my $s = '<1e12 chars, none of which is a left or right angle bracket>'; ;; print qq{match, captured '$1'} if $s =~ m{ < ([^<>]*) > }xms; "
match, captured '1e12 chars, none of which is a left or right angle bracket'

My theoretical regex-fu is weak, but as I understand it, doing this with the type of regex engine (RE) that Perl uses, an NFA, would be impossible without a fundamental and total re-write of the RE code.

However, a DFA RE would, I imagine, be a different story. Insofar as I understand it, a DFA operates on a single character at a time without backtracking. It is the state-machine approach you mention above. The capabilities of a DFA RE are much more limited than Perl's much-enhanced (and no longer 'regular') NFA RE. However, I believe the example regex above is compatible with both NFA and DFA REs.

If your regex could be expressed in terms acceptable to a DFA RE, there are engines already available that could, I (again) imagine, be 'easily' adapted to your application, a Simple Matter Of Programming: get a bunch of characters into a buffer; feed them one-by-one to the DFA RE; when the buffer becomes empty, get a bunch more characters; repeat until a match or end-of-file happens. Handwaving ends. Good luck in your endeavor, and I would be interested to learn your ultimate experience.
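To make the handwaving a little more concrete, here is a minimal sketch of that buffer-and-feed loop, hand-coded for the single pattern m{ < [^<>]* > }xms rather than driven by a real DFA engine. The sub name find_angle_pair, the default chunk size, and the decision to return offsets instead of the captured text (which could itself be close to 1 TB) are all my own assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: find the first match of m{ < [^<>]* > }xms in a
# filehandle while holding only one buffer in memory at a time.
# Returns (start_offset, end_offset) of the match, or an empty list.
sub find_angle_pair {
    my ($fh, $bufsize) = @_;
    $bufsize ||= 64 * 1024;    # arbitrary chunk size (an assumption)

    my $inside = 0;            # state: 0 = no '<' pending, 1 = after a '<'
    my $start;                 # offset of the most recent '<'
    my $pos = 0;               # offset of the current character

    while (1) {
        my $n = read($fh, my $buf, $bufsize);
        die "read failed: $!" unless defined $n;
        last if $n == 0;       # end of file, no match

        for my $i (0 .. $n - 1) {
            my $c = substr($buf, $i, 1);
            if ($c eq '<') {
                # [^<>] cannot contain '<', so a new '<' restarts the match
                $inside = 1;
                $start  = $pos;
            }
            elsif ($c eq '>' && $inside) {
                return ($start, $pos);
            }
            $pos++;
        }
    }
    return;
}

# Demo on an in-memory filehandle, with a tiny buffer so the match
# straddles several chunk boundaries.
my $data = 'ab<cd<efg>h';
open my $fh, '<', \$data or die "open failed: $!";
if (my ($from, $to) = find_angle_pair($fh, 4)) {
    print "match at offsets $from..$to\n";
}
close $fh;
```

The state survives across calls to read, so a match may span any number of buffers. For a real file you would probably want binmode on the handle so that the offsets are byte offsets.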

Update: "... a DFA operates on a single character at a time without backtracking." That thought was badly conceived and expressed. I suppose what I was thinking was that the pattern m{ < [^<>]* > }xms is inherently atomic (Update: hence no backtracking need occur). I have spent too little time in DFA-land to know if any such regex compiler would be smart enough to recognize this fact or could be clued-in via a construct like Perl's (?>pattern) atomic grouping or possessive quantifiers. Just more handwaving, really.
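For what it is worth, here is a small sketch of the three spellings on the Perl side only (it says nothing about what a DFA compiler would do). Because [^<>]* can never consume a '>', giving characters back cannot help the match, so the plain, possessive and atomic forms should all find the same text. The sub name matched_text and the sample string are made up for illustration; possessive quantifiers need Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text matched by $re in $s, or undef.
sub matched_text {
    my ($s, $re) = @_;
    return $s =~ $re ? substr($s, $-[0], $+[0] - $-[0]) : undef;
}

my %re = (
    plain      => qr{ < [^<>]*     > }xms,
    possessive => qr{ < [^<>]*+    > }xms,   # never gives characters back
    atomic     => qr{ < (?>[^<>]*) > }xms,   # same idea, older syntax
);

my $s = 'xx<abc<def>yy';
for my $name (sort keys %re) {
    my $m = matched_text($s, $re{$name});
    print "$name: ", (defined $m ? "'$m'" : 'no match'), "\n";
}
```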

Re^3: Possible to have regexes act on file directly (not in memory)
by Nocturnus (Scribe) on May 04, 2014 at 08:30 UTC

    Well, thank you very much for making me read some articles about the theory of FA :-).

    While the patterns I have to search for are more complex than the example I mentioned above, I think they are still simple enough to be implemented in the form of a linearly scanning state machine, so that's the way I will go. If the patterns get more complex, I will extend or rewrite my parser, hoping that I don't end up having to write my own regex parser. Just kidding ...