in reply to Re: Possible to have regexes act on file directly (not in memory)
... I am forbidden to assume anything regarding their contents or structure ...
This means you may not assume that, in a file/string 1 TB in length, the length of a match is less than the length of the string. Or, in code, you would need to write something that could do this:
c:\@Work\Perl\monks>perl -wMstrict -le
"my $s = '<1e12 chars, none of which is a left or right angle bracket>';
 ;;
 print qq{match, captured '$1'} if $s =~ m{ < ([^<>]*) > }xms;
"
match, captured '1e12 chars, none of which is a left or right angle bracket'
My theoretical regex-fu is weak, but as I understand it, doing this with the type of regex engine (RE) that Perl uses, an NFA, would be impossible without a fundamental and total re-write of the RE code.
However, a DFA RE would, I imagine, be a different story. Insofar as I understand it, a DFA operates on a single character at a time without backtracking. It is the state-machine approach you mention above. The capabilities of a DFA RE are much more limited than Perl's much-enhanced (and no longer 'regular') NFA RE. However, I believe the example regex above is compatible with both NFA and DFA REs.
If your regex could be expressed in terms acceptable to a DFA RE, there are engines already available that could, I (again) imagine, be 'easily' adapted to your application, a Simple Matter Of Programming: get a bunch of characters into a buffer; feed them one-by-one to the DFA RE; when the buffer becomes empty, get a bunch more characters; repeat until a match or end-of-file happens. Handwaving ends.
Good luck in your endeavor, and I would be interested to learn your ultimate experience.
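To make the buffer-and-feed loop a little less handwavy, here is a minimal sketch of it for the one pattern m{ < [^<>]* > }xms, written as a hand-coded two-state machine rather than with any real DFA library. The name find_bracketed, the chunk-size parameter and the (offset, length) return convention are all my own inventions for illustration. It reports where the match is instead of capturing it, since the captured text could itself be close to 1 TB.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any module): scan a filehandle for the
# first match of <[^<>]*> one character at a time, refilling a
# fixed-size buffer as needed. Returns (start_offset, length) of the
# match, or an empty list if end-of-file is reached without a match.
sub find_bracketed {
    my ($fh, $chunk_size) = @_;
    $chunk_size ||= 64 * 1024;    # assumed default, tune to taste

    my $state  = 0;   # 0 = looking for '<', 1 = seen '<', waiting for '>'
    my $start  = 0;   # offset of the most recent '<'
    my $offset = 0;   # offset of the first character in the current buffer
    my $buf;

    while (my $got = read($fh, $buf, $chunk_size)) {
        for my $i (0 .. $got - 1) {
            my $c = substr($buf, $i, 1);
            if ($c eq '<') {
                # A '<' cannot be inside the class, so it always
                # (re)starts the candidate match here.
                ($state, $start) = (1, $offset + $i);
            }
            elsif ($c eq '>' && $state == 1) {
                return ($start, $offset + $i - $start + 1);
            }
            # Any other character leaves the state unchanged.
        }
        $offset += $got;
    }
    return;
}

# Usage: an in-memory filehandle and a tiny chunk size, so the match
# straddles buffer boundaries.
my $data = 'a<b<c>d';
open my $fh, '<', \$data or die "open failed: $!";
if (my ($pos, $len) = find_bracketed($fh, 2)) {
    print "match at offset $pos, length $len\n";
}
```

The only state carried across buffers is one flag and one offset, which is why the file size does not matter for this particular pattern. Whether an off-the-shelf DFA engine can be driven this way is a separate question I have not tested.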
Update: "... a DFA operates on a single character at a time without backtracking." That thought was badly conceived and expressed. I suppose what I was thinking was that the pattern m{ < [^<>]* > }xms is inherently atomic (Update: hence no backtracking need occur). I have spent too little time in DFA-land to know if any such regex compiler would be smart enough to recognize this fact or could be clued-in via a construct like Perl's (?>pattern) atomic grouping or possessive quantifiers. Just more handwaving, really.
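For what it is worth, the "inherently atomic" claim can at least be checked on the Perl side. The sketch below (the %pattern table and the first_match helper are made up for illustration) compares the plain pattern with its possessive-quantifier and atomic-group spellings, both of which need Perl 5.10 or later. It shows only that the three agree in Perl's engine; it says nothing about what any DFA compiler would do with them.

```perl
use strict;
use warnings;

# Three spellings of the same pattern: plain, possessive quantifier,
# and atomic group. The outer parentheses capture the whole match.
my %pattern = (
    plain      => qr{ ( < [^<>]*        > ) }xms,
    possessive => qr{ ( < [^<>]*+       > ) }xms,
    atomic     => qr{ ( < (?> [^<>]* )  > ) }xms,
);

# Hypothetical helper: return the first match of the named pattern
# in $s, or undef if there is none.
sub first_match {
    my ($name, $s) = @_;
    my ($m) = $s =~ $pattern{$name};
    return $m;
}

for my $s ('a<b<c>d', '<>', '<abc') {
    for my $name (sort keys %pattern) {
        my $m = first_match($name, $s);
        printf "%-10s on %-9s => %s\n",
            $name, "'$s'", defined $m ? "'$m'" : 'no match';
    }
}
```

The three agree because, once [^<>]* has consumed everything it can, the next character is a '<', a '>' or end-of-string, so giving characters back can never turn a failure into a match.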
Re^3: Possible to have regexes act on file directly (not in memory)
by Nocturnus (Scribe) on May 04, 2014 at 08:30 UTC