... I am forbidden to assume anything regarding their contents or structure ...

This means you may not assume that in a file/string 1 TB in length, the length of a match is less than the length of the string. Or in code, writing something that could do this:

c:\@Work\Perl\monks>perl -wMstrict -le "my $s = '<1e12 chars, none of which is a left or right angle bracket>'; ;; print qq{match, captured '$1'} if $s =~ m{ < ([^<>]*) > }xms; " match, captured '1e12 chars, none of which is a left or right angle br +acket'

My theoretical regex-fu is weak, but as I understand it, doing this with the type of regex engine (RE) that Perl uses, an NFA, would be impossible without a fundamental and total re-write of the RE code.

However, a DFA RE would, I imagine, be a different story. Insofar as I understand it, a DFA operates on a single character at a time without backtracking. It is the state-machine approach you mention above. The capabilities of a DFA RE are much more limited than Perl's much-enhanced (and no longer 'regular') NFA RE. However, I believe the example regex above is compatible with both NFA and DFA REs.

If your regex could be expressed in terms acceptable to a DFA RE, there are engines already available that could, I (again) imagine, be 'easily' adapted to your application, a Simple Matter Of Programming: get a bunch of characters into a buffer; feed them one-by-one to the DFA RE; when the buffer becomes empty, get a bunch more characters; repeat until a match or end-of-file happens. Handwaving ends. Good luck in your endeavor, and I would be interested to learn your ultimate experience.

Update:   "... a DFA operates on a single character at a time without backtracking."   That thought was badly conceived and expressed. I suppose what I was thinking was that the pattern  m{ < [^<>]* > }xms is inherently atomic (Update: hence no backtracking need occur). I have spent too little time in DFA-land to know if any such regex compiler would be smart enough to recognize this fact or could be clued-in via a construct like Perl's  (?>pattern) atomic grouping or possessive quantifiers. Just more handwaving, really.


In reply to Re^2: Possible to have regexes act on file directly (not in memory) by AnomalousMonk
in thread Possible to have regexes act on file directly (not in memory) by Nocturnus

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.