... I am forbidden to assume anything regarding their contents or structure ...

This means you may not assume that, in a file or string 1 TB in length, a match is any shorter than the string itself. In code, you would need something that could do this:

c:\@Work\Perl\monks>perl -wMstrict -le
"my $s =
   '<1e12 chars, none of which is a left or right angle bracket>';
 ;;
 print qq{match, captured '$1'} if $s =~ m{ < ([^<>]*) > }xms;
"
match, captured '1e12 chars, none of which is a left or right angle bracket'

My theoretical regex-fu is weak, but as I understand it, doing this with the type of regex engine (RE) that Perl uses, an NFA, would be impossible without a fundamental and total re-write of the RE code.

However, a DFA RE would, I imagine, be a different story. Insofar as I understand it, a DFA operates on a single character at a time without backtracking. It is the state-machine approach you mention above. The capabilities of a DFA RE are much more limited than Perl's much-enhanced (and no longer 'regular') NFA RE. However, I believe the example regex above is compatible with both NFA and DFA REs.

If your regex could be expressed in terms acceptable to a DFA RE, there are engines already available that could, I (again) imagine, be 'easily' adapted to your application, a Simple Matter Of Programming: get a bunch of characters into a buffer; feed them one-by-one to the DFA RE; when the buffer becomes empty, get a bunch more characters; repeat until a match or end-of-file happens. Handwaving ends. Good luck in your endeavor, and I would be interested to learn your ultimate experience.
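To make the handwaving slightly more concrete, here is a minimal sketch of the buffer-and-feed loop for the one pattern above, m{ < [^<>]* > }xms, written as a hand-rolled state machine rather than with any real DFA library. The sub name find_bracketed and the 64 KB default buffer size are my own inventions for illustration. It reports byte offsets instead of capturing text, on the assumption that a 1 TB capture would not fit in memory anyway.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any module): scan a filehandle for the
# first '<' [^<>]* '>' and return the inclusive (start, end) byte offsets
# of the match, or an empty list if there is none.
sub find_bracketed {
    my ($fh, $bufsize) = @_;
    $bufsize ||= 64 * 1024;    # assumed default; tune to taste

    my $pos = 0;    # stream offset of the first byte in the current buffer
    my $start;      # offset of the latest candidate '<'; undef if none yet

    # Refill the buffer until read() reports end-of-file (0) or error (undef).
    while (my $n = read($fh, my $buf, $bufsize)) {
        for my $i (0 .. $n - 1) {
            my $c = substr($buf, $i, 1);
            if ($c eq '<') {
                # A new '<' always supersedes an earlier, unclosed one.
                $start = $pos + $i;
            }
            elsif ($c eq '>' and defined $start) {
                return ($start, $pos + $i);
            }
            # Any other character leaves the state unchanged.
        }
        $pos += $n;
    }
    return;    # no match before end-of-file
}
```

Called as my ($from, $to) = find_bracketed($fh); on a handle opened for reading, it only ever holds one buffer in memory, however long the match turns out to be. The character-at-a-time substr loop is slow in Perl; it is meant to show the shape of the approach, not to be fast.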

Update: "... a DFA operates on a single character at a time without backtracking." That thought was badly conceived and expressed. I suppose what I was thinking was that the pattern m{ < [^<>]* > }xms is inherently atomic (Update: hence no backtracking need occur). I have spent too little time in DFA-land to know if any such regex compiler would be smart enough to recognize this fact or could be clued-in via a construct like Perl's (?>pattern) atomic grouping or possessive quantifiers. Just more handwaving, really.
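For reference, this is how the two Perl constructs mentioned above look when applied to the example pattern. It is only an illustration on a short test string of my own choosing: it shows that all three spellings capture the same text, and says nothing about what a DFA-based engine would accept or do.

```perl
use strict;
use warnings;

my $s = 'xx <abc> yy';    # short stand-in for the 1e12-character string

my @captures;
for my $re (
    qr{ < ([^<>]*)        > }xms,    # plain greedy quantifier
    qr{ < ([^<>]*+)       > }xms,    # possessive quantifier (Perl 5.10+)
    qr{ < ((?> [^<>]* ))  > }xms,    # atomic grouping
) {
    if ($s =~ $re) {
        push @captures, $1;
        print "match, captured '$1'\n";
    }
}
```

Because [^<>]* can never consume the closing >, giving back characters could never help this pattern match, so the possessive and atomic forms change nothing about what matches here.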

In reply to Re^2: Possible to have regexes act on file directly (not in memory) by AnomalousMonk
in thread Possible to have regexes act on file directly (not in memory) by Nocturnus
