in reply to Possible to have regexes act on file directly (not in memory)

Does your input file have any concept of logical records? Or how about this: Would your regex have any literal anchor that can be known ahead of time?

It could be that "newlines" are an infrequent occurrence. But maybe there is some other possible concrete point of reference that could be used as a record separator so that you're reading in smaller chunks that have some logical relationship to one another.
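As a rough illustration of the record-separator idea, here is a minimal sketch in Python (in Perl, setting `$/` to the separator gives you much the same thing for free). The function name `read_records`, the `block_size` default, and the `;;` separator in the usage note are assumptions made up for this example; nothing here comes from the original question.

```python
import io


def read_records(stream, separator, block_size=64 * 1024):
    """Yield the pieces of `stream` that lie between occurrences of `separator`."""
    buf = ""
    while True:
        block = stream.read(block_size)
        if not block:
            break
        buf += block
        # Everything before the last separator is complete; the final piece
        # may be a partial record (or a partial separator), so keep it.
        *records, buf = buf.split(separator)
        for record in records:
            yield record
    if buf:
        yield buf  # trailing record with no separator after it
```

Each record can then be matched on its own, for example `for rec in read_records(fh, ";;"): ...`, so only one record plus one block needs to be in memory at a time. This only helps if no match can span a separator.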

If that turns out to not be the case, then I suppose your best solution will be the one others have mentioned; determine what the largest possible "match" could be, set your chunk size to that size, and read a starter chunk. Then read a second chunk, concatenate them, do a pattern match, discard the first chunk, read a third, concatenate, match, repeat.
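The read, concatenate, match, discard loop could look roughly like the sketch below, written in Python. It only answers "is there a match anywhere?", and it assumes the pattern uses no anchors or lookaround and never matches more than `max_match_len` characters. The name `contains_match` and its parameters are invented for the example.

```python
import io
import re


def contains_match(stream, pattern, max_match_len):
    """Return True if `pattern` matches anywhere in `stream`."""
    regex = re.compile(pattern)
    previous = stream.read(max_match_len)      # starter chunk
    while True:
        current = stream.read(max_match_len)   # next chunk ("" at end of input)
        # A match of at most max_match_len characters that starts in
        # `previous` always fits inside the two chunks joined together.
        if regex.search(previous + current):
            return True
        if not current:
            return False
        previous = current                     # discard the older chunk
```

For instance, `contains_match(io.StringIO("aaaaNEEDLEbbbb"), "NEEDLE", 6)` finds the match even though it straddles the first two 6-character chunks.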


Dave

Re^2: Possible to have regexes act on file directly (not in memory)
by Laurent_R (Canon) on May 02, 2014 at 22:28 UTC
    If that turns out to not be the case, then I suppose your best solution will be the one others have mentioned; determine what the largest possible "match" could be, set your chunk size to that size, and read a starter chunk. Then read a second chunk, concatenate them, do a pattern match, discard the first chunk, read a third, concatenate, match, repeat.

    I agree with the general approach, but not with the details. There is no reason to choose a chunk size equal to the largest possible match; the chunk size can be much larger.

    Suppose the maximum length of a possible match is 10 characters (or bytes, or whatever). You certainly don't want to read your file in chunks of 10 characters. That would be fairly inefficient.

    Depending on your system, it might be more efficient to read chunks of, say, 1 MB. The only thing you need to do is keep the last 10 characters of the previous chunk and prepend them to the next chunk before proceeding. In other words, append the next MB of data to the last 10 characters of the previous chunk, then run your regex again on the result.
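To make the large-chunk-plus-overlap idea concrete, here is a minimal Python sketch (the same structure should carry over to Perl with `read` and `substr`). It is one possible reading of the description above, not a definitive implementation. The name `scan_stream` and its parameters are made up for the example, and it assumes the pattern has no anchors or lookaround and never matches the empty string. It carries over `max_match_len - 1` characters rather than the full maximum length, and it postpones any match that starts inside that tail until more data has arrived, so nothing is reported twice or cut short.

```python
import io
import re


def scan_stream(stream, pattern, max_match_len, chunk_size=1024 * 1024):
    """Yield (offset, text) for each match, reading chunk_size characters at a time."""
    regex = re.compile(pattern)
    keep = max_match_len - 1   # most characters a split match can leave behind
    buf = ""
    buf_offset = 0             # position of buf[0] within the whole stream
    at_eof = False
    while not at_eof:
        chunk = stream.read(chunk_size)
        at_eof = not chunk
        buf += chunk
        # Before end of input, a match starting in the last `keep` characters
        # might be cut short, so it is left for the next round.
        limit = len(buf) if at_eof else max(len(buf) - keep, 0)
        resume = 0
        for m in regex.finditer(buf):
            if m.start() >= limit:
                break
            yield buf_offset + m.start(), m.group()
            resume = m.end()
        # Drop everything already handled; keep only the unexamined tail.
        cut = max(limit, resume)
        buf_offset += cut
        buf = buf[cut:]
```

Usage would be along the lines of `for offset, text in scan_stream(open(path), r"ab{1,4}", max_match_len=5): ...`, where `path` is whatever file you are scanning. Memory use stays at roughly one chunk plus the short tail, regardless of file size.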