in reply to Re^3: Possible to have regexes act on file directly (not in memory)
in thread Possible to have regexes act on file directly (not in memory)

I have basically explained how to do it: keeping part of the previous chunk and appending the new chunk to it. The real difficulty is whether it is possible to determine the length of longest possible match for the regex (which determines how much to keep from one chunk to another). For some regexes, it is very easy, for others, it is very difficult or even impossible. The OP does not give enough information on that.

Then, there is the question of the size of the input. On my server, processing a 10 GB (line-based) file with a relatively simple regex might take 5 to 10 minutes. It would probably be a bit faster if not line-based, reading chunks of say 1 MB. With a TB of data, it would take quite a bit of time, but that might still be relatively manageable. But that's assuming a simple regex with no need to backtrack. With a regex implying a lot of backtracking, it might very easily be completely unmanageable.

  • Comment on Re^4: Possible to have regexes act on file directly (not in memory)

Replies are listed 'Best First'.
Re^5: Possible to have regexes act on file directly (decompose regex)
by LanX (Saint) on May 02, 2014 at 19:28 UTC
    > For some regexes, it is very easy, for others, it is very difficult or even impossible.

    it's often much simpler as you might think, cause you can decompose regexes into smaller and easier parts.

    perl -e'use re 'debug';qr/x{100}.*y{100}/' Compiling REx "x{100}.*y{100}" Final program: 1: CURLY {100,100} (5) 3: EXACT <x> (0) 5: STAR (7) 6: REG_ANY (0) 7: CURLY {100,100} (11) 9: EXACT <y> (0) 11: END (0) anchored "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"... at 0 floating "yyyyyyyyyy +yyyyyyyyyyyyyyyyyyyy"... at 100..2147483647 (checking floating) minle +n 200 Freeing REx: "x{100}.*y{100}"

    In this case you start looking for 'x'x100 in a sliding window of size >200 from the beginning. Then you search backwards from the end in sliding windows for 'y'x100.

    Like this even greedy matches can be handled (mostly) and the total match might even cover terrabytes.

    Cheers Rolf

    ( addicted to the Perl Programming Language)

      Thank you for you very interesting point.

      If I have understood you correctly (I think I did), this is IMHO no longer a straight regex pattern match, but this is really becoming a very basic parser using several regexes. Or it can be viewed as building your own very basic custom pseudo-regex engine in pure Perl, using Perl's built-in regex engine for matching the various components of the original regex pattern.

      Yes, I believe that this can indeed work on a number of cases, but this can quickly become hairy for somewhat complicated patterns having more than one * or + quantifiers, alternations, captured matches, look-around assertions, etc. I have actually built a few times very primitive parsers using several regexes to parse progressively data that I could not figure out how to analyze with a single regex. This works, but this means that if the pattern changes, then you don't only need to change the pattern, but you also probably need to write different code around it.

      But all this is really speculation so long as the OP does not tell us which kind of regexes she or he wants to use.

        It's recursive, I don't know if all cases can be covered¹ but most cases are much easier then one might think:

        > perl -e'use re 'debug'; qr/(x{100}.*y{100})*/' Compiling REx "(x{100}.*y{100})*" Final program: 1: CURLYX[0] {0,32767} (18) 3: OPEN1 (5) 5: CURLY {100,100} (9) 7: EXACT <x> (0) 9: STAR (11) 10: REG_ANY (0) 11: CURLY {100,100} (15) 13: EXACT <y> (0) 15: CLOSE1 (17) 17: WHILEM[1/1] (0) 18: NOTHING (19) 19: END (0) minlen 0 Freeing REx: "(x{100}.*y{100})*"

        > But all this is really speculation so long as the OP does not tell us which kind of regexes she or he wants to use.

        Yes the monastery has the tendency to answer kindergarten questions with a PHD thesis.

        I certainly won't start to implement it, just showing the theoretical approach.

        Cheers Rolf

        ( addicted to the Perl Programming Language)

        update

        ¹) with an opcode interpreter