in reply to Parsing 12GB Entourage database in pieces...

There might be a completely different solution. Using Sys::Mmap, you can map the entire file to a single string, and then brute-force through that string with something like while ($string =~ m/.../g) {...} (like you hinted in your post).
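A minimal sketch of that loop, in case it helps. The sub name scan_matches, its callback interface, and the demo pattern are hypothetical (not from the original post). The scanning part uses only core Perl, so it can be tried on an ordinary string first. The commented-out part assumes Sys::Mmap is installed from CPAN and that perl is a 64-bit build able to address a 12GB mapping; the mmap() call follows my reading of that module's documentation and is untested here.

```perl
use strict;
use warnings;

# Hypothetical helper: walk a (possibly memory-mapped) string with m//g and
# hand each match's start offset and length to a callback. Taking a reference
# avoids copying the huge string.
sub scan_matches {
    my ($string_ref, $pattern, $callback) = @_;
    my $count = 0;
    while ($$string_ref =~ /$pattern/g) {
        # @- and @+ hold the start and end offsets of the current match.
        my ($start, $length) = ($-[0], $+[0] - $-[0]);
        $callback->($start, $length);
        $count++;
    }
    return $count;
}

# Small in-memory demo with a placeholder pattern.
my $data = "xxABCxxABDxx";
my @hits;
my $n = scan_matches(\$data, qr/AB./, sub { push @hits, [@_] });
print "$n matches\n";
printf "offset %d, length %d\n", @$_ for @hits;

# With Sys::Mmap (assumption: installed, 64-bit perl), the string would be
# the mapped file instead of $data:
#
#   use Sys::Mmap;
#   open(my $fh, '<', $path) or die "open: $!";
#   my $db;
#   mmap($db, 0, PROT_READ, MAP_SHARED, $fh) or die "mmap: $!";
#   scan_matches(\$db, $pattern, sub { ... });
#   munmap($db);
```

Passing a reference and reporting offsets rather than matched text keeps the helper from copying large pieces of the mapped file; the caller can substr() out only what it needs.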

However, there are some caveats:

Re^2: Parsing 12GB Entourage database in pieces...
by ikegami (Patriarch) on Aug 28, 2008 at 20:06 UTC

    I don't know how regular expressions perform with such a huge string.

    Some uses of "*" are compiled as "{0,32767}", so you might have problems with runs longer than that.

    >perl -Mre=debug -we"qr/^(.)(\1*)\z/"
    ...
       9: CURLYX[1] {0,32767}(14)
    ...

    Be sure to prevent backtracking using an atomic group, (?>...), or (in 5.10.0+) a possessive quantifier such as *+.

    Update: "/\0\0MSrc.{16}((?>[^\0]*))(?=\0)/s" looks safe.
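    A quick way to try that pattern without the real database is to run it over synthetic data, as sketched below. The record layout (marker, 16 header bytes, NUL-free body, terminating NUL) is inferred from the pattern alone and is an assumption, as are the variable names. The possessive variant needs 5.10.0 or later.

    ```perl
    use strict;
    use warnings;

    # Synthetic stand-in for the database: two records, each a "\0\0MSrc"
    # marker, 16 header bytes, a NUL-free body and a terminating NUL.
    # (Assumed layout, inferred from the pattern only.)
    my @bodies = ('b' x 100_000, 'short body');
    my $data   = join '', map { "\0\0MSrc" . ('H' x 16) . $_ . "\0" } @bodies;

    # Atomic group: once [^\0]* has consumed the body, the engine may not
    # backtrack into it.
    my @lengths;
    while ($data =~ /\0\0MSrc.{16}((?>[^\0]*))(?=\0)/sg) {
        push @lengths, length $1;
    }
    print "atomic:     @lengths\n";

    # Possessive quantifier (5.10.0+), which should behave the same way here.
    my @possessive = map { length } $data =~ /\0\0MSrc.{16}([^\0]*+)(?=\0)/sg;
    print "possessive: @possessive\n";
    ```

    The first body is deliberately longer than 32767 characters, so a captured length of 100000 would suggest the plain character-class "*" is not subject to that limit.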

      Interesting...

      perl5.8.8 -wle '$s = "x" x 40_000; $s =~ /^(.)(\1*)/ and print length $2'
      # (segfaults)

      perl5.10.0 -wle '$s = "x" x 40_000; $s =~ /^(.)(\1*)/ and print length $2'
      Complex regular subexpression recursion limit (32766) exceeded at -e line 1.
      32767

      Well, at least it seems to warn when this limitation affects the result...

        I believe p5p is working on a patch to change that warning to a die.