in reply to 5.10.0 regex slowdown

BrowserUk,
Have you read perldelta paying attention to the new variable ${^RE_TRIE_MAXBUF}. Hopefully that's all it is (guessing).

Cheers - L~R

Replies are listed 'Best First'.
Re^2: 5.10.0 regex slowdown
by BrowserUk (Patriarch) on Feb 21, 2008 at 01:59 UTC

    There seems to be precious little further info on that var, though google did turn up this

    This value by default is 65536 which corresponds to a 512kB temporary cache. Set this to a higher value to trade memory for speed when matching large alternations.

    But so far, setting it higher (2**17) or lower (256) doesn't seem to change the perform or the break point one iota. Maybe demerphq will happen by and give us the skinny on it.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      BrowserUk,
      Ok, next thing you might want to look at (still guessing here) is the re pragma. Specifically the trie settings in 'Debug' mode (note 'Debug' ne 'debug'). I have sent a message to demerphq outside of PM channels in case he doesn't log in relatively soon as well.

      Cheers - L~R

      There seems to be precious little further info on that var

      Well, it's in in the perlvar manual page in my install ;-)

      But it looks like one has to use English to see its default value:

      qwurx [shmem] ~/perlmonks > perl5.10.0 -le 'print ${^RE_TRIE_MAXBUF}' qwurx [shmem] ~/perlmonks > perl5.10.0 -MEnglish -le 'print ${^RE_TRI +E_MAXBUF}' 65536

      But yes, the number 2**16 rings a bell. It looks like the TRIE-optimization has that limit (of tokens? objects? branches?). Running with -D512 reveals

      13104 probes:

      EXECUTING... Trying 13104 probes with perl 5.010000 at 669148.pl line 18. 1203590260 at 669148.pl line 20. Compiling REx "TATGTTTCGT|CCGCTTTTTA|CGAAGATTTC|GAACGACGGC|TGTGTTTAAC| +CCTCA"... Final program: 1: TRIEC-EXACT[ACGT] (65526) <TATGTTTCGT> <CCGCTTTTTA> <CGAAGATTTC> ... <AACAGTGAGG> <GAAACTCGCG> <GAGAGATGGA> 65526: END (0) stclass AHOCORASICKC-EXACT[ACGT] minlen 10 Compiling REx "((?-xism:TATGTTTCGT|CCGCTTTTTA|CGAAGATTTC|GAACGACGGC|TG +TGTTT"... Final program: 1: OPEN1 (3) 3: TRIEC-EXACT[ACGT] (65529) <TATGTTTCGT> <CCGCTTTTTA> <CGAAGATTTC> ... <AACAGTGAGG> <GAAACTCGCG> <GAGAGATGGA> 65529: CLOSE1 (65531) 65531: END (0) stclass AHOCORASICKC-EXACT[ACGT] minlen 10 Matching REx "((?-xism:TATGTTTCGT|CCGCTTTTTA|CGAAGATTTC|GAACGACGGC|TGT +GTTT"... against "ACTCGAATTCCGAATAGATAGAAGTCTGCTGATAATATCGCGCCGGT TCTGATGCGCCTC"... Matching stclass AHOCORASICKC-EXACT[ACGT] against "ACTCGAATTCCGAATAGAT +AGAAGTCTGCTGATAATATCGCGCCGGTTCTGATGCGCCTC"... (1000000 chars) 0 <> <ACTCGAATTC> | Charid: 2 CP: 41 State: 1, word=0 +- legal 1 <A> <CTCGAATTCC> | Charid: 4 CP: 43 State: 52, word=0 +- legal 2 <AC> <TCGAATTCCG> | Charid: 1 CP: 54 State: ce, word=0 +- legal 3 <ACT> <CGAATTCCGA> | Charid: 4 CP: 43 State: 1a2, word=0 +- legal 4 <ACTC> <GAATTCCGAA> | Charid: 3 CP: 47 State: 1a3, word=0 +- legal

      13105 probes:

      EXECUTING... Trying 13105 probes with perl 5.010000 at 669148.pl line 18. 1203590218 at 669148.pl line 20. Compiling REx "TATGTTTCGT|CCGCTTTTTA|CGAAGATTTC|GAACGACGGC|TGTGTTTAAC| +CCTCA"... Final program: 1: TRIEC-EXACT[ACGT] (65531) <TATGTTTCGT> <CCGCTTTTTA> <CGAAGATTTC> ... <GAAACTCGCG> <GAGAGATGGA> <CGCCGAGGAT> 65531: END (0) stclass AHOCORASICKC-EXACT[ACGT] minlen 10 Compiling REx "((?-xism:TATGTTTCGT|CCGCTTTTTA|CGAAGATTTC|GAACGACGGC|TG +TGTTT"... Final program: 1: OPEN1 (3) 3: BRANCHJ (11) 5: EXACT <TATGTTTCGT> (9) 9: LONGJMP (104850) 11: BRANCHJ (19) 13: EXACT <CCGCTTTTTA> (17) 17: LONGJMP (104850) 19: BRANCHJ (27) 21: EXACT <CGAAGATTTC> (25) 25: LONGJMP (104850) ...

      The 65531: END (0) looks - though I don't know at all what it means - just too close to 2**16 ...

      --shmem

      _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                    /\_¯/(q    /
      ----------------------------  \__(m.====·.(_("always off the crowd"))."·
      ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}

        See those BRANCHJ EXACT LONGJMP tuples? They are confusing the optimiser so that it doesnt recognize this as a trie'able sequence. Ill have to think about what i can do about that.

        I doubt ill have time soon tho.

        Offline ysth mentioned a cool suggestion:

        [2008-02-22 01:45] <ysth1> I ended up dividing my list into three diff +erent regexes and used them like (?:(??{$rxa})|(??{$rxb})|(??{$rxc}))

        Which isnt ideal but better than no trie at all. Sorry about this. :-(

        ---
        $world=~s/war/peace/g