in reply to Re^3: Efficient matching with accompanying data
in thread Efficient matching with accompanying data

I have no time to look into your code, but your benchmarks always talk about "... seconds using *a hash*".

Cheers Rolf

( addicted to the Perl Programming Language)


Re^5: Efficient matching with accompanying data
by BrowserUk (Patriarch) on Jul 11, 2013 at 16:25 UTC
    your benchmarks always talk about "... seconds using *a hash*"

    Did you read the line:

    Each run was manually killed after checking that the setting made no difference to the memory usage and ~20 seconds had elapsed. As you can see, it was still only processing ~10 lines/s.

    The benchmark code is the same as I posted above. It first tests the hash method (and prints its timing), then the BigOR regex method, and would then print its timing too, except that the regex engine takes 34.5 minutes to complete the test that the hash does in 0.17 seconds. I couldn't be bothered to wait the ~8 hours it would take to complete all 14 runs, so I monitored the tests, and when the running line count from the regex test showed it was still processing at ~1 line per second, I aborted that run (via the task manager) after ~20 seconds.
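
    For anyone wanting to reproduce the shape of that test, here is a minimal sketch of the two methods side by side (the word list and input lines below are tiny stand-ins, not the actual benchmark data):

        use strict;
        use warnings;
        use Time::HiRes qw( time );

        # Tiny stand-ins; the real test used Darwin's text and a large word list.
        my @words = qw( origin species selection variation struggle );
        my @lines = ( 'the origin of species', 'natural selection at work' );

        # Hash method: split each line and look each word up.
        my %isWord = map { $_ => 1 } @words;
        my $t0   = time();
        my $hits = 0;
        for my $line ( @lines ) {
            $hits += grep { $isWord{ $_ } } split ' ', $line;
        }
        printf "hash:  %d hits in %.6f s\n", $hits, time() - $t0;

        # BigOR method: one huge alternation tried against each line.
        my $bigOR = join '|', map { quotemeta } @words;
        $t0   = time();
        $hits = 0;
        for my $line ( @lines ) {
            $hits++ while $line =~ /\b(?:$bigOR)\b/g;
        }
        printf "regex: %d hits in %.6f s\n", $hits, time() - $t0;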

    I was also monitoring the memory usage of the processes, anticipating that if the trie optimisation was being skipped because it would require more memory than the preset limit, then once that limit had been raised high enough for the trie to be built there would be a very obvious jump in memory usage. No such jump ever occurred. All 14 instances of the program showed an identical 77MB max memory usage.

    If 2^16 equates to 512MiB, then 2^32 must equate to 2^32*8192 -> 32TiB? (which I obviously do not have), but somewhere in between 2^16 & 2^32, there should have been some indication that the trie was being built, and there was not. (I suspect that it also has some hard upper limit to the number of alternations it will try to handle.)
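
    The per-run setup amounts to the sketch below: the trie buffer limit (${^RE_TRIE_MAXBUF}, documented in perlvar) is raised before the pattern is compiled, while memory usage is watched externally. The generated alternation is a small stand-in for the real dictionary:

        use strict;
        use warnings;

        # Small stand-in alternation; the real runs used the full dictionary.
        my $alt  = join '|', map { "w$_" } 1 .. 1000;
        my $text = join ' ', map { "w$_" } 1 .. 1000;

        for my $exp ( 16 .. 32 ) {
            ${^RE_TRIE_MAXBUF} = 2**$exp;    # must be set before compilation
            my $re = qr/\b(?:$alt)\b/;       # compiled under the raised limit
            my $n  = () = $text =~ /$re/g;
            print "2**$exp: $n matches\n";   # watch the process's memory here
        }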

    I'd love to see a demonstration of the trie optimisation actually doing something.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Unfortunately I've spent some time I don't have on this! :(

      It seems that the trie optimization breaks for more than 15000 alternative patterns.

      Up to this number the performance is competitive:

      Finding 883 words (of 5038) took 0.039814 seconds using a hash
      Finding 784 words took 0.027246 seconds using a trie (via regex engine)

      Please note that somewhere while experimenting I introduced a bug which skipped 100 matches...

      I couldn't find the whole of Darwin's text, so I took chapter 2, and as a dictionary I took (and cleaned) the en-US.dic from Mozilla's path.

      Please note some changes I made (a short sketch illustrating them follows the list):

      1. I sorted descending by length, not ascending, to fulfill the OP's requirement of multi-word matches.

      2. You didn't use word delimiters, so the regex could start matching in the middle of a word. That's why I put spaces around the parens.

      3. I don't know why you lowercased the input; my dict is case-sensitive. As a side note, Perl has lc.

      4. I don't see the point of looping over m//g if you can get all matches in one attempt.

      5. I precompiled the regex to avoid recompilations.
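
      Here is the short sketch of those changes promised above (the dictionary and text are stand-ins for the cleaned en-US.dic and Darwin's chapter 2, and \b stands in for the literal spaces from point 2):

          use strict;
          use warnings;

          my @dict = ( 'natural selection', 'Species', 'origin' );

          # (1) sort descending by length so multi-word entries match first;
          # (2) delimit the alternation so it cannot start mid-word;
          # (5) precompile once with qr// to avoid recompilation.
          my $alt = join '|',
                    map  { quotemeta }
                    sort { length $b <=> length $a } @dict;
          my $re  = qr/\b(?:$alt)\b/;

          # (3) no lowercasing, so matching stays case-sensitive;
          # (4) all matches in one list-context attempt instead of a loop.
          my $text    = 'On the Origin of Species by means of natural selection';
          my @matches = $text =~ /$re/g;
          print "@matches\n";    # "Species natural selection"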

      > I'd love to see a demonstration of the trie optimisation actually doing something.

      OK, I'm posting the code now as-is, as a starting point for everyone who wants to experiment with it. Disabling the trie buffer (with a negative value) will show the effect of the missing trie.
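
      In essence the comparison boils down to this sketch (with a generated stand-in alternation; the posted code uses the dictionary instead):

          use strict;
          use warnings;
          use Time::HiRes qw( time );

          my $alt  = join '|', map { "word$_" } 1 .. 10_000;
          my $text = join ' ', map { "word$_" } 1 .. 10_000;

          for my $buf ( 2**20, -1 ) {    # a negative value disables the trie
              ${^RE_TRIE_MAXBUF} = $buf;
              my $re = qr/\b(?:$alt)\b/;
              my $t0 = time();
              my $n  = () = $text =~ /$re/g;
              printf "MAXBUF %8d: %d matches in %.3f s\n", $buf, $n, time() - $t0;
          }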

      have fun!

      Cheers Rolf

      ( addicted to the Perl Programming Language)

      PS: The OP's requirements weren't clear to me; if only whole words are of interest, I recommend sticking with the hash method.

        it seems that trie optimization breaks for more than 10000 alternatives.

        It seems the goalposts have moved slightly since it was first implemented; it was between 12,000 and 13,000 back then.

        For most of my stuff, that limit is way too low to consider.

        It'd be nice if you'd include the results of your benchmark in your post.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

      If 2^16 equates to 512MiB, then 2^32 must equate to 2^32*8192 -> 32TiB? (which I obviously do not have),

      Nitpicking: according to the manual fragment quoted above, 2^16 equates to a 512 KB cache (that's 8 bytes per unit: 65536 * 8 = 512 KB), so 2^32 should equate to 2^32 * 8 bytes -> 32 GB, not 32 TiB.