in reply to Re: Multi-thread combining the results together
in thread Multi-thread combining the results together

Interesting. Here is one example of what this does:
testing: "k5bai"
 0: ^.5BAI$
 1: ^K.BAI$
 2: ^K5.AI$
 3: ^K5B.I$
 4: ^K5BA.$
 5: .*5BAI\z
 6: ^K5ABI\z <- N2 error
 7: ^K5BIA\z <- N2 error
 8: ^K5IBA\z <- really bad
 9: ^K5AIB\z
10: ^KB5AI\z
11: ^K5BA\z
12: ^K5BI\z
13: ^K5AI\z
14: ^K5B\z
my regex = (^.5BAI$)|(^K.BAI$)|(^K5.AI$)|(^K5B.I$)|(^K5BA.$)|(.*5BAI\z)|(^K5ABI\z)|(^K5BIA\z)|(^K5IBA\z)|(^K5AIB\z)|(^KB5AI\z)|(^K5BA\z)|(^K5BI\z)|(^K5AI\z)|(^K5B\z)
Instead of running the regex against each element of @tokens, I suspect that it would be faster to run it once against a single string built by concatenating all of the tokens.
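
A rough sketch of the contrast (hypothetical; build_regex(), $word and @tokens stand in for the real code, which isn't shown here):

    my $re = build_regex($word);      # e.g. the big alternation above

    # current approach: one match attempt per token (~80K matches)
    my @hits = grep { /$re/ } @tokens;

    # suspected faster approach: one pass over a single joined string.
    # Caveats: the per-token ^ $ \z anchors in $re would need adjusting
    # first (see the replies below), and the capture groups in $re would
    # have to become non-capturing (?:...) for //g to return whole tokens.
    my $concat = join "\n", @tokens;
    my @hits2  = $concat =~ /($re)/g;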

I haven't thought about this code for many moons. Time for a re-think.

Re^3: Multi-thread combining the results together
by 1nickt (Canon) on Jul 25, 2019 at 11:12 UTC
Re^3: Multi-thread combining the results together
by vr (Curate) on Jul 25, 2019 at 12:29 UTC

    If the original build_regex() can return an expression with "start of string" or "end of string" markers, these should of course be replaced, for my approach, with lookarounds for $sep.

    A huge regex with many alternations is different from what I suggested, and it may suit your data/expected output better.
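
    A rough sketch of that substitution (assuming build_regex() returns a plain pattern string whose only anchors are ^ and $, with no capture groups of its own, and that $sep is a single character that never occurs inside a token):

        my $pat = build_regex($word);               # e.g. '^K5.AI$|^K5B.I$'
        my $s   = quotemeta $sep;
        $pat =~ s/\^/(?<=$s)/g;                     # start-of-token -> lookbehind
        $pat =~ s/\$/(?=$s)/g;                      # end-of-token   -> lookahead
        my $concat = join $sep, '', @tokens, '';    # $sep at both ends as well
        my @hits   = $concat =~ /($pat)/g;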

      ... "start of string" or "end of string" markers ... should be replaced ... with lookarounds for $sep ...

      But if $sep can be \n (newline), then the lookarounds become the built-in ^ $ anchors (with the /m modifier asserted, of course), which I would expect to be significantly faster than a constructed lookaround. The
          my $concat = join $sep, '', @tokens, '';
      statement building the target string becomes
          my $concat = join $sep, @tokens;
      because ^ $ always behave as expected at start/end-of-string.
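
      For example, a minimal sketch (the pattern here is only a stand-in for what build_regex() actually produces):

          my @tokens = qw(K5BAI K5ABI W5XYZ K5BA);
          my $re     = qr/^.5BAI$|^K5BA$/m;     # /m: ^ $ also match around each \n
          my $concat = join "\n", @tokens;      # no leading/trailing '' needed
          my @hits   = $concat =~ /($re)/g;     # @hits is ('K5BAI', 'K5BA')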


      Give a man a fish:  <%-{-{-{-<

        That's an interesting idea. I was thinking of trying a single string made of space-separated tokens; in that case the ^ and $ anchors would become \b's. And a grep is not needed, because I would be doing a global match against a single string instead of running the built regex 80K times against each token individually. There is no reason that I couldn't join the tokens with \n instead, and I could try that without modifying build_regex().
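
        A quick sketch of the space-separated variant (again with a stand-in pattern rather than real build_regex() output):

            my @tokens = qw(K5BAI K5ABI W5XYZ K5BA);
            my $re     = qr/\b.5BAI\b|\bK5BA\b/;  # ^ and $ rewritten as \b
            my $concat = join ' ', @tokens;
            my @hits   = $concat =~ /($re)/g;     # one global match, no grep
            # caveat: '.' can also match the separating space here, so this
            # is slightly looser than the original per-token anchors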

        As a note, the tokens in @tokens are all unique. For each token, I want either the whole token or nothing (a yes/no decision for each of the 80K tokens). A typical regex has 10-14 terms and produces a result set of about 6 matches from the 80K possibilities.

        If I can get maybe a 3x from algorithm improvements and another 3x from parallelization, I would be in the <10 minute max run time range, which is "good enough". As it turns out, in practice not every possibility needs to be run, and when a token needs to be investigated further for "close matches", I cache the result. More than a decade ago, the run time was 20 minutes max on a Win 95 machine. One of the "problems" with software that "works" is that it often winds up being applied to larger and larger data sets. The 80K terms are extracted from 3 million input lines; 12 years ago, this was only 200K input lines and a much smaller @tokens array!

        I appreciate all of the ideas in this thread! I have a lot of experimentation ahead of me.

        Ultimately, I would like to develop an algorithm that builds some kind of a tree structure which can be traversed much faster than any regex approach. I figure that will be non-trivial to accomplish.
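
        Something along these lines, perhaps (purely a hypothetical sketch of the data structure, not code from my program): build a character trie over the tokens as a hash of hashes, then walk it with support for the single-character '.' wildcard.

            # build a character trie (hash of hashes) over all tokens
            my @tokens = qw(K5BAI K5ABI K5BXI W5XYZ);
            my %trie;
            for my $tok (@tokens) {
                my $node = \%trie;
                $node = $node->{$_} //= {} for split //, $tok;
                $node->{''} = 1;                      # '' marks end of a token
            }

            # collect every token matching a pattern of literal characters
            # and '.' wildcards (no '.*' handling in this sketch)
            sub trie_find {
                my ($node, $sofar, @pat) = @_;
                return exists $node->{''} ? ($sofar) : () unless @pat;
                my $c = shift @pat;
                my @found;
                for my $k ( $c eq '.' ? grep { length } keys %$node : ($c) ) {
                    next unless exists $node->{$k};
                    push @found, trie_find( $node->{$k}, $sofar . $k, @pat );
                }
                return @found;
            }

            my @hits = trie_find( \%trie, '', split //, 'K5B.I' );
            # @hits is ('K5BAI', 'K5BXI'), in hash order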

        Update: I tried the idea of a multi-line, global match against a single string of \n-separated tokens instead of running the regex on each token individually. It produces the same result, but it is significantly slower than the current code, so that approach didn't pan out. Next up: I will try the \b idea.