in reply to Multi-thread combining the results together

It may be obvious and you may already have considered this (if so, I'm sorry, skip what follows), but you are starting the regex engine 6+ billion times. If %result ends up relatively sparsely populated, and if tokens can be joined with a clearly "alien" separator symbol (or sequence) that prevents matching across token boundaries, then matching against a single concatenated string (so the regex engine starts only N times) can help. In the code below, if line A is un-commented, then block B is executed N*N = 1e6 times as expected, and each token "matches" all other tokens -- very uninteresting. Otherwise, with a pickier criterion for one token being related to another, your goal of "at least 3x faster" is easily achieved even before parallelization.

```perl
use strict;
use warnings;
use feature 'say';
use Data::Dump 'dd';
use Time::HiRes 'time';

my $N = 1000;
srand 123;
my @tokens = map { int rand 1_000_000 } 1 .. $N;

sub build_regex {
    # return qr/\d+/;                       # line A
    my $s = shift;
    my $d = substr $s, 0, 1;
    qr/[0-9]$d\d{0,9}?$d/
}

{   # case 1
    my $t = time;
    my %result;
    foreach my $token (@tokens) {
        my $regex = build_regex($token);
        my @line_results = grep { $_ ne $token and /$regex/ } @tokens;
        $result{$token} = [@line_results];
    }
    say time - $t;
}

{   # case 2
    my $t = time;
    my $count = 0;
    my $sep = '~';
    my $sep_len = length $sep;
    my @idx;
    for ( 0 .. $#tokens ) {
        my $L = length $tokens[ $_ ];
        @idx[ map { $sep_len + @idx + $_ } 0 .. $L - 1 ] = ( $_ ) x $L
    }
    my $concat = join $sep, '', @tokens, '';
    my %result;
    for my $i ( 0 .. $#tokens ) {
        my $token = $tokens[ $i ];
        my $regex = build_regex( $token );
        $result{ $token } = [];
        my $prev = -1;
        while ( $concat =~ /$regex/g ) {    # block B
            my $j = $idx[ $-[ 0 ]];
            push @{ $result{ $token }}, $tokens[ $j ]
                if $j != $i and $j != $prev;
            $prev = $j;
            $count ++;
        }
    }
    say time - $t;
    say $count;
}

__END__
# Output with "A" line un-commented
0.978141069412231
1.23276996612549
1000000

# Output with "A" line commented-out
0.648768901824951
0.150562047958374
78176
```
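For readers puzzled by the @idx bookkeeping in case 2, here is a minimal standalone illustration (token values invented) of how each character position of $concat is mapped back to the index of the token it belongs to, so that $-[0], the offset where a match started, identifies the owning token:

```perl
# Minimal illustration of the @idx position-to-token mapping.
use strict;
use warnings;
use feature 'say';

my @tokens = ( '42', '777', '5' );
my $sep = '~';
my $sep_len = length $sep;

my @idx;
for ( 0 .. $#tokens ) {
    my $L = length $tokens[$_];
    # Fill the slots this token will occupy in $concat with its index;
    # separator slots are left undef.
    @idx[ map { $sep_len + @idx + $_ } 0 .. $L - 1 ] = ($_) x $L;
}
my $concat = join $sep, '', @tokens, '';   # "~42~777~5~"

# position:  0123456789
# $concat:   ~42~777~5~
# @idx:       00 111 2
while ( $concat =~ /7+/g ) {
    say "match '$&' at offset $-[0] belongs to token $tokens[ $idx[ $-[0] ] ]";
}
```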

Edit: (1) replaced the separator character "|" with the more neutral "~", so it doesn't look like regex alternation; (2) added comments to the output section, so it's clearer that they are different runs.

Re^2: Multi-thread combining the results together
by Marshall (Canon) on Jul 25, 2019 at 10:52 UTC
    Interesting. Here is one example of what this does:
```
testing: "k5bai"
 0: ^.5BAI$
 1: ^K.BAI$
 2: ^K5.AI$
 3: ^K5B.I$
 4: ^K5BA.$
 5: .*5BAI\z
 6: ^K5ABI\z <- N2 error
 7: ^K5BIA\z <- N2 error
 8: ^K5IBA\z <- really bad
 9: ^K5AIB\z
10: ^KB5AI\z
11: ^K5BA\z
12: ^K5BI\z
13: ^K5AI\z
14: ^K5B\z
my regex = (^.5BAI$)|(^K.BAI$)|(^K5.AI$)|(^K5B.I$)|(^K5BA.$)|(.*5BAI\z)|(^K5ABI\z)|(^K5BIA\z)|(^K5IBA\z)|(^K5AIB\z)|(^KB5AI\z)|(^K5BA\z)|(^K5BI\z)|(^K5AI\z)|(^K5B\z)
```
    Instead of running the regex against each element of @tokens, I suspect that it would be faster to run it against a single string: the concatenation of all of the tokens.

    I haven't thought about this code for many moons. Time for a re-think.
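One way to test that suspicion is the core Benchmark module. A hedged sketch, with an invented token set and pattern rather than the actual call-sign data:

```perl
# Hedged sketch: timing the per-token loop against one pass over a
# concatenated string, using the core Benchmark module. The token set
# and the pattern are invented here.
use strict;
use warnings;
use Benchmark 'cmpthese';

my @tokens = map { sprintf 'K%04d', $_ } 1 .. 2000;
my $regex  = qr/^K1...$/m;          # /m so ^ and $ also work inside $concat
my $concat = join "\n", @tokens;    # newline as the "alien" separator

cmpthese( -1, {
    per_token => sub { my @r = grep { /$regex/ } @tokens },  # N engine starts
    one_pass  => sub { my @r = $concat =~ /$regex/g },       # 1 engine start
} );
```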

      If the original build_regex can return an expression with "start of string" or "end of string" markers, then, for my approach, these should of course be replaced with lookarounds for $sep.
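A minimal sketch of that replacement, assuming the '~' separator from the code above and an invented anchored pattern (^K5BA.$ becomes a pair of lookarounds for the separator when matching inside the concatenated string):

```perl
# Hedged sketch: rewriting an anchored pattern with lookarounds for $sep.
# The pattern ^K5BA.$ and the token list are invented for illustration.
use strict;
use warnings;
use feature 'say';

my $sep = '~';
# Original anchored form, valid only against a single token:
my $anchored  = qr/^K5BA.$/;
# Separator-aware form for use against the concatenated string:
my $in_concat = qr/(?<=\Q$sep\E)K5BA.(?=\Q$sep\E)/;

my $concat = join $sep, '', qw(K5BAI K5BAX XK5BAI), '';
my @hits;
push @hits, $& while $concat =~ /$in_concat/g;
say "@hits";    # K5BAI K5BAX -- XK5BAI correctly excluded
```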

      A huge regex with many alternations is different from what I suggested, and may suit your data/expected output better.

        ... "start of string" or "end of string" markers ... should be replaced ... with lookarounds for $sep ...

        But if $sep can be \n (newline), then the lookarounds become the built-in ^ and $ anchors (with the /m modifier asserted, of course), which I would expect to be significantly faster than a constructed lookaround. The
            my $concat = join $sep, '', @tokens, '';
        statement building the target string becomes
            my $concat = join $sep, @tokens;
        because ^ and $ always behave as expected at start/end of string.
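A small self-contained sketch of the newline variant (tokens and pattern invented): with "\n" as separator, no leading/trailing separators and no lookarounds are needed, just /m:

```perl
# Hedged sketch: with a newline separator, the built-in anchors do the
# work under /m. Tokens and pattern are invented for illustration.
use strict;
use warnings;
use feature 'say';

my @tokens = qw(K5BAI K5BAX XK5BAI W1AW);
my $concat = join "\n", @tokens;    # no leading/trailing separator needed

my @hits;
push @hits, $& while $concat =~ /^K5BA.$/mg;
say "@hits";    # K5BAI K5BAX
```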


        Give a man a fish:  <%-{-{-{-<