in reply to Multi-thread combining the results together

It may be obvious and you may already have considered this (if so, I'm sorry, skip what follows), but you are starting the regex engine 6+ billion times. If %result ends up relatively sparsely populated, and if tokens can be joined with a clearly "alien" separator symbol (or sequence) that prevents matching across token boundaries, then matching against a single concatenated string (so the regex engine starts only N times) can help. In the code below, if line A is un-commented, then block B is executed N*N = 1e6 times as expected, and each token "matches" all other tokens -- very uninteresting. Otherwise, with a pickier criterion for one token being related to another, your goal of "at least 3x faster" is easily achieved even before parallelization.

```perl
use strict;
use warnings;
use feature 'say';
use Data::Dump 'dd';
use Time::HiRes 'time';

my $N = 1000;
srand 123;
my @tokens = map { int rand 1_000_000 } 1 .. $N;

sub build_regex {
    # return qr/\d+/;                       # line A
    my $s = shift;
    my $d = substr $s, 0, 1;
    qr/[0-9]$d\d{0,9}?$d/
}

{   # case 1
    my $t = time;
    my %result;
    foreach my $token (@tokens) {
        my $regex = build_regex($token);
        my @line_results = grep { $_ ne $token and /$regex/ } @tokens;
        $result{$token} = [@line_results];
    }
    say time - $t;
}

{   # case 2
    my $t = time;
    my $count = 0;
    my $sep = '~';
    my $sep_len = length $sep;
    my @idx;
    for ( 0 .. $#tokens ) {
        my $L = length $tokens[ $_ ];
        @idx[ map { $sep_len + @idx + $_ } 0 .. $L - 1 ] = ( $_ ) x $L
    }
    my $concat = join $sep, '', @tokens, '';
    my %result;
    for my $i ( 0 .. $#tokens ) {
        my $token = $tokens[ $i ];
        my $regex = build_regex( $token );
        $result{ $token } = [];
        my $prev = -1;
        while ( $concat =~ /$regex/g ) {    # block B
            my $j = $idx[ $-[ 0 ]];
            push @{ $result{ $token }}, $tokens[ $j ]
                if $j != $i and $j != $prev;
            $prev = $j;
            $count ++;
        }
    }
    say time - $t;
    say $count;
}

__END__
# Output with "A" line un-commented
0.978141069412231
1.23276996612549
1000000

# Output with "A" line commented-out
0.648768901824951
0.150562047958374
78176
```
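For readers puzzled by the @idx bookkeeping in case 2, here is a minimal standalone illustration (token values invented) of how each character position of $concat is mapped back to the index of the token it belongs to, so that $-[0], the offset where a match started, identifies the owning token:

```perl
# Minimal illustration of the @idx position-to-token mapping.
use strict;
use warnings;
use feature 'say';

my @tokens = ( '42', '777', '5' );
my $sep = '~';
my $sep_len = length $sep;

my @idx;
for ( 0 .. $#tokens ) {
    my $L = length $tokens[$_];
    # Fill the slots this token will occupy in $concat with its index;
    # separator slots are left undef.
    @idx[ map { $sep_len + @idx + $_ } 0 .. $L - 1 ] = ($_) x $L;
}
my $concat = join $sep, '', @tokens, '';   # "~42~777~5~"

# position:  0123456789
# $concat:   ~42~777~5~
# @idx:       00 111 2
while ( $concat =~ /7+/g ) {
    say "match '$&' at offset $-[0] belongs to token $tokens[ $idx[ $-[0] ] ]";
}
```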

Edit: (1) replaced the separator character "|" with the more neutral "~", so it doesn't look like regex alternation; (2) added comments to the output section, so it's clearer that they are different runs.

Re^2: Multi-thread combining the results together
by Marshall (Canon) on Jul 25, 2019 at 10:52 UTC
    Interesting. Here is one example of what this does:
```
testing: "k5bai"
 0: ^.5BAI$
 1: ^K.BAI$
 2: ^K5.AI$
 3: ^K5B.I$
 4: ^K5BA.$
 5: .*5BAI\z
 6: ^K5ABI\z <- N2 error
 7: ^K5BIA\z <- N2 error
 8: ^K5IBA\z <- really bad
 9: ^K5AIB\z
10: ^KB5AI\z
11: ^K5BA\z
12: ^K5BI\z
13: ^K5AI\z
14: ^K5B\z
my regex = (^.5BAI$)|(^K.BAI$)|(^K5.AI$)|(^K5B.I$)|(^K5BA.$)|(.*5BAI\z)|(^K5ABI\z)|(^K5BIA\z)|(^K5IBA\z)|(^K5AIB\z)|(^KB5AI\z)|(^K5BA\z)|(^K5BI\z)|(^K5AI\z)|(^K5B\z)
```
    Instead of running the regex against each element of @tokens, I suspect that it would be faster to run it against a single string: the concatenation of all of the tokens.

    I haven't thought about this code for many moons. Time for a re-think.
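One way to test that suspicion is the core Benchmark module. A hedged sketch, with an invented token set and pattern rather than the actual call-sign data:

```perl
# Hedged sketch: timing the per-token loop against one pass over a
# concatenated string, using the core Benchmark module. The token set
# and the pattern are invented here.
use strict;
use warnings;
use Benchmark 'cmpthese';

my @tokens = map { sprintf 'K%04d', $_ } 1 .. 2000;
my $regex  = qr/^K1...$/m;          # /m so ^ and $ also work inside $concat
my $concat = join "\n", @tokens;    # newline as the "alien" separator

cmpthese( -1, {
    per_token => sub { my @r = grep { /$regex/ } @tokens },  # N engine starts
    one_pass  => sub { my @r = $concat =~ /$regex/g },       # 1 engine start
} );
```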

      If the original build_regex can return an expression with "start of string" or "end of string" markers, then, for my approach, these should of course be replaced with lookarounds for $sep.
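A minimal sketch of that replacement, assuming the '~' separator from the code above and an invented anchored pattern (^K5BA.$ becomes a pair of lookarounds for the separator when matching inside the concatenated string):

```perl
# Hedged sketch: rewriting an anchored pattern with lookarounds for $sep.
# The pattern ^K5BA.$ and the token list are invented for illustration.
use strict;
use warnings;
use feature 'say';

my $sep = '~';
# Original anchored form, valid only against a single token:
my $anchored  = qr/^K5BA.$/;
# Separator-aware form for use against the concatenated string:
my $in_concat = qr/(?<=\Q$sep\E)K5BA.(?=\Q$sep\E)/;

my $concat = join $sep, '', qw(K5BAI K5BAX XK5BAI), '';
my @hits;
push @hits, $& while $concat =~ /$in_concat/g;
say "@hits";    # K5BAI K5BAX -- XK5BAI correctly excluded
```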

      A huge regex with many alternations is different from what I suggested, and may suit your data/expected output better.

        ... "start of string" or "end of string" markers ... should be replaced ... with lookarounds for $sep ...

        But if $sep can be \n (newline), then the lookarounds become the built-in ^ and $ anchors (with the /m modifier asserted, of course), which I would expect to be significantly faster than a constructed lookaround. The
            my $concat = join $sep, '', @tokens, '';
        statement building the target string becomes
            my $concat = join $sep, @tokens;
        because ^ and $ always behave as expected at start/end of string.
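A small self-contained sketch of the newline variant (tokens and pattern invented): with "\n" as separator, no leading/trailing separators and no lookarounds are needed, just /m:

```perl
# Hedged sketch: with a newline separator, the built-in anchors do the
# work under /m. Tokens and pattern are invented for illustration.
use strict;
use warnings;
use feature 'say';

my @tokens = qw(K5BAI K5BAX XK5BAI W1AW);
my $concat = join "\n", @tokens;    # no leading/trailing separator needed

my @hits;
push @hits, $& while $concat =~ /^K5BA.$/mg;
say "@hits";    # K5BAI K5BAX
```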


        Give a man a fish:  <%-{-{-{-<