This may be obvious, and you may have already considered it (if so, I apologize; skip what follows), but you are starting the regex engine 6+ billion times. If %result ends up relatively sparsely populated, and if the tokens can be joined with a clearly "alien" separator symbol (or sequence) that prevents matches from spanning tokens, then matching against the concatenated string can help, because the regex engine then starts just N times. In the code below, if line A is un-commented, block B is executed N*N = 1e6 times, as expected, and each token "matches" all other tokens -- very uninteresting. Otherwise, with a pickier criterion for one token to be related to another, your goal of "at least 3x faster" is easily achieved even before parallelization.

use strict;
use warnings;
use feature 'say';
use Data::Dump 'dd';
use Time::HiRes 'time';

my $N = 1000;

srand 123;
my @tokens = map { int rand 1_000_000 } 1 .. $N;

sub build_regex { 

#    return qr/\d+/;    # line A

    my $s = shift;
    my $d = substr $s, 0, 1;
    qr/[0-9]$d\d{0,9}?$d/ 
}

{                       # case 1
my $t = time;

my %result;
foreach my $token (@tokens)
{
    my $regex = build_regex($token);
    my @line_results = grep {$_ ne $token and /$regex/ }@tokens;
    $result{$token} = [@line_results];
}
say time - $t;
}

{                       # case 2
my $t = time;
my $count = 0;

my $sep = '~';
my $sep_len = length $sep;
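my @idx;                # $idx[ offset in $concat ] = index of the token covering that offset
for ( 0 .. $#tokens ) {
    my $L = length $tokens[ $_ ];
    # scalar @idx is the offset just past the previous token (0 at the start);
    # adding $sep_len skips the separator that precedes the current token
    @idx[ map{ $sep_len + @idx + $_ } 0 .. $L - 1 ] = ( $_ ) x $L
}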
my $concat = join $sep, '', @tokens, '';

my %result;
for my $i ( 0 .. $#tokens ) {
    my $token = $tokens[ $i ];
    my $regex = build_regex( $token );
    
    $result{ $token } = [];

    my $prev = -1;      # token index of the previous match; avoids pushing the same token twice
    while ( $concat =~ /$regex/g ) {    # block B
        my $j = $idx[ $-[ 0 ]];
        push @{ $result{ $token }}, $tokens[ $j ] 
            if $j != $i and $j != $prev;
        $prev = $j;
        $count ++;
    }
}

say time - $t;
say $count;
}


__END__

# Output with "A" line un-commented
0.978141069412231
1.23276996612549
1000000

# Output with "A" line commented-out
0.648768901824951
0.150562047958374
78176

Edit: (1) replaced the separator character "|" with the more neutral "~", so it doesn't look like regex alternation; (2) added "comments" to the output section, to make it clearer that the two listings come from different runs.
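
The offset-to-token lookup that case 2 relies on may be easier to follow on a tiny input. The following is only a minimal sketch, not the benchmark above: the three tokens, the @owner array name and the "first digit repeats later in the same token" regex are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical toy data: three tokens joined with the '~' separator.
my @tokens = qw(121 45 3773);
my $sep    = '~';
my $concat = join $sep, '', @tokens, '';    # "~121~45~3773~"

# Map every character offset in $concat to the index of the token
# that covers it; separator positions stay undef.
my @owner;
my $pos = length $sep;                      # skip the leading separator
for my $i ( 0 .. $#tokens ) {
    my $len = length $tokens[$i];
    @owner[ $pos .. $pos + $len - 1 ] = ($i) x $len;
    $pos += $len + length $sep;
}

# A single scan over the whole string. \d never matches '~', so a match
# cannot span two tokens. $-[0] is the start offset of the current match.
my @hits;
while ( $concat =~ /(\d)\d*\1/g ) {
    push @hits, $tokens[ $owner[ $-[0] ] ];
}

print "@hits\n";                            # prints "121 3773"
```

The separator only does its job if no token can contain it and the regex cannot match it, which is why a character outside the tokens' alphabet is used here.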

In reply to Re: Multi-thread combining the results together by vr
in thread Multi-thread combining the results together by Marshall
