Okay. This is what I think I would try given your situation:
#! perl -slw
use strict;
use threads;
use threads::shared;
use threads::Q;                     ## Self-limiting sized queues.
use Time::HiRes qw[ time ];

sub uniq{ my %h; undef @h{ @_ }; keys %h }

my $start = time;

our $Q //= 100;                     ## The maximum size of the Qs
our $T //= 4;                       ## No of worker threads at both stages
our $F //= '*.txt';                 ## File selector for testing

my $semFIO :shared;                 ## serialise disk access

my $Qxmlin = threads::Q->new( $Q ); ## filenames to XML threads
my $Qitems = threads::Q->new( $Q ); ## extracted items from XML threads

async {                             ## "XML Parser" thread pool
    while( my $file = $Qxmlin->dq ) {
        my $xml = do{ lock $semFIO; local( @ARGV, $/ ) = $file; <> };
        $Qitems->nq( join $;, split ' : ', $_ )
            for split "\n", $xml;
    }
    $Qitems->nq( undef );
}->detach for 1 .. $T;

async {                             ## Q up filenames for XML processing pool
    $Qxmlin->nq( glob "sha/$F" );
    $Qxmlin->nq( (undef) x $T );
}->detach;

my @items;                          ## Non-shared storage for extracted items
for( 1 .. $T ) {                    ## Gather them all together
    push @items, $_ while defined( $_ = $Qitems->dq );
}
undef $Qxmlin; undef $Qitems;       ## Done with these.
print STDERR scalar @items;         ## Sanity check of items count.

my $Qcmp = threads::Q->new( $Q );   ## Item pairs for comparison in
my $Qsim = threads::Q->new( $Q );   ## Similar rated items out

async {                             ## similarities assessment thread pool
    while( my $work = $Qcmp->dq ) {
        my( $key1, $val1, $key2, $val2 ) = split $;, $work;
        my %bigrams1; undef @bigrams1{ uniq unpack '(A2)*', $key1 };
        my @bigrams2 = uniq unpack '(A2)*', $key2;
        my $count = grep exists( $bigrams1{ $_ } ), @bigrams2;
        my $sim = $count * 2 / ( keys( %bigrams1 ) + @bigrams2 );
        ## low value required to allow my test data to generate hits.
        $Qsim->nq( $work ) if $sim > 0.2;
    }
    $Qsim->nq( undef );
}->detach for 1 .. $T;

async {                             ## Q up the pairs for comparison
    for my $i1 ( 0 .. $#items ) {
        my $item1 = $items[ $i1 ];
        for my $i2 ( $i1 + 1 .. $#items ) {
            my $item2 = $items[ $i2 ];
            $Qcmp->nq( "$item1$;$item2" );
        }
    }
    $Qcmp->nq( (undef) x $T );
}->detach;

## gather together those that pass the criteria
## And do something with them.
for( 1 .. $T ) {
    while( my $sim = $Qsim->dq ) {
        print $sim;
    }
}

printf STDERR "With T:$T Q:$Q took %.3f s\n", time - $start;
The comments in this code are sparse, and the questions probably many. (It simulates your XML files using simple flat text files, each line holding a sha256_hex key and the number from which the sha256 was derived, separated by ' : '.) It is easier to answer your actual questions than to try to guess what they might be and answer them in comments.
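Since the least obvious part is probably the similarity test in the second pool, here it is pulled out as a standalone sub. Treat it as a sketch rather than part of the program above: the name bigram_sim and the sample strings are made up for illustration, and the guard for empty keys is an addition. As far as I can describe it, the measure is a Dice coefficient over the distinct two-character chunks that unpack '(A2)*' produces (consecutive, non-overlapping pairs).

```perl
use strict;
use warnings;
use Digest::SHA qw[ sha256_hex ];

## Distinct values of a list (order not preserved).
sub uniq { my %h; @h{ @_ } = (); keys %h }

## Hypothetical helper: Dice coefficient over the distinct 2-char chunks
## of two strings. 1 means identical chunk sets, 0 means none in common.
sub bigram_sim {
    my( $key1, $key2 ) = @_;
    my %bigrams1; @bigrams1{ uniq unpack '(A2)*', $key1 } = ();
    my @bigrams2 = uniq unpack '(A2)*', $key2;
    my $total = keys( %bigrams1 ) + @bigrams2;
    return 0 unless $total;            ## added guard: both keys empty
    my $count = grep exists( $bigrams1{ $_ } ), @bigrams2;
    return $count * 2 / $total;
}

## 'abcdef' -> ab cd ef; 'abcxef' -> ab cx ef; 2 shared out of 3 + 3.
printf "%.3f\n", bigram_sim( 'abcdef', 'abcxef' );   ## prints 0.667

## Two lines in the assumed test-data format: "sha256_hex : number"
my @lines = map { sha256_hex( $_ ) . " : $_" } 1, 2;
my( $k1, $v1 ) = split ' : ', $lines[0];
my( $k2, $v2 ) = split ' : ', $lines[1];
printf "%s vs %s: %.3f\n", $v1, $v2, bigram_sim( $k1, $k2 );
```

Only the keys take part in the comparison; the values just ride along so that the pair can be reported if it passes the threshold.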
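There are two tunable parameters: the size of the queues ($Q, default 100), which should be at least double the size of the pools; and the size of the pools ($T, default 4 threads). I've used the same numbers for both halves of the program, but you might want to try different values and tune them separately. Because of the -s switch on the shebang line, both (and the file selector, $F) can be set from the command line after the script name, e.g. -Q=200 -T=8.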
Take a look -- ask whatever questions arise :)
It uses my own threads::Q implementation of a self-limiting (size) queue:
That is pretty well exercised (except the q_nb(), which I've never had occasion to use for real) but is undocumented, hence not on CPAN. new() takes a single argument: the maximum number of items the Q can hold at any given time. dq() is dequeue(); nq() is enqueue(); cq() (see-queue) is pending().
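To make the idea of a self-limiting queue concrete, here is a minimal sketch built on threads::shared. It is not the threads::Q source: the package name BoundedQ is hypothetical, it covers only new(), nq(), dq() and cq() as described above, and it needs a perl built with ithreads.

```perl
use strict;
use warnings;
use threads;
use threads::shared;

{
    ## Hypothetical stand-in, NOT the real threads::Q implementation.
    package BoundedQ;
    use threads::shared;

    sub new {
        my( $class, $max ) = @_;
        my @q; share( @q );            ## storage visible to all threads
        return bless { q => \@q, max => $max }, $class;
    }

    sub nq {                           ## enqueue; blocks while the Q is full
        my $self = shift;
        my $q = $self->{q};
        lock( @$q );
        for my $item ( @_ ) {
            cond_wait( @$q ) while @$q >= $self->{max};
            push @$q, $item;
            cond_broadcast( @$q );     ## wake any waiting consumers
        }
    }

    sub dq {                           ## dequeue; blocks while the Q is empty
        my $self = shift;
        my $q = $self->{q};
        lock( @$q );
        cond_wait( @$q ) until @$q;
        my $item = shift @$q;
        cond_broadcast( @$q );         ## wake any waiting producers
        return $item;
    }

    sub cq {                           ## number of items currently pending
        my $q = $_[0]{q};
        lock( @$q );
        return scalar @$q;
    }
}

my $Q = BoundedQ->new( 4 );            ## never holds more than 4 items

my $producer = async {
    $Q->nq( $_ ) for 1 .. 20;
    $Q->nq( undef );                   ## end-of-data marker, as in the post
};

my @got;
while( defined( my $item = $Q->dq ) ) {
    push @got, $item;
}
$producer->join;

print "received ", scalar( @got ), " items; ", $Q->cq, " left in the Q\n";
## prints: received 20 items; 0 left in the Q
```

The producer blocks inside nq() whenever four items are pending, so memory use stays bounded no matter how far ahead of the consumer it could otherwise run.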
The benefit of a self-limiting queue is that it prevents the unrestricted growth that can occur when the producer runs faster than the consumer. A possible drawback is that it accepts only scalars or references to pre-shared structures. I consider this a positive: in my experiments it is faster and far less memory-hungry to pass compound data joined into a scalar and split it in the consumer than to copy the data into a shared structure and copy it back out again.
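The join-and-split approach can be shown in isolation. This is only a sketch: pack_fields and unpack_fields are hypothetical names (the program above does the join and split inline), and it assumes no field contains the $; character, which defaults to "\034".

```perl
use strict;
use warnings;

## Hypothetical helpers wrapping the inline join/split used in the post.
sub pack_fields   { join $;, @_ }              ## many fields -> one scalar
sub unpack_fields { split $;, $_[0], -1 }      ## -1 keeps trailing empties

## Producer side: each item is a key/value pair flattened to a scalar,
## and a unit of work is two such items flattened again.
my $item1 = pack_fields( 'deadbeef', 1 );
my $item2 = pack_fields( 'cafebabe', 2 );
my $work  = pack_fields( $item1, $item2 );

## Consumer side: one split recovers all four fields.
my( $key1, $val1, $key2, $val2 ) = unpack_fields( $work );
print "$key1 $val1 $key2 $val2\n";   ## prints: deadbeef 1 cafebabe 2
```

Because the separator is the same at both levels, the nested join flattens to four fields, which is what lets the comparison threads unpack a pair with a single split.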