#! perl -slw
use strict;
use threads;
use threads::shared;
use Thread::Queue;
use Text::Fuzzy;
our $semSTDIO :shared;
sub tprint {
    lock $semSTDIO;
    print @_;
}
sub worker {
    my( $Q, $file1 ) = @_;
    my @array = split "\n", $file1;
    while( my $work = $Q->dequeue ) {
        chomp $work; ## queued lines still carry their newline
        my $tf = Text::Fuzzy->new( $work );
        my $index = $tf->nearest( \@array );
        tprint( "$work\t$array[ $index ]" );
    }
}
our $T //= 4; ## -T=number of threads to use.
our $F1 //= 'big.file'; ## -F1=big.file ## the 500,000 line file
our $F2 //= 'small.file'; ## -F2=small.file ## the 500 line file
my $Q = new Thread::Queue;
my $file1 = do{ local( @ARGV, $/ ) = $F1; <> };
my @threads = map threads->new( \&worker, $Q, $file1 ), 1 .. $T;
$Q->enqueue( do{ local @ARGV = $F2; <> }, (undef) x $T );
$_->join for @threads;
Use as: thisScript -T=8 -F1=big.file -F2=small.file > results.file
Indeed.
Before embarking on this suggestion, it might also be useful to measure how long Text::Fuzzy currently takes to process a single input. If it turns out (as I frankly expect) that the time required is roughly 1/500th of the roughly 30 minutes total, that is, a few seconds per input, then I predict that your chances of serious improvement are quite good.
The “meat and potatoes” of this routine is a single C module which, at first blush, does not appear to have any “contentious globals.” And if the total size of the file to be processed is a mere 30 megabytes, then on a reasonably modern machine the odds are excellent that the entire file will be moved into operating-system buffer memory and will stay there, avoiding actual disk I/O for all future users. Thus, as long as you take due care to be sure that the total memory requirement of all threads is comfortably less than the available memory on the machine, you ought to be able to reap some very nice improvements here.
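A minimal sketch of that measurement, assuming the large file is named big.file as in the script above. The helper names time_call and projected_total and the sample pattern are hypothetical; the real lookup runs only when Text::Fuzzy and the file are both present.

```perl
use strict;
use warnings;
use Time::HiRes qw( time );

# Run a code reference once; return ( elapsed seconds, its scalar result ).
sub time_call {
    my( $code ) = @_;
    my $start  = time;
    my $result = $code->();
    return ( time - $start, $result );
}

# Project the cost of a whole run from one measured lookup.
sub projected_total {
    my( $seconds_per_lookup, $lookups ) = @_;
    return $seconds_per_lookup * $lookups;
}

# Attempt the real measurement only if Text::Fuzzy and the file are available.
if( eval { require Text::Fuzzy; 1 } and -e 'big.file' ) {
    open my $fh, '<', 'big.file' or die "big.file: $!";
    chomp( my @lines = <$fh> );
    close $fh;

    my $tf = Text::Fuzzy->new( 'sample pattern' );   # hypothetical pattern
    my( $secs, $index ) = time_call( sub { $tf->nearest( \@lines ) } );
    printf "one lookup: %.3fs; projected for 500 lookups: %.1f minutes\n",
        $secs, projected_total( $secs, 500 ) / 60;
}
```

If the projected figure lands near the total you are seeing now, the run time is dominated by the lookups themselves rather than by file handling.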
Be sure that each thread, when it has finished doing its appointed unit of work, releases all the memory that it has obtained in the process of doing it. Experiment, also, with the total number of simultaneous threads that you launch. There will be a definite “sweet spot.” As you increase the number of threads, the CPU utilization ought to rise much closer to 100%, but virtual-memory paging activity should not experience any sort of sustained increase.
Before embarking upon any of this, however, I would suggest simply running two instances of the existing Perl script, in two different “command-line windows” at the same time. The elapsed (wall-clock) time for both to finish, as reported by the time command in Unix/Linux, should be “noticeably less than” twice the amount of time that the same command took when run by itself. (Duly allowing for the fact that, in this case, they are of course separate processes, not just threads.) If this appears to be the case, then the prospects of multithreading are probably worth pursuing.
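The same two-instance comparison can be scripted rather than done by hand. This is only a sketch: wall_time_for is a hypothetical helper, the command to run is taken from the command line, and it assumes a system where fork and exec behave as on Unix/Linux.

```perl
use strict;
use warnings;
use Time::HiRes qw( time );

# Launch $n copies of @cmd at once; return wall-clock seconds until all exit.
sub wall_time_for {
    my( $n, @cmd ) = @_;
    my $start = time;
    my @pids;
    for ( 1 .. $n ) {
        my $pid = fork;
        die "fork failed: $!" unless defined $pid;
        if( $pid == 0 ) {
            # Child: become the command being timed.
            exec @cmd or die "exec failed: $!";
        }
        push @pids, $pid;
    }
    waitpid $_, 0 for @pids;    # parent waits for every copy
    return time - $start;
}

# Time one copy, then two copies side by side, of the command in @ARGV.
if( @ARGV ) {
    my $one = wall_time_for( 1, @ARGV );
    my $two = wall_time_for( 2, @ARGV );
    printf "one copy: %.1fs, two copies together: %.1fs (ratio %.2f)\n",
        $one, $two, $two / $one;
}
```

Pass the existing command as the arguments to this script. A ratio well below 2 suggests the machine has spare capacity for parallel work; a ratio near 2 suggests the copies are simply queuing behind each other.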
How big is your file? If you have enough memory in your computer, then you can simply read the file into an array and then use the 'nearest' method.
chomp( my @lines = <STDIN> );    # the big file, one entry per line
foreach my $pattern (@array1) {  # @array1 holds the search patterns
    my $tf = Text::Fuzzy->new ($pattern);
    my $index = $tf->nearest( \@lines ); # index of the closest line
    print "$pattern\t$lines[$index]\n";
}