Multithreading for perl code

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: Multithreading for perl code by QM (Parson) on Apr 08, 2015 at 10:43 UTC
Your loop appears to be the wrong way round: it reads the file 500 times (once for each element of a 500-element array). Why not read the file line by line, and check each array element against each line? Does Text::Fuzzy allow for that? -QM
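A minimal sketch of the single-pass idea above, assuming the patterns fit in memory. The `edit_distance` helper is a plain-Perl Levenshtein stand-in written only for illustration (it is not Text::Fuzzy), and `best_matches` is a hypothetical name. With Text::Fuzzy installed, its `distance` method could take the helper's place.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Plain-Perl Levenshtein distance (illustrative stand-in, not Text::Fuzzy).
sub edit_distance {
    my ( $s, $t ) = @_;
    my @prev = 0 .. length $t;
    for my $i ( 1 .. length $s ) {
        my @cur = ($i);
        for my $j ( 1 .. length $t ) {
            my $cost = substr( $s, $i - 1, 1 ) eq substr( $t, $j - 1, 1 ) ? 0 : 1;
            push @cur, min( $prev[$j] + 1, $cur[ $j - 1 ] + 1, $prev[ $j - 1 ] + $cost );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Read the big file once from a filehandle; keep the closest line per pattern.
sub best_matches {
    my ( $fh, @patterns ) = @_;
    my %best;    # pattern => [ distance, line ]
    while ( my $line = <$fh> ) {
        chomp $line;
        for my $p (@patterns) {
            my $d = edit_distance( $p, $line );
            $best{$p} = [ $d, $line ] if !$best{$p} || $d < $best{$p}[0];
        }
    }
    return \%best;
}

# Usage, with an in-memory filehandle standing in for the large file.
open my $fh, '<', \"apple\nbanana\ncherry\n" or die $!;
my $best = best_matches( $fh, 'appel', 'banan' );
print "$_\t$best->{$_}[1]\n" for sort keys %$best;
```

Whether this beats calling `nearest` on an in-memory array is something to measure rather than assume. The point of the sketch is only that the large file is read once.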
Re: Multithreading for perl code by BrowserUk (Patriarch) on Apr 08, 2015 at 13:52 UTC
Try it this way (note: untested):
```perl
#! perl -slw
use strict;
use threads;
use threads::shared;
use Thread::Queue;
use Text::Fuzzy;

our $semSTDIO :shared;

## Serialise output so lines from different threads do not interleave.
sub tprint { lock $semSTDIO; print @_; }

sub worker {
    my( $Q, $file1 ) = @_;
    my @array = split "\n", $file1;
    ## An undef item marks the end of the work for this thread.
    while( defined( my $work = $Q->dequeue ) ) {
        my $tf = Text::Fuzzy->new( $work );
        my $index = $tf->nearest( \@array );
        tprint( "$work\t$array[ $index ]" );
    }
}

our $T  //= 4;             ## -T=number of threads to use.
our $F1 //= 'big.file';    ## -F1=big.file   ## the 500,000 line file
our $F2 //= 'small.file';  ## -F2=small.file ## the 500 line file

my $Q = new Thread::Queue;
my $file1 = do{ local( @ARGV, $/ ) = $F1; <> };   ## slurp the big file

my @threads = map threads->new( \&worker, $Q, $file1 ), 1 .. $T;

## -l does not chomp input without -n/-p, so strip the newlines here.
chomp( my @patterns = do{ local @ARGV = $F2; <> } );
$Q->enqueue( @patterns, (undef) x $T );

$_->join for @threads;
```
Use as: `thisScript -T=8 -F1=big.file -F2=small.file > results.file`
Re^2: Multithreading for perl code by locked_user sundialsvc4 (Abbot) on Apr 09, 2015 at 01:15 UTC
Indeed. Before embarking on this suggestion, it might also be useful to `time` how long Text::Fuzzy currently takes to process any one file. If it turns out that (as I frankly expect ...) the time required is “more-or-less 1/500th of more-or-less 30 minutes,” then I predict that your chances of serious improvement are quite good. The “meat and potatoes” of this routine are a single “C” module which, from the look of things, does not appear (to me, at first blush) to have any “contentious globals.”
And, if the total size of the file to be processed is “a mere ...” 30 megabytes, on a reasonably modern machine, the odds are excellent that the entire file will be moved into operating-system buffer memory and that it will stay there ... avoiding actual disk I/O for all future users. Thus, as long as(!!) you take due care to be sure that the total memory requirement of all threads is comfortably less than the available memory on the machine, you ought to be able to reap some very nice improvements here. Be sure that each thread, when it has finished doing its appointed unit of work, releases all the memory that it has obtained in the process of doing it.
Experiment, also, with the total number of simultaneous threads that you launch. There will be a definite “sweet spot.” As you increase the number of threads, the CPU utilization ought to rise much closer to 100%, but virtual-memory paging activity should not experience any sort of sustained increase.
Before embarking upon any of this, however, I would suggest simply running two instances of the existing Perl script, in two different “command-line windows” at the same time. The elapsed time for the pair to finish, as reported by the `time` command (in Unix/Linux), should be “noticeably less than” twice the amount of time that the same command took when run by itself. (Duly allowing for the fact that, in this case, they are of course separate processes, not just threads.)
If this appears to be the case, then the prospects of multithreading are probably worth pursuing.
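The measurement suggested above could be sketched as follows, assuming only the core Time::HiRes module. `time_it` is a hypothetical helper and the workload is a placeholder. In practice the code reference would wrap one `nearest` call over the lines of the large file, and multiplying the result by the number of patterns gives a rough estimate of the single-threaded run time.

```perl
use strict;
use warnings;
use Time::HiRes ();

# Run a code reference once and return the elapsed wall-clock seconds.
sub time_it {
    my ($code) = @_;
    my $start = Time::HiRes::time();
    $code->();
    return Time::HiRes::time() - $start;
}

# Placeholder workload; replace with a single Text::Fuzzy lookup.
my $elapsed = time_it( sub { my $n = 0; $n += $_ for 1 .. 1_000_000; return $n; } );

my $patterns = 500;    # number of search patterns mentioned in the thread
printf "one lookup: %.3fs, estimated total: %.1fs\n",
    $elapsed, $elapsed * $patterns;
```

If the estimate lands near the observed total run time, the work is dominated by the lookups themselves, which is the case where splitting them across threads has a chance of helping.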
Re: Multithreading for perl code by pme (Monsignor) on Apr 08, 2015 at 10:38 UTC
How big is your file? If you have enough memory in your computer, you can simply read the file into an array and then use the 'nearest' method, which returns the index of the closest element.
```perl
chomp( my @lines = <STDIN> );              # read the whole file once
foreach my $pattern (@array1) {            # @array1: your list of patterns
    my $tf    = Text::Fuzzy->new($pattern);
    my $index = $tf->nearest( \@lines );   # index of the closest line
    print "$pattern\t$lines[$index]\n";
}
```
Re^2: Multithreading for perl code by Anonymous Monk on Apr 08, 2015 at 11:10 UTC
Hi, I have tried your idea but there is no difference. The file is about 30 MB of text. Thanks
Re^3: Multithreading for perl code by Lotus1 (Vicar) on Apr 08, 2015 at 13:20 UTC
Reading a 30 MB file once instead of 500 times should be quicker. How about posting what you tried? Maybe there is something you missed in your implementation.