Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a Perl script that, for each sentence in @array1, tries to find the nearest sentence in the file $f1. @array1 contains about 500 lines, while $f1 has about 500,000 lines. Could you please help me multithread it? It currently runs for about 30 minutes. Here is my script:

for my $i (0 .. $#array1) {
    my $tf    = Text::Fuzzy->new ($array1[$i]);
    my $fuzzy = $tf->scan_file ($f1);    # find the nearest line by scanning the file
    print "$array1[$i]\t$fuzzy\n";
}

Thanks for your time

Re: Multithreading for perl code
by QM (Parson) on Apr 08, 2015 at 10:43 UTC
    Your loop appears to be the wrong way round: it reads the file 500 times (once for each of the 500 array elements). Why not read the file line by line, and check each array element against every line?

    Does Text::Fuzzy allow for that?

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of
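
As a rough illustration of the inverted loop suggested above, here is a minimal sketch that reads the file once and keeps the best match so far for every query sentence. It is only an assumption about how this could be arranged: the helper names edit_distance and nearest_in_one_pass are hypothetical, and a plain Levenshtein function stands in for Text::Fuzzy's own metric so that the sketch needs no non-core modules.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Plain Levenshtein edit distance (hypothetical stand-in for Text::Fuzzy's metric).
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));
    for my $i (1 .. length($s)) {
        my @cur = ($i);
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @cur, min($prev[$j] + 1, $cur[$j - 1] + 1, $prev[$j - 1] + $cost);
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Single pass over the file handle: every line is compared with every
# query, and the closest line seen so far is remembered per query.
sub nearest_in_one_pass {
    my ($fh, @queries) = @_;
    my @best;    # $best[$i] = { line => ..., dist => ... } for $queries[$i]
    while (defined(my $line = <$fh>)) {
        chomp $line;
        for my $i (0 .. $#queries) {
            my $d = edit_distance($queries[$i], $line);
            if (!defined($best[$i]{dist}) || $d < $best[$i]{dist}) {
                $best[$i] = { line => $line, dist => $d };
            }
        }
    }
    return @best;
}
```

Called as nearest_in_one_pass($fh, @array1) with $fh opened on the big file, this does the same number of comparisons as the original loop but reads the file only once. Whether that helps depends on how much of the 30 minutes is I/O rather than distance computation.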

Re: Multithreading for perl code
by BrowserUk (Patriarch) on Apr 08, 2015 at 13:52 UTC

    Try it this way (Note: untested):

    #! perl -slw
    use strict;
    use threads;
    use threads::shared;
    use Thread::Queue;
    use Text::Fuzzy;

    our $semSTDIO :shared;
    sub tprint { lock $semSTDIO; print @_; }    ## serialise output across threads

    sub worker {
        my( $Q, $file1 ) = @_;
        my @array = split "\n", $file1;
        ## test definedness so an empty line or '0' does not end the loop early
        while( defined( my $work = $Q->dequeue ) ) {
            my $tf = Text::Fuzzy->new( $work );
            my $index = $tf->nearest( \@array );
            tprint( "$work\t$array[ $index ]" );
        }
    }

    our $T  //= 4;             ## -T=number of threads to use.
    our $F1 //= 'big.file';    ## -F1=big.file   ## the 500,000 line file
    our $F2 //= 'small.file';  ## -F2=small.file ## the 500 line file

    my $Q = new Thread::Queue;
    my $file1 = do{ local( @ARGV, $/ ) = $F1; <> };    ## slurp the big file once
    my @threads = map threads->new( \&worker, $Q, $file1 ), 1 .. $T;

    ## -l chomps input only under -n/-p, so strip the newlines here
    chomp( my @work = do{ local @ARGV = $F2; <> } );
    $Q->enqueue( @work, (undef) x $T );    ## one undef per thread ends its loop
    $_->join for @threads;

    Use as: thisScript -T=8 -F1=big.file -F2=small.file > results.file



      Indeed.

      Before embarking on this suggestion, it would also be useful to measure how long Text::Fuzzy currently takes to scan the file once, for a single sentence. If that turns out (as I frankly expect) to be roughly 1/500th of the 30 minutes, i.e. about 3.6 seconds per scan, then I predict that your chances of serious improvement are quite good.

      The core of this routine is a single C module which, at first glance, does not appear to rely on any contended global state. And if the file to be processed is a mere 30 megabytes, then on a reasonably modern machine the odds are excellent that the entire file will be held in the operating system's buffer cache and stay there, avoiding actual disk I/O for all later readers. Thus, as long as you take care that the total memory requirement of all threads is comfortably less than the memory available on the machine, you ought to be able to reap some very nice improvements here.

      Be sure that each thread, once it has finished its unit of work, releases all the memory it obtained while doing it. Experiment, also, with the number of threads that you launch simultaneously: there will be a definite sweet spot. As you increase the number of threads, CPU utilization ought to rise much closer to 100%, but virtual-memory paging activity should not show any sustained increase.

      Before embarking upon any of this, however, I would suggest simply running two instances of the existing Perl script at the same time, in two different command-line windows. The elapsed time reported by the time command (on Unix/Linux) for the pair running concurrently should be noticeably less than twice the time the same command took when run by itself. (Allowing for the fact that, in this case, they are separate processes, not threads.) If that is the case, then multithreading is probably worth pursuing.
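
The measurement suggested in this reply could be sketched roughly as follows, using only the core Time::HiRes module. The helper names time_once and estimate_total are hypothetical, and the Text::Fuzzy call in the usage comment is simply the one from the original script. This is a rough gauge under those assumptions, not a proper benchmark.

```perl
use strict;
use warnings;
use Time::HiRes ();

# Run a code reference once and return the elapsed wall-clock seconds.
sub time_once {
    my ($code) = @_;
    my $start = Time::HiRes::time();
    $code->();
    return Time::HiRes::time() - $start;
}

# Extrapolate the cost of one scan to the full set of query sentences.
sub estimate_total {
    my ($seconds_per_scan, $query_count) = @_;
    return $seconds_per_scan * $query_count;
}

# Possible usage with the variables from the original script:
#   my $t = time_once(sub { Text::Fuzzy->new($array1[0])->scan_file($f1) });
#   printf "one scan: %.2fs, estimated total: %.0fs\n",
#       $t, estimate_total($t, scalar @array1);
```

If the estimate comes out near the observed 30 minutes, the run time is dominated by the 500 individual scans, and spreading them across threads or processes is worth trying.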

Re: Multithreading for perl code
by pme (Monsignor) on Apr 08, 2015 at 10:38 UTC
    How big is your file? If you have enough memory in your computer, you can simply read the file into an array and then use the 'nearest' method.
    chomp( my @lines = <STDIN> );    # read the whole file into memory once
    foreach my $pattern (@array1) {
        my $tf    = Text::Fuzzy->new ($pattern);
        my $index = $tf->nearest( \@lines );    # in scalar context, nearest returns an index into @lines
        print "$pattern\t$lines[$index]\n";
    }

      Hi,

      I have tried your idea but there is no difference. The file is about 30 MB of text.

      Thanks

        Reading a 30 MB file once instead of 500 times should be quicker. How about posting what you tried? Maybe you missed something in your implementation.