Boetsie has asked for the wisdom of the Perl Monks concerning the following question:

Dear PerlMonks, I have a simple question. I have some Perl code which I want to make faster. My thought was to multithread some of it; however, I have not succeeded yet. In addition, I don't know whether my code can be threaded at all, and whether doing so would have any benefit. Part of my code looks like this, for example:
while (<FILE>) {
    # DO STUFF FOR EACH LINE (e.g. search for specific characters within the line)
}
which I want to multithread:
while (<FILE>) {
    # DO STUFF FOR EACH LINE using multithreading
}
My question is how to do this, and whether it would have any benefit in terms of speed. My files are very large, usually several gigabases.

Kind regards, Boetsie

Replies are listed 'Best First'.
Re: threading a perl script
by BrowserUk (Patriarch) on Apr 22, 2011 at 16:16 UTC

    Contrary to popular opinion, there is some scope for performance gains through threading with this pattern of application usage.

    Whether those performance gains are realisable, or worth the effort of achieving them, depends entirely upon what DO STUFF FOR EACH LINE actually consists of.

    For example, in the following, the one-liner simply reads a 1GB/16 million line file with no further processing. This forms a base line for how fast you can get the data from disk into memory:

    C:\test>timeit \perl64\bin\perl.exe -nle1 1GB.dat
    Took: 9.687471720 seconds

    C:\test>timeit \perl64\bin\perl.exe -nle1 1GB.dat
    Took: 9.544258000 seconds

    C:\test>timeit \perl64\bin\perl.exe -nle1 1GB.dat
    Took: 9.708372520 seconds

    As you can see, on my system that baseline is fairly consistent at just under 10 seconds.

    The following reads the same file, but this time performs about the simplest search and action possible:

    C:\test>timeit \perl64\bin\perl.exe -nle"/\*/ and ++$c }{ print $c" 1GB.dat
    10217571
    Took: 11.682412240 seconds

    C:\test>timeit \perl64\bin\perl.exe -nle"/\*/ and ++$c }{ print $c" 1GB.dat
    10217571
    Took: 11.904963960 seconds

    C:\test>timeit \perl64\bin\perl.exe -nle"/\*/ and ++$c }{ print $c" 1GB.dat
    10217571
    Took: 12.945696440 seconds

    C:\test>timeit \perl64\bin\perl.exe -nle"/\*/ and ++$c }{ print $c" 1GB.dat
    10217571
    Took: 12.128623440 seconds

    That is a consistent 2 to 2.5 seconds of processing time over the baseline. In theory, if you could split that processing across 2 cores, you could reduce the runtime by around a second.

    And that is a conservative estimate, because most of the time attributed to IO is time spent waiting with an idle CPU. So, if you could use that idle CPU to process the current record whilst the system fetches the next, you could (theoretically) get the elapsed time back down close to the IO baseline and save the full 2 to 2.5 seconds.

    That said, achieving that using Perl's current threading mechanisms--it is the state-sharing mechanisms that are the bottleneck--is very difficult, bordering on the impossible. It is quite difficult to achieve in any language, not just Perl.

    And achieving it requires making a very simple program considerably more complicated. So you have to ask yourself: is an (at best) 2.5-second saving on 12.5 seconds worth that effort and complication? For the simple example above, almost certainly not. For a 20% saving to be of significant value, the file would have to be pretty big.

    But, if the processing of each line is substantially more complex, sufficient to raise the IO to cpu ratio significantly, and the files were large enough to make the overlapping of those two components worth the effort involved, then there is some scope for doing that using threading.
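
    To make that overlap concrete, here is a minimal sketch (not a recommendation, and not BrowserUk's code: the file name and the per-line test are borrowed from the example above, the rest is assumed) of a reader feeding a worker thread through a Thread::Queue:

    use strict;
    use warnings;
    use threads;
    use Thread::Queue;

    my $q = Thread::Queue->new;

    # Worker: consume lines and do the (cheap) per-line test.
    my $worker = threads->create( sub {
        my $count = 0;
        while ( defined( my $line = $q->dequeue ) ) {
            ++$count if $line =~ /\*/;
        }
        return $count;
    } );

    # Reader: nothing but IO; hand each line to the worker.
    open my $fh, '<', '1GB.dat' or die "Couldn't read '1GB.dat': $!";
    $q->enqueue( $_ ) while <$fh>;
    $q->enqueue( undef );    # end-of-input marker

    print $worker->join, "\n";

    With per-line work this cheap, the cost of pushing every line through the shared queue will usually swallow the saving; that is exactly the state-sharing bottleneck described above.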

    Mind you, it would be simpler still if Perl gave us access to the asynchronous/overlapped IO facilities available in most modern OSs, but that is a different discussion.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Good message, pleasant reading.

      Now, given that a 10-second run yields only about a 2-second saving from parallelizing the algorithm with threads, it makes me think that the niche for threads is even narrower than I initially thought.

      A threaded perl is some 10-15% slower than an unthreaded one, and we pay that price every time, just for the eventual possibility of saving 20% of the runtime?
      (And that only after reorganizing the flow of the program, which adds some 80% or more complexity :) )

      Add to this that most GUI libraries (actually, all GUI libraries) are also single-threaded.

      Unthreaded perl rulez! :)

      Vadim.

        If that is all you took from my post, then you shouldn't be celebrating.

        If all the programs you write do nothing more complicated than wc or fgrep; if in your world a 1GB file represents nothing more important than say 1 day's twaterrings, and you only need to count the number of '!'s used; if between reading and writing your important data you need to do nothing of any consequence; then stick with your un-threaded perl, because it can do nothing very, very quickly.

        On the other hand, if you are (say) a bio-geneticist. And that 1GB of data represents 10,000 x 100k base sequences, each of which need to be fuzzy matched at every offset, against each of 25,000 x 25-base sub-sequences; a process that on a single core takes a week or more, then you'll be rather more interested in using perl's threading to reduce that to under a day using a commodity 8-core box. And even more grateful for that threading when you're given access to the departmental 128-core processor, and your script completes in under 2 hours with nothing more than a change of command line argument.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: threading a perl script
by moritz (Cardinal) on Apr 22, 2011 at 15:27 UTC

    If your operations are simple string operations, threading or any kind of parallelism will not help at all, because IO takes much more time than the actual processing.

    So get a faster hard drive, or, even better, an SSD.

Re: threading a perl script
by igelkott (Priest) on Apr 22, 2011 at 15:32 UTC

    Ignoring the question of whether you should use threads, see threads for how you could use them.
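
    For instance, a minimal create/join sketch (the body of the sub is a placeholder):

    use threads;

    # Run some work in a separate thread, then wait for its result.
    my $thr = threads->create( sub {
        # ... work to do in parallel ...
        return 'done';
    } );
    my $result = $thr->join;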

Re: threading a perl script
by Corion (Patriarch) on Apr 22, 2011 at 15:27 UTC

    The easiest way would be to process more than one file at a time, by launching your program several times, for example through runN.

    Other than that, the processing your program does within the loop is most likely dwarfed by the time needed to read the next line. Adding more threads there won't speed it up.

    Maybe you can reduce the time needed for IO by compressing your input files and reading them through a gzip (or bzip2) pipe:

    open my $fh, "gzip -cd '$file' |"
        or die "Couldn't read '$file': $!";
    while (<$fh>) {
        ...
    }

    This shifts the time spent reading the data out of your "processing" loop to when the data arrives from the pipe, and less data has to come off the disk. Most likely, you won't gain much unless you need to read the data more than once. In that case, creating an index file or storing the data in a database may also speed things up.
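
    A rough sketch of the index-file idea (the file name and the one-offset-per-line format are invented): a first pass records where every line starts, so that later runs can seek straight to any line instead of re-reading everything before it:

    use strict;
    use warnings;

    my $file = 'input.dat';    # hypothetical input

    # Build the index once: one starting byte offset per line.
    open my $in,  '<', $file       or die "Couldn't read '$file': $!";
    open my $idx, '>', "$file.idx" or die "Couldn't write '$file.idx': $!";
    my $offset = 0;
    while ( <$in> ) {
        print {$idx} $offset, "\n";
        $offset = tell $in;
    }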

    Update: Fixed link to runN

Re: threading a perl script
by educated_foo (Vicar) on Apr 22, 2011 at 15:41 UTC
    See Perl Tops "Wide Finder" Results for a discussion of parallelizing some line-by-line processing using processes rather than threads. Unless you're on Windows, processes are usually easier.
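
    As a rough illustration of the processes approach (a sketch only, not the Wide Finder code; the worker count and file name are invented), the parent can read the file and fan lines out to child processes over pipes:

    use strict;
    use warnings;

    my $workers = 4;
    my @pipes;

    for ( 1 .. $workers ) {
        pipe my $read, my $write or die "pipe: $!";
        my $pid = fork;
        die "fork: $!" unless defined $pid;
        if ( $pid == 0 ) {           # child
            close $write;
            close $_ for @pipes;     # write ends inherited from earlier children
            while ( <$read> ) {
                # DO STUFF FOR EACH LINE
            }
            exit 0;
        }
        close $read;                 # parent keeps only the write end
        push @pipes, $write;
    }

    open my $fh, '<', 'input.dat' or die "Couldn't read 'input.dat': $!";
    my $i = 0;
    print { $pipes[ $i++ % $workers ] } $_ while <$fh>;

    close $_ for @pipes;             # children see EOF and finish
    1 while wait != -1;              # reap them all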

      I'm not really sure how "using processes" is hard on Windows. Just system (resp. system(1, ...)) launches many processes quite fast.
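
      For example (a sketch; the worker script name is made up), system(1, ...) on Windows spawns a process without waiting and returns a process ID that waitpid can reap later:

      my @files = glob '*.dat';
      my @pids;
      for my $file ( @files ) {
          # $^X is the perl binary currently running
          push @pids, system( 1, $^X, 'process_one_file.pl', $file );
      }
      waitpid $_, 0 for @pids;    # wait for all of them to finish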

Re: threading a perl script
by anonymized user 468275 (Curate) on Apr 22, 2011 at 16:52 UTC
    There would be no point in having lots of threads per page of buffered I/O fetched internally by the operating system; if anything, this would be slower. The default buffered I/O page size on unix is 4096 bytes, but processing a 4096-byte buffer is apt to take less time than reading it from disk. So the threads would have to wait for each other anyway if they are allocated different pages from the same I/O stream (no gain there). Unless an extra pipe is inserted between the file I/O and the process, in which case the I/O system can make its best effort to pump data down the pipe while your threads pick up 4K chunks and run with them, freeing the pipe more regularly for new data.

    But threads each seeking (see http://perldoc.perl.org/functions/seek.html) to a different 4096-byte boundary of a shared filehandle need not work as expected. Forks would at least avoid the potential competition for internal I/O resources that I fear threads would encounter, so forks seem the more hopeful scenario; see the sketch below. But, given that forks carry more process overhead, a trade-off multiplier (greater than 1) of how many 4K pages per fork is optimal needs to be calculated. That might require experimentation.
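
    A sketch of the fork-per-chunk idea (file name and worker count invented; each child seeks to its own byte range and aligns itself on line boundaries):

    use strict;
    use warnings;

    my $file    = 'input.dat';
    my $workers = 4;
    my $size    = -s $file;
    my $chunk   = int( $size / $workers ) + 1;

    for my $i ( 0 .. $workers - 1 ) {
        my $pid = fork;
        die "fork: $!" unless defined $pid;
        next if $pid;                       # parent: keep spawning

        open my $fh, '<', $file or die "Couldn't read '$file': $!";
        my $start = $i * $chunk;
        my $end   = $start + $chunk;
        seek $fh, $start, 0;
        <$fh> if $start;                    # drop partial line; the previous child owns it

        while ( <$fh> ) {
            # DO STUFF FOR EACH LINE
            last if tell( $fh ) > $end;     # passed our boundary: stop
        }
        exit 0;
    }
    1 while wait != -1;                     # reap all children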

    One world, one people

Re: threading a perl script
by Anonymous Monk on Apr 25, 2011 at 14:51 UTC
    "Don't diddle code to make it faster .. choose a better algorithm." -- The Elements of Programming Style

      And if you're already using the best algorithm available and it is still too slow?

        Thank you all for the kind replies. I'll not add threading to my script for each line I read in. However, I'll try to process multiple files in parallel, since I have multiple large files that all need to be processed the same way. I think I can thread those together, am I right?

        Next question: how can I thread, for example, 3 files, wait until one is finished, and then let the next file be processed? In other words, how can I always process, say, 3 files at the same time?

        Kind regards, Boetsie
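
        (A minimal sketch of one way to do that, assuming threads and Thread::Queue and a made-up file list: start 3 workers that each pull file names from a shared queue until it is drained.)

        use strict;
        use warnings;
        use threads;
        use Thread::Queue;

        my @files = glob '*.dat';           # hypothetical file list

        my $q = Thread::Queue->new( @files );
        $q->enqueue( (undef) x 3 );         # one end-marker per worker

        # Exactly 3 files are in flight at any time: each worker
        # takes the next file as soon as it finishes its current one.
        my @workers = map {
            threads->create( sub {
                while ( defined( my $file = $q->dequeue ) ) {
                    open my $fh, '<', $file or die "Couldn't read '$file': $!";
                    while ( <$fh> ) {
                        # DO STUFF FOR EACH LINE
                    }
                }
            } );
        } 1 .. 3;

        $_->join for @workers;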