Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, I am here to seek your wisdom in the following matter. I am developing a program which reads large input files (up to 2 GB each, up to a maximum of 5 files, each containing the same type of data) and splits them into 18 different parts, with the relevant info going to the relevant file. There can be up to 5 files of the type I am looking at. As of now it is a non-threaded process, and I am thinking about making it multi-threaded to speed up the processing. However, when I tested the potential benefit with the test program below that uses threads (it reads the input files and outputs slightly manipulated contents to output files), it in fact showed me that using threads has even slowed the process down. Env: Perl 5.8.8 build 824, Windows XP. Code:
#!/usr/bin/perl
use threads;
use threads::shared;    # needed for the :shared attribute below
use Benchmark qw(:all);

my $line_var :shared = 0;

# Threaded version: each thread reads one input file and writes one output file.
sub main_func {
    my ($tid, $in_fh, $out_fh, $start, $stop) = @_;
    # Synchronised block
    ++$line_var;
    while (<$in_fh>) {
        print $out_fh " LineVar.. $line_var\t" . $_;
    }
    return $tid;
}

sub super_main {
    open(OUTFH1,  "< out1.txt");
    open(OUTFH2,  "< out2.txt");
    open(OUTFH3,  "< out3.txt");
    open(OUTFH4,  "< out4.txt");
    open(OUTFHO1, "> outO1.txt");
    open(OUTFHO2, "> outO2.txt");
    open(OUTFHO3, "> outO3.txt");
    open(OUTFHO4, "> outO4.txt");
    $thr1 = threads->create(\&main_func, '1', OUTFH1, OUTFHO1, '1',       '1000000');
    $thr2 = threads->create(\&main_func, '2', OUTFH2, OUTFHO2, '1000000', '2000000');
    $thr3 = threads->create(\&main_func, '3', OUTFH3, OUTFHO3, '2000000', '3000000');
    $thr4 = threads->create(\&main_func, '4', OUTFH4, OUTFHO4, '3000000', '4000000');
    $tid1 = $thr1->join();
    $tid2 = $thr2->join();
    $tid3 = $thr3->join();
    $tid4 = $thr4->join();
}

# Non-threaded version: one loop reads all four input files in turn.
sub main_func2 {
    my $line_var2 = 0;
    open(OUTFHO5, "> outO5.txt");
    my ($tid, $out_fh, $start, $stop) = (5, OUTFHO5, '1', '4000000');
    for ($i = 1; $i < 5; $i++) {
        open(INFH, "< out$i.txt");
        while (<INFH>) {
            $line_var2++;
            print $out_fh " Line.. $line_var2\t" . $_;
        }
    }
    return $tid;
}

#timethese( 20,
#    { 'before' => \&main_func2,
#      'after'  => \&super_main }
#);

cmpthese( 20,
    { 'before' => \&main_func2,
      'after'  => \&super_main }
);
Both timethese and cmpthese show poor performance for 'after'..
---------- Perl ----------
        s/iter  after before
after     10.9     --   -18%
before    8.99    22%     --
Output completed (12 min 45 sec consumed) - Normal Termination
So the question is... Have I made a mistake in the program... OR is threading not that beneficial??

Replies are listed 'Best First'.
Re: Threads Doubt
by BrowserUk (Patriarch) on Oct 17, 2008 at 15:28 UTC

    If your machine does not have multiple processors, then all your threads will be sharing time on the same processor, so there is no advantage. But by using multiple threads you incur the overhead of the threading support itself, resulting in a net loss in throughput.

    However, even if you have multiple processors, if your files all reside on the same drive, then by using multiple threads, you are causing the read heads to jump around all over the disk in order to try and supply the separate threads with data, and you are again incurring overhead not present with the single-threaded process.

    The only way you will see benefit from threading this kind of IO-bound processing is if you have multiple processors and can arrange for the files being read and/or written to reside on different, local disks. And note: different physical drives, not different logical partitions of the same drive. Even then, the splitting of the system filecache between different concurrent files is likely to hit throughput more than any gains you might achieve.
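    (As an aside: if the per-line processing ever became heavy enough to be CPU-bound rather than IO-bound, the usual way to use threads without multiplying the number of readers is a single reader feeding worker threads through a Thread::Queue. A minimal sketch follows; the worker count, file name and per-line work are purely illustrative, and for the IO-bound case described above this still only adds overhead.)

    #!/usr/bin/perl
    use strict;
    use warnings;
    use threads;
    use Thread::Queue;

    my $queue   = Thread::Queue->new();
    my $WORKERS = 4;    # illustrative; match it to your CPU count

    # Worker threads: pull lines off the queue until they see undef.
    my @workers = map {
        threads->create(sub {
            my $count = 0;
            while (defined(my $line = $queue->dequeue())) {
                # ... CPU-heavy per-line work would go here ...
                $count++;
            }
            return $count;
        });
    } 1 .. $WORKERS;

    # Single reader: only one thread ever touches the disk.
    open my $in, '<', 'out1.txt' or die "open out1.txt: $!";    # file name illustrative
    while (my $line = <$in>) {
        $queue->enqueue($line);
    }
    close $in;

    # One undef per worker tells it to finish up.
    $queue->enqueue(undef) for 1 .. $WORKERS;

    my $total = 0;
    $total += $_->join() for @workers;
    print "Workers handled $total lines\n";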


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      For large files like this, the read heads will still be "jumping around" anyway, because unless you just de-fragged your disk, the file is likely to be physically spread out. Even putting files on separate disks may not improve performance all that much, because your actual IO rate depends on many things:
      1. Individual disk performance. RPMs, local buffer, etc
      2. RAID? Mirroring can typically support twice the read performance
      3. Controller type. SCSI is better for multi-tasking than EIDE. However SCSI supports lots more devices, so if it has 7 devices all accessed simultaneously, you're no better off
      4. Bus speed and contention.
      5. (Most important) Application contention. What else on your system is trying to use the same disks? Are they shared?
      In general, your best bet for IO performance is to make sure that when you can read, read as much as you can.
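      For example, one low-effort way to do that in Perl is to switch the input record separator to a fixed record size, so each read pulls in megabytes at a time instead of one line. A minimal sketch, with the 8 MB record size and file name purely illustrative (handling of lines split across record boundaries is left out):

      #!/usr/bin/perl
      use strict;
      use warnings;

      my $RECSIZE = 8 * 1024 * 1024;    # illustrative record size

      open my $in, '<', 'large_input.txt' or die "open large_input.txt: $!";
      {
          local $/ = \$RECSIZE;         # record mode: each <> returns up to 8 MB
          while (my $chunk = <$in>) {
              # A chunk usually ends mid-line; real code would carry the
              # partial last line over into the next chunk before splitting.
              my @lines = split /\n/, $chunk;
              # ... process @lines ...
          }
      }
      close $in;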
        For large files like this, the read heads will still be "jumping around" anyway, because unless you just de-fragged your disk, the file is likely to be physically spread out.

        Yes I know. But, if you are trying to read from 5 files concurrently, your read heads are going to be jumping around far more than if you are only reading from one file. (All else being equal.)

        And depending upon your OS and filing system, 5 concurrent readers means far less system cache devoted to each file, which will further decrease throughput. Then there are factors such as on-disk caching and myriad other hardware and software related factors.

        But as a cogent, if simplified, explanation of why multithreading can have a negative effect on the throughput of the OP's application, I think my post stands on its own.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Threads Doubt
by perrin (Chancellor) on Oct 17, 2008 at 14:27 UTC
Re: Threads Doubt
by Illuminatus (Curate) on Oct 17, 2008 at 15:19 UTC
    Line-based reading is not, in general, very efficient from an IO standpoint. It is buffered IO, but it typically only reads in about 64K blocks. You should see performance improvement by using something like:
    while (sysread(MYFILE, $myBuffer, 10*1024*1024) > 0) {
        # process buffer line by line
    }
    regardless of whether you thread or not. My guess is that using 2 threads in this manner would be more efficient, but only if you're on a dual-processor system (I'm not even sure a hyper-threaded/dual-core processor would help).
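    A minimal fleshed-out sketch of that sysread approach (the file name and the 10 MB buffer size are illustrative; the only real subtlety is carrying a partial last line over into the next chunk):

    #!/usr/bin/perl
    use strict;
    use warnings;

    open my $in, '<', 'out1.txt' or die "open out1.txt: $!";    # file name illustrative

    my ($buf, $leftover, $line_count) = ('', '', 0);

    # Read 10 MB at a time instead of line by line.
    while (sysread($in, $buf, 10 * 1024 * 1024)) {
        $buf = $leftover . $buf;

        # Keep the (possibly partial) last line for the next chunk.
        my @lines = split /\n/, $buf, -1;
        $leftover = pop @lines;

        for my $line (@lines) {
            $line_count++;
            # ... process $line here ...
        }
    }

    # Whatever remains after the final read is the last line, if any.
    $line_count++ if length $leftover;

    close $in;
    print "Processed $line_count lines\n";

    The same chunk-then-split loop works whether it runs unthreaded or inside each thread.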
Re: Threads Doubt
by zentara (Cardinal) on Oct 17, 2008 at 13:56 UTC
    So the question is... Have I made a mistake in the program... OR is threading not that beneficial ??

    Threading won't speed things up when disk I/O is the main bottleneck, as seems to be the case here. It will actually slow things down, because of the overhead that threads impose.

    Threads are only really useful when you need realtime sharing of data between threads, and in a few other cases, such as having a thread watch STDIN for you.

    Your best bet for speeding things up is to put the huge files on separate hard disks, or maybe usb drives.


    I'm not really a human, but I play one on earth Remember How Lucky You Are