in reply to Re^6: how to split huge file reading into multiple threads
in thread how to split huge file reading into multiple threads

Hi again, I figured out a way with pipes. It uses IO::Pipe and is a bit tricky, because you need to pass the IO::Pipe object to the thread before you call writer or reader on it; otherwise IO::Pipe throws an error about IO::Pipe::End. But this works as a proof of concept. It does not involve any disk files, which avoids the fileno problem. Also, since I detached the threads, you need to hit Ctrl-C to exit, or work out a method of detecting when all threads are finished, probably through a shared variable.
#!/usr/bin/perl
use warnings;
use strict;
use threads;
use IO::Select;
use IO::Pipe;

my @ranges = (
    [1,10000000], [10000001,20000000], [20000001,30000000],
    [30000001,40000000], [40000001,50000000]
);

my $sel = new IO::Select();

# thread launching
foreach (@ranges){
    my $pipe = IO::Pipe->new();
    my $start = $_->[0];
    my $end = $_->[1];
    print "$start $end $pipe\n";
    threads->create( \&thread, $start, $end, $pipe )->detach;

    # only call reader after pipe has been passed to thread
    $sel->add( $pipe->reader() );
}

# watching thread output
print "Watching\n\n";
while(1){
    foreach my $h ($sel->can_read){
        my $buf;
        if ( (sysread($h, $buf, 1024) > 0 ) ){
            print "Main says: $buf\n";
        }
    }
}

sub thread{
    my( $start, $finish, $pipe ) = @_;
    my $wh = $pipe->writer;
    $wh->autoflush(1);
    print $wh "thread# ",threads->tid()," -> $start, $finish, $pipe \n";
    sleep 5;
    print $wh "thread# ",threads->tid()," -> finishing \n" ;
    sleep 2;
}
__END__
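
For the "shared variable" idea mentioned above, here is one possible sketch, not a tested drop-in. It assumes a threads::shared counter that each worker bumps after its last write, and a select timeout so the main loop can notice when every worker has reported in. The names `$finished`, `worker` and `collect_output` are made up for illustration, and the sleeps are dropped to keep it short.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use threads::shared;
use IO::Select;
use IO::Pipe;

my $finished :shared = 0;    # count of workers that have written everything

sub worker {
    my ($start, $end, $pipe) = @_;
    my $wh = $pipe->writer;
    $wh->autoflush(1);
    print $wh "thread# ", threads->tid(), " -> $start, $end\n";
    print $wh "thread# ", threads->tid(), " -> finishing\n";
    close $wh;
    { lock($finished); $finished++; }    # signal only after the last write
}

sub collect_output {
    my @ranges = @_;
    { lock($finished); $finished = 0; }
    my $sel = IO::Select->new();
    for my $r (@ranges) {
        my $pipe = IO::Pipe->new();
        threads->create(\&worker, $r->[0], $r->[1], $pipe)->detach;
        # as in the original: call reader only after the thread has the pipe
        $sel->add($pipe->reader());
    }
    my %buf;    # fileno => text received so far on that pipe
    while (1) {
        # sample the counter BEFORE polling, so "all done and nothing ready"
        # means every write has already been read
        my $all_done;
        { lock($finished); $all_done = ($finished == scalar @ranges); }
        my @ready = $sel->can_read(0.25);
        for my $h (@ready) {
            my $n = sysread($h, my $chunk, 1024);
            if ($n) { $buf{ fileno($h) } .= $chunk; }
            else    { $sel->remove($h); close $h; }    # EOF or read error
        }
        last if $all_done && !@ready;
    }
    return map { split /^/, $_ } values %buf;    # one list element per line
}

print "Main says: $_"
    for collect_output([1, 10_000_000], [10_000_001, 20_000_000]);
```

Because the threads are still detached, one of them may be unwinding when the main thread exits, so Perl can print an "exited with active threads" warning. Joining the threads instead of detaching them would avoid that.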

I'm not really a human, but I play one on earth.
Old Perl Programmer Haiku ................... flash japh

Re^8: how to split huge file reading into multiple threads
by BrowserUk (Patriarch) on Sep 02, 2011 at 02:21 UTC

    Sorry, but I still do not understand how this in any way helps solve the OP's problem.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      The OP said he was successful in splitting his huge file and putting each chunk into a thread for processing, but was unsure how to collect the output generated by each thread. He said a database write was unacceptable to his boss.

      The concept of each thread writing to its own output file and merging them after script completion didn't appear to be a good solution to him, or else he wouldn't have brought the issue up.

      Consider a situation where the threads process millions of records, but only need to report back a few matches. This method will allow the main thread to collect those matches and operate on them in real time. Otherwise the main thread would have to tail the threads' output files.



        Hm. I guess I read the OP's post differently.

        His task description is: read records from a (single) huge file, and write them to one of many (600) output files depending upon their contents. He asked how he could use threads to improve the performance.
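
        For reference, a single-threaded version of that task might look roughly like the sketch below. The record format is an assumption: the key is taken to be the first whitespace-separated field, and each key gets an output file named "<key>.out". The helper name split_by_key is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: route each record of $in_path to "$out_dir/<key>.out".
# The key (first whitespace-separated field) is an assumption for illustration.
sub split_by_key {
    my ($in_path, $out_dir) = @_;
    my %out;      # key => open output filehandle
    my %count;    # key => number of records written
    open my $in, '<', $in_path or die "Cannot open $in_path: $!";
    while (my $line = <$in>) {
        my ($key) = split ' ', $line;
        next unless defined $key;            # skip blank lines
        unless ($out{$key}) {
            # open each output file once and keep the handle cached
            open my $fh, '>>', "$out_dir/$key.out"
                or die "Cannot open $out_dir/$key.out: $!";
            $out{$key} = $fh;
        }
        print { $out{$key} } $line;
        $count{$key}++;
    }
    close $in;
    for my $fh (values %out) {
        close $fh or die "close failed: $!";
    }
    return \%count;
}

# Usage: script.pl INPUT_FILE OUTPUT_DIR
split_by_key(@ARGV) if @ARGV == 2;
```

        With 600 distinct keys this keeps 600 handles open at once, which may run into the per-process open file limit on some systems.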

        You suggested splitting the huge file into several smaller files so that each thread could work on a different part.

        He pointed out that would mean he would have many threads writing to each of the output files.

        You are suggesting that he has many pipes and another thread running a select loop to coalesce the records for each output file before writing them.

        Let's say he has split the huge file into 10 parts and runs 10 threads. Under your scheme, he would need one pipe for each of the 600 output files in each of the 10 threads, plus another 600 threads running select loops to coalesce the records and write them to the 600 output files. So 610 threads and 6000 pipes!
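
        The arithmetic behind those totals, as a tiny sketch (fanout_cost is a made-up helper; it assumes one pipe per worker per output file and one select-loop thread per output file, as described above):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: resource totals for W worker threads feeding
# N output files, with one collector thread per output file.
sub fanout_cost {
    my ($workers, $outputs) = @_;
    return {
        pipes   => $workers * $outputs,    # every worker needs a pipe to every collector
        threads => $workers + $outputs,    # workers plus one select loop per output file
    };
}

my $cost = fanout_cost(10, 600);
print "threads: $cost->{threads}, pipes: $cost->{pipes}\n";
# prints "threads: 610, pipes: 6000"
```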

        And that's before we consider that he has exacerbated the problem by reading the input from 10 separate files concurrently, which will cause the read head to dance all over the disk just to get the input.

