in reply to Re^5: how to split huge file reading into multiple threads
in thread how to split huge file reading into multiple threads

you, could whip out some code quickly.

That is my intent: to get something working and see how it fares. But I'm still not getting a clear picture from your description. I'm a bit confused.

The file handles the main thread is selecting on: are these the same ones whose filenos you passed to the threads for duping? Does that mean the main thread is calling can_read on read-write dups of the same files that the threads are writing to?

IF you could knock up a little code to show the flow -- it doesn't have to work, I can knock it into shape -- then it would probably be quicker than me asking 20 questions trying to work out which filehandles do what :)


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re^7: how to split huge file reading into multiple threads
by zentara (Cardinal) on Aug 31, 2011 at 17:38 UTC
    IF you could knock up a little code to show the flow -- it doesn't have to work, I can knock it into shape

    Well, I ran into some glitches, and the net result is that it's probably easiest just to let each thread write to its own separate file. Of course, your superior insights may see a way out.

    I had to create a dummy file in order to get some filenos, and although select does seem to intercept the thread writes, they still go directly to the file, so there doesn't seem to be any use for the select except to intercept the data as it's being written to disk. The select also seemed to repeat its data reads, but that could probably be fixed.

    Before I saw the above glitch, my idea was to have each thread search through its list for primes, and only print back to main when a prime was found in its range.

    Conclusion: my original suggestion of letting each thread print to its own output file, and merging the files after script completion, is probably best. Maybe if one used an event loop system, a filehandle watch could be used without needing a disk file to get a fileno, but then you would be displaying results in a widget of some sort.

    #!/usr/bin/perl
    use warnings;
    use strict;
    use threads;
    use IO::Select;
    use FileHandle;

    my @ranges = (
        [1,10000000], [10000001,20000000], [20000001,30000000],
        [30000001,40000000], [40000001,50000000]
    );

    my $sel = IO::Select->new();

    # thread launching
    foreach (@ranges){
        my $fh = FileHandle->new();
        open ($fh, '+>', './dummyfile')
            or die "Cannot open ./dummyfile: $!";
        # needed to get filehandle to give a fileno
        # maybe better to use IO::Handle and give it
        # a fileno directly?
        my $start = $_->[0];
        my $end = $_->[1];
        my $fileno = fileno($fh);
        print "$start $end $fileno\n";
        threads->create( \&thread, $start, $end, $fileno )->detach;
        $sel->add($fh);
    }

    # watching thread output
    print "Watching\n\n";

    #while( scalar (threads->list) > 0 ){ # doesn't seem to work
    while(1){
        foreach my $h ($sel->can_read){
            my $buf;
            if ( (sysread($h, $buf, 1024) > 0 ) ){
                print "Main says: $buf\n";
                #truncate $h, 0; # bad idea :-)
            }
        }
    }

    sub thread{
        my( $start, $finish, $fileno ) = @_;
        # fdopen the fileno handed over by the main thread
        open( my $fh, '>&=', $fileno ) or die "Cannot fdopen $fileno: $!";
        print $fh "thread# ",threads->tid()," -> $start, $finish, $fileno \n";
        sleep 5;
        print $fh "thread# ",threads->tid()," -> finishing \n";
    }
    __END__
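
    For comparison, the "each thread prints to its own output file, merge afterwards" approach from the conclusion might look roughly like the sketch below. The sub names, the part-file naming scheme, and the "multiple of 7" test are all made up for illustration; the test is only a stand-in for whatever per-record work the real threads do. The threads are joined rather than detached so that main knows when every part file is complete.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use File::Temp qw(tempdir);

# Worker: process one range and write matches to a private part file.
# The file name pattern is a hypothetical choice.
sub process_chunk {
    my ( $dir, $start, $end ) = @_;
    my $file = "$dir/part-$start-$end.txt";
    open my $out, '>', $file or die "Cannot write $file: $!";
    for my $n ( $start .. $end ) {
        # stand-in for the real per-record test
        print {$out} "$n\n" if $n % 7 == 0;
    }
    close $out or die "Cannot close $file: $!";
    return $file;
}

# Main: start one thread per range, then join them in order and
# append each part file to the merged output.
sub run_and_merge {
    my ( $dir, $merged, @ranges ) = @_;
    my @workers =
        map { threads->create( \&process_chunk, $dir, @$_ ) } @ranges;

    open my $all, '>', $merged or die "Cannot write $merged: $!";
    for my $t (@workers) {
        my ($part) = $t->join;    # blocks until that worker is done
        open my $in, '<', $part or die "Cannot read $part: $!";
        print {$all} $_ while <$in>;
        close $in;
        unlink $part;
    }
    close $all or die "Cannot close $merged: $!";
    return $merged;
}

my $dir = tempdir( CLEANUP => 1 );
my $merged = run_and_merge( $dir, "$dir/merged.txt", [ 1, 50 ], [ 51, 100 ] );
print "merged output written to $merged\n";
```

    Because the parts are appended in the order the ranges were given, the merged file keeps the original record order without any sorting. The price is that nothing is visible to main until a worker has finished its whole range.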

    I'm not really a human, but I play one on earth.
    Old Perl Programmer Haiku ................... flash japh
Re^7: how to split huge file reading into multiple threads
by zentara (Cardinal) on Sep 01, 2011 at 13:32 UTC
    Hi again, I figured out a way with pipes. It uses IO::Pipe and is a bit tricky, because you need to pass the IO::Pipe object to the thread before you call writer or reader on it. Otherwise IO::Pipe throws an error about IO::Pipe::End. But this works as a proof of concept. No disk files are involved, which avoids the fileno problem. Also, since I detached the threads, you need to hit Ctrl-C to exit, or work out a method of detecting when all threads are finished, probably through a shared variable.
    #!/usr/bin/perl
    use warnings;
    use strict;
    use threads;
    use IO::Select;
    use IO::Pipe;

    my @ranges = (
        [1,10000000], [10000001,20000000], [20000001,30000000],
        [30000001,40000000], [40000001,50000000]
    );

    my $sel = IO::Select->new();

    # thread launching
    foreach (@ranges){
        my $pipe = IO::Pipe->new();
        my $start = $_->[0];
        my $end = $_->[1];
        print "$start $end $pipe\n";
        threads->create( \&thread, $start, $end, $pipe )->detach;

        # only call reader after pipe has been passed to thread
        $sel->add( $pipe->reader() );
    }

    # watching thread output
    print "Watching\n\n";
    while(1){
        foreach my $h ($sel->can_read){
            my $buf;
            if ( (sysread($h, $buf, 1024) > 0 ) ){
                print "Main says: $buf\n";
            }
        }
    }

    sub thread{
        my( $start, $finish, $pipe ) = @_;
        my $wh = $pipe->writer;
        $wh->autoflush(1);
        print $wh "thread# ",threads->tid()," -> $start, $finish, $pipe \n";
        sleep 5;
        print $wh "thread# ",threads->tid()," -> finishing \n";
        sleep 2;
    }
    __END__
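
    One way the "shared variable" idea for detecting when all threads are finished might be sketched is with a counter from threads::shared. This is only an assumption about how it could be done, not tested against the script above: the names $running, worker and collect are invented, and the 0.25 second can_read timeout is an arbitrary choice so that the main loop wakes up to check the counter.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use threads::shared;
use IO::Select;
use IO::Pipe;

# Hypothetical shared counter of workers that have not finished yet.
my $running :shared = 0;

sub worker {
    my ( $id, $pipe ) = @_;
    my $wh = $pipe->writer;
    $wh->autoflush(1);
    print $wh "worker $id done\n";
    close $wh;
    # Decrement only after everything has been written.
    { lock($running); $running--; }
}

sub collect {
    my (@ids) = @_;
    my $sel = IO::Select->new();
    for my $id (@ids) {
        my $pipe = IO::Pipe->new();
        { lock($running); $running++; }
        threads->create( \&worker, $id, $pipe )->detach;
        # only call reader after the pipe has been passed to the thread
        $sel->add( $pipe->reader() );
    }

    my $out = '';
    while (1) {
        # Sample the counter first: if it is already zero, any data
        # the workers wrote is sitting in the pipes by now.
        my $done;
        { lock($running); $done = ( $running == 0 ); }

        my @ready = $sel->can_read(0.25);
        for my $h (@ready) {
            my $buf;
            my $n = sysread( $h, $buf, 1024 );
            if ($n) {
                $out .= $buf;
            }
            else {
                # 0 means EOF, undef means error: stop watching it
                $sel->remove($h);
                close $h;
            }
        }
        last if $done && !@ready;
    }
    return $out;
}

print collect( 1 .. 3 );
```

    The loop only exits on a pass where the counter was already zero and nothing was left to read, so output written just before a worker finished is not lost. Since the threads are detached, perl may still warn about active threads if main exits while a worker is in the middle of shutting down.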


      Sorry, but I still do not understand how this in any way helps solve the OP's problem.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        The OP said he was successful in splitting his huge file and putting each chunk into a thread for processing, but was unsure how to collect the output generated by each thread. He said a database write was unacceptable to his boss.

        The concept of each thread writing to its own output file and merging the files after script completion didn't appear to be a good solution to him, or else he wouldn't have brought the issue up.

        Consider a situation where the threads process millions of records, but only need to report back a few matches. This method will allow the main thread to collect those matches and operate on them in real time. Otherwise the main thread would have to tail the threads' output files.
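
        The same "scan a lot, report only the matches" flow could also be sketched with the core Thread::Queue module in place of pipes and select. To be clear, this swaps in a different technique from the pipe script; it is shown only as a rough alternative. The sub names are invented, the trial-division primality test is deliberately naive, and each worker enqueues undef as a hypothetical end-of-work marker.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;

# Naive trial-division test, for illustration only.
sub is_prime {
    my ($n) = @_;
    return 0 if $n < 2;
    for ( my $d = 2 ; $d * $d <= $n ; $d++ ) {
        return 0 if $n % $d == 0;
    }
    return 1;
}

# Worker: scan one range, report only the matches, then send undef
# to tell main that this worker has nothing more to say.
sub scan_range {
    my ( $q, $start, $end ) = @_;
    for my $n ( $start .. $end ) {
        $q->enqueue($n) if is_prime($n);
    }
    $q->enqueue(undef);
}

sub find_primes {
    my (@ranges) = @_;
    my $q = Thread::Queue->new();
    my @workers =
        map { threads->create( \&scan_range, $q, @$_ ) } @ranges;

    my @found;
    my $pending = @workers;
    while ($pending) {
        my $item = $q->dequeue();    # blocks until something arrives
        if ( defined $item ) {
            push @found, $item;      # main can act on the match here
        }
        else {
            $pending--;              # one worker has finished
        }
    }
    $_->join for @workers;
    return sort { $a <=> $b } @found;
}

print join( ' ', find_primes( [ 1, 50 ], [ 51, 100 ] ) ), "\n";
```

        Matches arrive in main as soon as a worker finds them, so they can be handled while the other ranges are still being scanned, and counting the undef markers tells main when to stop without Ctrl-C.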


        I'm not really a human, but I play one on earth.
        Old Perl Programmer Haiku ................... flash japh