in reply to Re^3: how to split huge file reading into multiple threads
in thread how to split huge file reading into multiple threads
Could you explain a little more about how you envisage this working please?
Re^5: how to split huge file reading into multiple threads
by zentara (Cardinal) on Aug 31, 2011 at 11:21 UTC
The idea would probably also work for forked worker processes, but you would need to pass the $pid of the parent process as well as a fileno, since filehandles used by the same owner are writable by all processes of that owner.

The IO::Select loop in the main thread would be similar in setup to a socket-watch program. As data shows up via $select->can_read, the main thread reads it (preferably with sysread, in huge chunks) and just copies it to an output filehandle.

A few points the OP would have to watch:

1. Making sure the split of the original huge file doesn't fall in the middle of a line, leaving a few records broken.
2. Making sure that IO::Select doesn't clog up and slow down the output of some threads because one overly aggressive thread is outputting too much and hogging the Select object. One possible solution would be to use the largest filehandle buffers possible on the platform, so slower threads can keep outputting to their buffers if one thread's output becomes very heavy.

The code should be fairly straightforward, and possibly someone as agile with thread code as you could whip out some code quickly. For me, it would take all morning, and I prefer f'ing off. :-)

I'm not really a human, but I play one on earth. Old Perl Programmer Haiku ................... flash japh
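Point 1 above (not splitting in the middle of a line) could be handled roughly as follows. This is only a sketch, not code from the thread: the sub name `line_aligned_ranges` and its arguments are made up for illustration, and it assumes "\n" line endings. It guesses evenly spaced byte offsets, then pushes each guess forward to the end of the line it landed in, so every worker gets a range of whole lines.

```perl
use strict;
use warnings;

# Hypothetical helper: split a file into at most $parts byte ranges,
# each given as [start, end) and each ending just after a newline (or at EOF).
sub line_aligned_ranges {
    my ($path, $parts) = @_;
    open my $fh, '<:raw', $path or die "open $path: $!";
    my $size = -s $fh;
    my @ranges;
    my $start = 0;
    for my $i (1 .. $parts) {
        last if $start >= $size;
        # Evenly spaced guess; the last part always runs to EOF.
        my $end = $i == $parts ? $size : int($size * $i / $parts);
        # A long line may already have carried the previous range past this guess.
        next if $end <= $start;
        if ($end < $size) {
            seek $fh, $end, 0 or die "seek $path: $!";
            my $rest_of_line = <$fh>;    # consume up to and including the next newline
            $end = tell $fh;
        }
        push @ranges, [$start, $end];
        $start = $end;
    }
    close $fh;
    return @ranges;
}
```

Each worker would then seek to its own start offset and stop reading once it reaches its end offset, so no record is seen by two workers or cut in half.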
by BrowserUk (Patriarch) on Aug 31, 2011 at 12:39 UTC
"you, could whip out some code quickly."

That is my intent: to get something working and see how it fares. But I'm still not getting a clear picture from your description, and I'm a bit confused.

The file handles the main thread is selecting on, are these the same ones whose filenos you passed to the threads for duping? Does that mean that the main thread is can-reading on rw dups of the same files that the threads are writing to?

If you could knock up a little code to show the flow -- it doesn't have to work, I can knock it into shape -- then it would probably be quicker than me asking 20 questions trying to work out which filehandles do what :)

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
by zentara (Cardinal) on Aug 31, 2011 at 17:38 UTC
Well, I ran into some glitches, and the net result is that it's probably easiest just to let each thread write to its own separate file. Of course, your superior insights may see a way out.

I had to create a dummy file in order to get some filenos, and although select does seem to intercept the thread writes, they still go directly to the file. So there doesn't seem to be any use for the select, except to intercept the data as it's being written to disk. The select also seemed to repeat its data reads, but that could probably be fixed.

Before I saw the above glitch, my idea was to have each thread search through its list for primes, and only print back to main when a prime was found in its range.

Conclusion: my original suggestion of letting each thread print to its own output file, and merging the files after the script completes, is probably best. Maybe if one used an event loop system, a filehandle watch could be used without needing a disk file to get a fileno, but then you would be displaying results in a widget of some sort.
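The merge step in that conclusion might look something like the sketch below. It is not zentara's code (the code posted in this thread is not shown here); the sub name `merge_parts` and the idea that each worker leaves behind one part file are assumptions for illustration. It concatenates the part files in the order given, reading with sysread in 1 MB chunks.

```perl
use strict;
use warnings;

# Hypothetical helper: concatenate per-thread output files, in the order given,
# into one output file.
sub merge_parts {
    my ($out_path, @part_paths) = @_;
    open my $out, '>:raw', $out_path or die "open $out_path: $!";
    for my $part (@part_paths) {
        open my $in, '<:raw', $part or die "open $part: $!";
        while (1) {
            # sysread returns undef on error and 0 at end of file.
            my $n = sysread($in, my $buf, 1 << 20);
            die "read $part: $!" unless defined $n;
            last if $n == 0;
            print {$out} $buf or die "write $out_path: $!";
        }
        close $in;
    }
    close $out or die "close $out_path: $!";
    return;
}
```

If the part files are named after the range each thread handled, passing them in range order keeps the merged output in the same order as the input.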
by zentara (Cardinal) on Sep 01, 2011 at 13:32 UTC
by BrowserUk (Patriarch) on Sep 02, 2011 at 02:21 UTC
by zentara (Cardinal) on Sep 02, 2011 at 09:14 UTC