For each chunk of the split file, a thread will be created. In the thread-creation loop, each thread gets passed its chunk filename and a unique fileno. The fileno comes from a read-write (+>) filehandle created in the loop, one per thread. In the main thread, each of those filehandles is added to an IO::Select object, and once the worker-thread-creation loop has finished, the main thread sits in a loop watching the IO::Select object. The threads dup their filenos for writing and write their output there.

The idea would probably also work for forked worker processes, but you would need to pass the $pid of the parent process as well as a fileno, since filehandles used by the same owner are writable by all processes of that owner.

The IO::Select loop in the main thread would be set up much like a socket-watching program. As data shows up via $select->can_read, the main thread reads it (preferably with sysread, in huge chunks) and just copies it to an output filehandle.
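Here is a minimal sketch of that kind of select loop, under some loudly stated assumptions. It swaps in pipes and forked workers in place of the +> handles and threads described above, because a pipe gives can_read a real end-of-file to detect. The names run_workers and process_chunk are made up for illustration, and the uppercasing is only a stand-in for whatever the real per-chunk work is.

```perl
use strict;
use warnings;
use IO::Select;
use POSIX ();

# Hypothetical per-chunk work: uppercase every line of the chunk file
# and write it to the handle we were given.
sub process_chunk {
    my ($chunk_file, $out) = @_;
    open(my $in, '<', $chunk_file) or die "open $chunk_file: $!";
    while (my $line = <$in>) {
        print {$out} uc $line;
    }
    close $in;
}

# Fork one worker per chunk, each with its own pipe, then copy whatever
# arrives on any pipe to $out_fh until every worker has closed its end.
sub run_workers {
    my ($out_fh, @chunk_files) = @_;
    my $select = IO::Select->new;
    my @pids;

    for my $chunk (@chunk_files) {
        pipe(my $reader, my $writer) or die "pipe: $!";
        my $pid = fork();
        die "fork: $!" unless defined $pid;
        if ($pid == 0) {                 # child: write side only
            close $reader;
            process_chunk($chunk, $writer);
            close $writer;               # flushes buffered output
            POSIX::_exit(0);             # skip the parent's cleanup/flushes
        }
        close $writer;                   # parent keeps only the read end
        $select->add($reader);
        push @pids, $pid;
    }

    while ($select->count) {
        for my $fh ($select->can_read) {
            my $n = sysread($fh, my $buf, 1 << 20);   # read up to 1 MB
            die "sysread: $!" unless defined $n;
            if ($n == 0) {               # EOF: this worker is finished
                $select->remove($fh);
                close $fh;
                next;
            }
            print {$out_fh} $buf;
        }
    }
    waitpid($_, 0) for @pids;
}
```

Something like run_workers(\*STDOUT, @chunk_files) would drive it. Output from different workers arrives in whatever order the pipes become readable, and a sysread can end in the middle of a line, so if records must stay whole you would have to hold back the partial tail for each handle until its newline shows up.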

A few points the OP would have to watch are:

1. Making sure the split of the original huge file doesn't cut in the middle of a line, leaving a few records broken.

2. Making sure that IO::Select doesn't clog up and slow down the output of some threads because one overly aggressive thread is outputting too much and hogging the Select object. One possible solution would be to use the largest filehandle buffers possible on the platform, so slower threads can keep writing into their buffers if one thread's output becomes very heavy.
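For point 1, one common trick is to pick approximate byte offsets and then slide each one forward to the next newline. A small sketch of that, where line_aligned_ranges is a hypothetical helper name and the file is assumed to use "\n" line endings:

```perl
use strict;
use warnings;

# Return a list of [start, end] byte ranges that together cover the file.
# Every range ends just after a newline (or at end of file), so no line
# is ever shared between two ranges.
sub line_aligned_ranges {
    my ($path, $n_chunks) = @_;
    my $size = (-s $path) // 0;
    open(my $fh, '<:raw', $path) or die "open $path: $!";

    my @ranges;
    my $start = 0;
    for my $i (1 .. $n_chunks) {
        last if $start >= $size;
        my $end = $size;                       # last chunk runs to EOF
        if ($i < $n_chunks) {
            my $guess = int($size * $i / $n_chunks);
            $guess = $start if $guess < $start;
            seek($fh, $guess, 0) or die "seek: $!";
            my $partial = <$fh>;               # skip to the end of this line
            $end = tell($fh);
        }
        push @ranges, [$start, $end] if $end > $start;
        $start = $end;
    }
    close $fh;
    return @ranges;
}
```

Each worker would then seek to its start offset and stop reading once tell reaches its end offset. With very long lines or a tiny file you may get fewer ranges back than you asked for, since a range that would be empty is dropped.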

The code should be fairly straightforward, and someone as agile with thread code as you could probably whip something up quickly. For me, it would take all morning, and I prefer f'ing off. :-)


I'm not really a human, but I play one on earth.
Old Perl Programmer Haiku ................... flash japh

In reply to Re^5: how to split huge file reading into multiple threads by zentara
in thread how to split huge file reading into multiple threads by sagarika
