in reply to Re: how to split huge file reading into multiple threads
in thread how to split huge file reading into multiple threads

'split' is a good utility and splits the file easily and fast. I can now split the file into some 5-6 pieces and then create a thread for each. However, the file that all these threads will be writing data to is the same. How can I expedite it?


Re^3: how to split huge file reading into multiple threads
by zentara (Cardinal) on Aug 30, 2011 at 11:07 UTC
    However, the file that all these threads will be writing data to is the same. How can I expedite it?

    Maybe write the results to separate files and merge them when finished? Or maybe use separate dbm files; see merging dbm files.
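
    A minimal sketch of the merge step for the first suggestion, assuming the per-thread result files are named results.0, results.1, and so on (the names are illustrative):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Concatenate the per-thread result files into one final file.
        open my $final, '>', 'final.txt' or die "open final.txt: $!";
        for my $part ( sort glob 'results.*' ) {
            open my $in, '<', $part or die "open $part: $!";
            print {$final} $_ while <$in>;
            close $in;
        }
        close $final;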


    I'm not really a human, but I play one on earth.
    Old Perl Programmer Haiku ................... flash japh
Re^3: how to split huge file reading into multiple threads
by zentara (Cardinal) on Aug 30, 2011 at 22:35 UTC
    It dawned on me that there is another way to collect the output from the threads. You can open a filehandle in the main thread, pass its fileno to the thread, then let the thread write to the dup'd filehandle. See [threads] Open a file in one thread and allow others to write to it for the technique.
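
    A minimal sketch of that hand-off (the filename is illustrative; '>&' dups the descriptor, so the worker's close leaves the main thread's handle intact):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use threads;
        use IO::Handle;

        open my $out, '>', 'results.txt' or die "open: $!";
        my $fd = fileno $out;

        my $thr = threads->create( sub {
            my ($fd) = @_;
            # Attach a new handle to a dup of the passed-in descriptor.
            open my $dup, '>&', $fd or die "dup: $!";
            $dup->autoflush(1);
            print {$dup} "hello from thread ", threads->tid(), "\n";
            close $dup;
        }, $fd );

        $thr->join;
        close $out;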

    Anyway, you could open one filehandle per thread, for that thread to report results back to the main thread. Pass the fileno of that filehandle to each thread at creation time. In the main thread, set up an IO::Select object to watch all the filehandles. Have the main thread open a private filehandle for the final output file, and as the select loop reads data from each thread, write it out to that file.

    This would allow the threads to write without worrying about locking, while the main thread's select loop handles the actual writing (and possibly sorting) of the data out to the file.
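
    A sketch of the skeleton, with some assumptions worth flagging: it uses pipe() pairs in place of the per-thread filehandles, since select() only blocks usefully on pipes and sockets (not plain files), and the __DONE__ marker is an invented convention so the loop knows when a worker has finished:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use threads;
        use IO::Select;
        use IO::Handle;

        my $select = IO::Select->new;
        my @thr;
        for my $id ( 1 .. 3 ) {
            pipe( my $rd, my $wr ) or die "pipe: $!";
            $select->add($rd);
            push @thr, threads->create( \&worker, $id, fileno $wr );
            close $wr;    # main keeps only the read ends
        }

        open my $final, '>', 'merged.txt' or die "open merged.txt: $!";
        my $running = @thr;
        while ($running) {
            for my $fh ( $select->can_read ) {
                my $n = sysread( $fh, my $buf, 65536 );
                if ( !$n or $buf =~ s/__DONE__\n\z// ) {
                    $select->remove($fh);
                    $running--;
                }
                print {$final} $buf if length $buf;
            }
        }
        $_->join for @thr;
        close $final;

        sub worker {
            my ( $id, $wfd ) = @_;
            open my $out, '>&', $wfd or die "dup: $!";
            $out->autoflush(1);
            print {$out} "result from worker $id\n";
            print {$out} "__DONE__\n";    # tell the select loop we're done
            close $out;
        }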

    I don't know how it would perform speed-wise, as the select loop can get bogged down if one thread reports a lot of data, but this might be minimized by using large filehandle buffers.

    That is what I would try first.


    I'm not really a human, but I play one on earth.
    Old Perl Programmer Haiku ................... flash japh

      Could you explain a little more about how you envisage this working, please?

        For each chunk of the split file, a thread will be created. In the thread-creation loop, each thread gets passed the chunk filename and a unique fileno. The fileno is derived from a read-write (+>) filehandle created in the loop, one for each thread. In the main thread, each of those filehandles is added to an IO::Select object, and after the worker-thread-creation loop is finished, the main thread sits in a loop watching the IO::Select object. The threads dup the filenos for writing and write their output there.

        The idea would probably also work for forked worker processes, and there it is even simpler: a child forked after the filehandles are opened inherits those descriptors from the parent, so it can write to them directly without any fileno passing.
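
        A quick sketch of the fork flavor, again using a pipe for illustration; the child inherits the write end across fork(), so nothing needs to be passed at all:

            #!/usr/bin/perl
            use strict;
            use warnings;

            pipe( my $rd, my $wr ) or die "pipe: $!";
            defined( my $pid = fork ) or die "fork: $!";
            if ( $pid == 0 ) {           # child: inherited $wr across fork
                close $rd;
                print {$wr} "result from child $$\n";
                close $wr;
                exit 0;
            }
            close $wr;                   # parent keeps only the read end
            print while <$rd>;           # copy the child's output to STDOUT
            close $rd;
            waitpid $pid, 0;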

        The IO::Select loop in the main thread would be similar in setup to a socket-watching program. As data arrives, $select->can_read reports the ready handles, the loop reads the data (preferably with sysread in big chunks), and the data is simply copied to an output filehandle.
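
        Fleshing out the earlier skeleton, an end-to-end sketch might look like this. The chunk names ('xaa', 'xab', ...) are what split produces by default; pipes stand in for the +> filehandles, and the per-line "processing" and the __DONE__ marker are illustrative stand-ins:

            #!/usr/bin/perl
            use strict;
            use warnings;
            use threads;
            use IO::Select;
            use IO::Handle;

            my @chunks = glob 'x??';    # split's default chunk names

            my $select = IO::Select->new;
            my @thr;
            for my $chunk (@chunks) {
                pipe( my $rd, my $wr ) or die "pipe: $!";
                $select->add($rd);
                push @thr, threads->create( \&worker, $chunk, fileno $wr );
                close $wr;    # main keeps only the read ends
            }

            open my $final, '>', 'output.txt' or die "open output.txt: $!";
            my $running = @thr;
            while ($running) {
                for my $fh ( $select->can_read ) {
                    my $n = sysread( $fh, my $buf, 65536 );    # big reads
                    if ( !$n or $buf =~ s/__DONE__\n\z// ) {
                        $select->remove($fh);
                        $running--;
                    }
                    print {$final} $buf if length $buf;
                }
            }
            $_->join for @thr;
            close $final;

            sub worker {
                my ( $chunk, $wfd ) = @_;
                open my $out, '>&', $wfd  or die "dup: $!";
                open my $in,  '<', $chunk or die "open $chunk: $!";
                $out->autoflush(1);
                while ( my $line = <$in> ) {
                    # Stand-in "processing": tag each record with its chunk.
                    print {$out} "$chunk: $line";
                }
                close $in;
                print {$out} "__DONE__\n";    # end-of-output marker
                close $out;
            }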

        A few points the OP would have to watch:

        1. Making sure the split of the original huge file doesn't cut in the middle of a line, leaving a few records broken. (split only cuts mid-line when splitting by bytes with -b; splitting by line count with -l keeps records whole.)

        2. Making sure the IO::Select loop doesn't clog up and slow down the output of some threads because one overly aggressive thread outputs too much and hogs the loop. One possible mitigation is to use the largest filehandle buffers the platform allows, so slower threads can keep writing into their buffers while one thread's output is heavy.

        The code should be fairly straightforward, and someone as agile with thread code as you could probably whip it out quickly. For me, it would take all morning, and I prefer f'ing off. :-)


        I'm not really a human, but I play one on earth.
        Old Perl Programmer Haiku ................... flash japh