in reply to how to split huge file reading into multiple threads

Can we use threads in such a way that multiple threads act on the million-record file, so that the time is reduced to a few minutes?

Probably not, since all the threads will hit the same bottleneck: they will all be trying to read the same file at the same time. The limiting factor is how fast your hard drive is. See How do you parallelize STDIN for large file processing? and Is Using Threads Slower Than Not Using Threads? for examples.

You may see some improvement if you use a program like split to break your huge file into smaller chunks, place them on separate hard drives, and then let your parallel processes work on them.
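As a minimal sketch of the chunking step in pure Perl (the input file name, chunk count, and demo data below are my own illustration, not anything from the original post; on a real million-record file you would point it at your data instead of generating a demo file):

```perl
use strict;
use warnings;

# Hypothetical names for illustration only.
my $infile = 'records.txt';
my $chunks = 3;

# Create a small demo input so the sketch runs end to end.
open my $demo, '>', $infile or die "Can't write $infile: $!";
print {$demo} "record $_\n" for 1 .. 10;
close $demo;

# Aim for roughly equal chunk sizes, but always break on line
# boundaries so no record is split across two chunks.
my $target = int( (-s $infile) / $chunks ) + 1;

open my $in, '<', $infile or die "Can't read $infile: $!";
my @chunk_files;
for my $n (1 .. $chunks) {
    last if eof $in;
    my $name = "chunk.$n";
    open my $out, '>', $name or die "Can't write $name: $!";
    my $written = 0;
    while ( !eof($in) && $written < $target ) {
        my $line = <$in>;
        print {$out} $line;
        $written += length $line;
    }
    close $out;
    push @chunk_files, $name;
}
close $in;
print scalar(@chunk_files), " chunks written\n";
```

Each resulting chunk.N file can then be handed to its own thread or process.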


I'm not really a human, but I play one on earth.
Old Perl Programmer Haiku ................... flash japh

Re^2: how to split huge file reading into multiple threads
by sagarika (Novice) on Aug 30, 2011 at 09:31 UTC

    'split' is a good utility; it splits files quickly and easily. I can now split the file into 5 or 6 smaller files and create a thread for each one. However, all of those threads will be writing data to the same output file. How can I expedite that?

      However, all of those threads will be writing data to the same output file. How can I expedite that?

      Maybe write the results to separate files and merge them when finished? Or use separate dbm files; see merging dbm files
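      A rough sketch of that "separate files, merge later" idea, assuming a threads-enabled perl; the chunk names, the uppercasing "work", and the output names are all made up for illustration:

```perl
use strict;
use warnings;
use threads;

# Hypothetical chunk files, as produced by split or similar.
my @inputs = ( 'chunk.1', 'chunk.2', 'chunk.3' );

# Create demo chunks so the sketch runs end to end.
for my $in (@inputs) {
    open my $fh, '>', $in or die "Can't write $in: $!";
    print {$fh} "data from $in\n";
    close $fh;
}

# One thread per chunk, each writing to its OWN output file,
# so no locking is needed while the work runs.
my @threads;
for my $i ( 0 .. $#inputs ) {
    push @threads, threads->create(
        sub {
            my ( $in, $out ) = @_;
            open my $ifh, '<', $in  or die "Can't read $in: $!";
            open my $ofh, '>', $out or die "Can't write $out: $!";
            while (<$ifh>) { print {$ofh} uc($_) }    # stand-in for real work
            close $_ for $ifh, $ofh;
        },
        $inputs[$i], "out.$i"
    );
}
$_->join for @threads;

# Single-threaded merge of the per-thread results.
open my $merged, '>', 'merged.out' or die "Can't write merged.out: $!";
for my $i ( 0 .. $#inputs ) {
    open my $fh, '<', "out.$i" or die "Can't read out.$i: $!";
    print {$merged} $_ while <$fh>;
    close $fh;
}
close $merged;
```

      The merge step costs one extra sequential pass, but it removes all contention on the output file while the threads are running.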


      It dawned on me that there is another way to collect the output from the threads. You can open a filehandle in the main thread, pass its fileno to the thread, then let the thread write to the dup'd filehandle. See [threads] Open a file in one thread and allow others to write to it for the technique.

      Anyway, you could open one filehandle per thread, for that thread to report results back to the main thread. Pass the fileno of that filehandle to each thread at creation time. In the main thread, set up an IO::Select object to watch all the filehandles. Have the main thread open a private filehandle for the final output file; as IO::Select reads data from each thread, the main thread writes it to the output file.

      This would let the threads write without worrying about locking, while the main thread's select loop handles writing, and possibly sorting, the data out to the file.

      I don't know how it would perform speed-wise, since select will block if one thread reports a lot of data, but this might be minimized by using large filehandle buffers.

      That is what I would try first.
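      A minimal sketch of that scheme, assuming a threads-enabled perl: one pipe per worker, each thread reopening the write end from the fileno it was handed, and only the main thread's select loop touching the final output file. The record contents, file names, and the `__DONE__` sentinel (which I use instead of relying on pipe EOF semantics across thread clones) are my own additions:

```perl
use strict;
use warnings;
use threads;
use IO::Select;
use IO::Handle;

my $nthreads = 3;
my $sel      = IO::Select->new;
my @threads;

for my $id ( 1 .. $nthreads ) {
    pipe my $rdr, my $wtr or die "pipe: $!";
    $wtr->autoflush(1);
    $sel->add($rdr);

    # Pass the fileno at creation time; the thread reopens it.
    push @threads, threads->create(
        sub {
            my ( $tid, $wfd ) = @_;
            open my $out, '>&=', $wfd or die "fdopen: $!";
            $out->autoflush(1);
            print {$out} "thread $tid: record $_\n" for 1 .. 5;
            print {$out} "__DONE__\n";    # sentinel marking this thread done
        },
        $id, fileno $wtr
    );
}

# Only the main thread ever writes to the final output file,
# so the workers need no locking at all.
open my $final, '>', 'results.out' or die "results.out: $!";
my $done = 0;
while ( $done < $nthreads ) {
    for my $rdr ( $sel->can_read ) {
        # sysread avoids the stdio-buffering-vs-select trap; a production
        # version would also buffer partial lines across reads.
        sysread( $rdr, my $data, 65_536 ) or next;
        $done++ while $data =~ s/^__DONE__\n//m;
        print {$final} $data;
    }
}
close $final;
$_->join for @threads;
```

      Output from different threads may interleave, but each small print to a pipe arrives whole, so the main thread can sort or group lines as it writes them if ordering matters.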



        Could you explain a little more about how you envisage this working please?