in reply to Re: Using kernel-space threads with Perl
in thread Using kernel-space threads with Perl

Each thread does not necessarily need to see all the data. The job is embarrassingly parallelizable, so I was going to simply carve it up into pieces. The problem was getting each of the pieces into the threads. If I split up the data beforehand, then all of it still gets copied into the threads. Maybe I'm using the wrong strategy?

Thanks!

Re^3: Using kernel-space threads with Perl
by BrowserUk (Patriarch) on Mar 22, 2011 at 00:19 UTC
    The job is embarrassingly parallelizable, so I was going to simply carve it up into pieces.

    Then it may be possible to do something useful. It depends on where you are getting the data from and how it can be subdivided.

    A little more information about the data and the processing to be performed on that data might suggest a better technique.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re^3: Using kernel-space threads with Perl (order)
by tye (Sage) on Mar 22, 2011 at 00:44 UTC

    Create the threads first and then have each thread load just the data it needs (and don't share it, of course). Then there won't be extra copies of that stuff created.
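
    A minimal sketch of that ordering, where load_chunk and process are hypothetical stand-ins for however the data is actually fetched and worked on:

        use strict;
        use warnings;
        use threads;

        my $n_workers = 4;

        # Spawn the threads *before* any data exists, so there is
        # nothing large for ithreads to clone into each of them.
        my @workers = map { threads->create( \&worker, $_ ) }
                      0 .. $n_workers - 1;

        $_->join for @workers;

        sub worker {
            my ($chunk_id) = @_;
            my $data = load_chunk($chunk_id);    # hypothetical loader
            process($data);                      # hypothetical work
            return;
        }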

    - tye        

      Would you care to expand that a little?

      Say, a little pseudo-code showing how you would manage the threads reading from the same file concurrently?


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        I didn't see any mention of files, so I wasn't going to jump to the conclusion that "huge dataset" means some single huge file, or even any files at all.

        "The job is embarrassingly parallelizable, so I was going to simply carve it up into pieces. The problem was getting each of the pieces into the threads. If I split up the data before hand, then all of it still gets copied into the threads."

        So the OP already has an idea of how to "carve up" the data, and it seems that reading the data in isn't unacceptably slow as-is; the problem is just running out of memory when the iThreads are created afterwards. So I don't see why you jump to the conclusion of wanting to read the data in parallel either.

        If you have your heart set on writing some pseudocode for loading the data, then you'll need to await the "more information" that you already asked for. In the meantime, the answer I provided may well be enough for the OP to adjust the way he already knows how to load the data so that much less memory is required.

        To amplify what JavaFan mentioned, there is a module, forks.pm, that will allow "copy on write" sharing of the loaded data. Exactly how the data is loaded and used might mean that this is an insignificant advantage in the long run, but it is also trivial to try (if you aren't running on MS Windows) and might make a huge difference.
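
        A minimal sketch of that, assuming all the data fits in one structure loaded by a hypothetical load_all_data before the threads start. forks.pm implements the threads API via fork(), so on a copy-on-write OS the children share the parent's pages until they write to them:

            use strict;
            use warnings;
            use forks;    # drop-in for threads; must be loaded in its place

            # Load everything in the parent *before* spawning, so the
            # forked "threads" inherit it copy-on-write rather than
            # each receiving an explicit duplicate.
            my $data = load_all_data();    # hypothetical loader

            my @workers = map {
                my $slice = $_;
                threads->create(
                    sub { process_slice( $data, $slice ) }    # hypothetical work
                );
            } 0 .. 3;

            $_->join for @workers;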

        Having the parent read in the data and hand off each piece to the appropriate thread(s) (I'm guessing via Thread::Queue might be a good way) is the most general method that springs to my mind. I'd probably do something similar except using processes and simple pipes, as I've often done.
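
        A minimal sketch of that hand-off with Thread::Queue, where read_next_piece and process_piece are assumptions standing in for the OP's real input and work:

            use strict;
            use warnings;
            use threads;
            use Thread::Queue;

            my $q = Thread::Queue->new;

            # Start the workers before reading anything, so the queue is
            # the only thing that ever crosses the thread boundary.
            my @workers = map {
                threads->create( sub {
                    while ( defined( my $piece = $q->dequeue ) ) {
                        process_piece($piece);    # hypothetical work
                    }
                } );
            } 1 .. 4;

            # The parent reads and hands off one piece at a time.
            while ( my $piece = read_next_piece() ) {    # hypothetical reader
                $q->enqueue($piece);
            }

            $q->enqueue(undef) for @workers;    # one terminator per worker
            $_->join for @workers;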

        - tye