aberman has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I have a complex problem that I'm hoping has a simple answer. I have a very large dataset (~3.5GB) that I need to process in parallel on a multi-core machine. In trying to use threads, I found that Perl creates a copy of the interpreter for each thread. This is a giant problem with a dataset this large, because the entire contents of memory get copied into each thread, so I run out of RAM very quickly. On Linux, kernel threads allow processes to share the same memory, and it is up to you to lock appropriately using semaphores or whatever. Is this possible in Perl, or is there a strategy I can use that won't involve copying the entire contents of memory into each iThread? I would hate to have to go use Java or C to do this. Up until now, I was convinced Perl was unstoppable; please prove me right. :)


Thanks in advance!

--Ari

Replies are listed 'Best First'.
Re: Using kernel-space threads with Perl
by BrowserUk (Patriarch) on Mar 21, 2011 at 23:48 UTC
    Is this possible in Perl, or is there a strategy I can use that won't involve copying the entire contents of memory into each iThread?

    Unfortunately, there is no way to do this efficiently in Perl if every thread needs to be able to access all the data.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Each thread does not necessarily need to see all the data. The job is embarrassingly parallelizable, so I was going to simply carve it up into pieces. The problem was getting each of the pieces into the threads: if I split up the data beforehand, all of it still gets copied into each thread. Maybe I'm using the wrong strategy?

      Thanks!
        The job is embarrassingly parallelizable, so I was going to simply carve it up into pieces.

        Then it may be possible to do something useful. It depends on where you are getting the data from and how it can be subdivided.

        A little more information about the data and the processing to be performed on that data might suggest a better technique.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        Create the threads first and then have each thread load just the data it needs (and don't share it, of course). Then there won't be extra copies of that stuff created.
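
        For instance, a minimal, untested sketch of that approach (the file name, worker count, and per-line work are placeholders): each thread seeks to its own byte range of the input and reads only that slice, and because the threads are spawned before anything large is loaded, there is nothing big for ithreads to clone.

            use strict;
            use warnings;
            use threads;

            my $file    = 'big_data.txt';   # hypothetical input file
            my $workers = 4;
            my $size    = -s $file;
            my $chunk   = int($size / $workers);

            # Create all threads BEFORE any large data is loaded, so
            # nothing big exists yet to be cloned into them.
            my @thr = map {
                my $start = $_ * $chunk;
                my $end   = $_ == $workers - 1 ? $size : $start + $chunk;
                threads->create(\&worker, $start, $end);
            } 0 .. $workers - 1;

            my $total = 0;
            $total += $_->join for @thr;
            print "processed $total records\n";

            sub worker {
                my ($start, $end) = @_;
                open my $fh, '<', $file or die "open $file: $!";
                seek $fh, $start, 0;
                <$fh> if $start;   # discard partial line at the chunk boundary
                my $count = 0;
                while (defined(my $line = <$fh>)) {
                    $count++;      # stand-in for the real per-record work
                    last if tell($fh) >= $end;
                }
                return $count;
            }

        The boundary handling is the usual trick: a worker finishes the line it started even if it crosses its end offset, and the next worker throws away the partial line it lands in, so every record is processed exactly once.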

        - tye        

Re: Using kernel-space threads with Perl
by hermida (Scribe) on Mar 22, 2011 at 13:40 UTC

    If the data that needs to be shared is too large for RAM, another good option is to use a fast DBM like KyotoCabinet, and in your Perl program just use forks with a library like Parallel::Forker, Parallel::ForkManager, Parallel::Prefork, etc.

    I've done things this way and it works really fast, as long as you aren't constantly writing to/changing the shared data (with KyotoCabinet you can specify the size of the RAM cache, but the rest of the DBM lives on the filesystem and will be slower). If you are creating the shared data structure once and then reading from it with your forks, it can be nearly as fast as RAM.
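
    Something along these lines (a rough sketch; the file path and keys are made up, and the calls follow the KyotoCabinet Perl binding's documented API and Parallel::ForkManager's start/finish interface): build the DBM once in the parent, then let each forked worker open its own read-only handle, since DB handles should not be shared across fork().

        use strict;
        use warnings;
        use KyotoCabinet;
        use Parallel::ForkManager;

        my $dbfile = 'dataset.kch';   # hypothetical DBM file

        # Build the DBM once, before forking (writer mode).
        {
            my $db = KyotoCabinet::DB->new;
            $db->open($dbfile, $db->OWRITER | $db->OCREATE)
                or die 'open: ' . $db->error;
            $db->set("key$_", "value$_") for 1 .. 1000;   # stand-in load
            $db->close;
        }

        # Each forked worker opens its own read-only handle.
        my $pm = Parallel::ForkManager->new(4);
        for my $w (1 .. 4) {
            $pm->start and next;   # parent: move on to the next worker
            my $db = KyotoCabinet::DB->new;
            $db->open($dbfile, $db->OREADER)
                or die 'open: ' . $db->error;
            my $val = $db->get("key$w");
            print "worker $w got $val\n";
            $db->close;
            $pm->finish;
        }
        $pm->wait_all_children;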

Re: Using kernel-space threads with Perl
by JavaFan (Canon) on Mar 22, 2011 at 01:40 UTC
    You may also consider using forks. Modern OSes give the appearance that all the data is copied, but they implement it using COW (copy-on-write), so in reality, data is only copied when it's rewritten in a process.

    Of course, as pointed out elsewhere in the thread, first creating threads (or processes) and only then reading in the data wins.
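
    A bare-bones sketch of that (the dataset here is a placeholder, and, as the follow-ups note, Perl can unshare COW pages merely by touching reference counts, so the savings are best-effort): load once in the parent, fork, and have each child make a read-only pass over its own slice.

        use strict;
        use warnings;

        # Load the large structure ONCE in the parent. Forked children
        # inherit its pages copy-on-write instead of getting real copies.
        my @data = (1 .. 5_000_000);   # stand-in for the real dataset

        my $workers = 4;
        my $step    = int(@data / $workers);
        my @pids;
        for my $w (0 .. $workers - 1) {
            my $pid = fork;
            die "fork: $!" unless defined $pid;
            if ($pid == 0) {   # child: read-only pass over its slice
                my $end = $w == $workers - 1 ? $#data : ($w + 1) * $step - 1;
                my $sum = 0;
                $sum += $data[$_] for $w * $step .. $end;
                print "worker $w sum: $sum\n";
                exit 0;        # exit without writing to @data
            }
            push @pids, $pid;
        }
        waitpid $_, 0 for @pids;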

      data is only copied when it's rewritten

      Or treated as a number or ...


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        But I suspect the real killer is when just inc/decrementing a ref count causes an entire page of memory to no longer be shared. But, despite this, I have seen some evidence of caches of Perl data staying partially shared, despite my expectation that this is way too easy to thwart. The case of each child not even looking at most of the data does make the odds improve some.

        If I wanted to share lots of data between Perl child processes, I'd probably at least consider storing that data via Judy.

        - tye        

Re: Using kernel-space threads with Perl
by zentara (Cardinal) on Mar 22, 2011 at 17:16 UTC
    or is there a strategy I can use that won't involve copying the entire contents of memory into each iThread?

    Sure. You can create a shared memory segment, and have your threads or forks read from the shmem. See SysV shared memory --pure perl for a rudimentary example.
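
    A stripped-down illustration using IPC::SharedMem from the IPC-SysV distribution (the payload is a stand-in; a real 3.5GB dataset would need a segment that big and enough SHMMAX headroom): the parent writes the data once, and each forked worker reads only its own byte range out of the segment, so nothing is copied per process up front.

        use strict;
        use warnings;
        use IPC::SysV qw(IPC_PRIVATE S_IRUSR S_IWUSR);
        use IPC::SharedMem;

        # Parent creates the segment and writes the dataset into it once.
        my $payload = join ',', 1 .. 1000;   # stand-in for the real data
        my $len     = length $payload;
        my $shm     = IPC::SharedMem->new(IPC_PRIVATE, $len, S_IRUSR | S_IWUSR)
            or die "shmget: $!";
        $shm->write($payload, 0, $len);

        my $workers = 4;
        my $chunk   = int($len / $workers);
        my @pids;
        for my $w (0 .. $workers - 1) {
            my $pid = fork;
            die "fork: $!" unless defined $pid;
            if ($pid == 0) {
                my $off  = $w * $chunk;
                my $size = $w == $workers - 1 ? $len - $off : $chunk;
                # Read only this worker's byte range out of the segment.
                my $slice = $shm->read($off, $size);
                print "worker $w read ", length($slice), " bytes\n";
                exit 0;
            }
            push @pids, $pid;
        }
        waitpid $_, 0 for @pids;
        $shm->remove;   # free the segment when done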

    Or, you could create a 3.5 gig ramdisk, and have your workers read from it.


    I'm not really a human, but I play one on earth.
    Old Perl Programmer Haiku ................... flash japh
      This is all very good information (as expected). This is my first real try at parallelizing a large problem, so I'm sure I've made some mistakes with it. I'll try to implement some of these ideas later this week and report back to the thread. I'll also post any resulting (working) code at the end to help out others with this problem. Thanks all!