in reply to Re^5: Using threads to run multiple external processes at the same time
in thread Using threads to run multiple external processes at the same time

It seems that you were right.

I found something relevant on the R mailing lists:

"> Specifically if it is possible to ask R to run a given R-program from within a posix thread (on linux) without providing a Mutex that would serialise access to R process.

No. You need to make sure that only one thread calls R, which means having some sort of handler to queue the commands."

This means that the threaded approach is useless. Back to square one.
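For the record, the "only one thread calls R, queue the commands" arrangement that reply describes would look something like this in outline. This is a Python sketch of the pattern only; the actual call into R is replaced by a stand-in string, and all names here are illustrative:

```python
import queue
import threading

def r_owner(jobs):
    """The single thread allowed to touch R; all other threads enqueue work."""
    while True:
        expr, reply = jobs.get()
        if expr is None:                  # shutdown sentinel
            break
        # Stand-in for the real (single-threaded-only) call into R.
        reply.put(f"result of {expr}")

jobs = queue.Queue()
threading.Thread(target=r_owner, args=(jobs,), daemon=True).start()

# Any number of threads can submit work without holding a mutex around R itself.
reply = queue.Queue()
jobs.put(("fisher.test(...)", reply))
result = reply.get()
print(result)                             # -> result of fisher.test(...)
jobs.put((None, None))                    # stop the owner thread
```

The point of the pattern is that serialisation happens implicitly in the queue, so callers never block each other on a lock around R.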

Re^7: Using threads to run multiple external processes at the same time
by BrowserUk (Patriarch) on Sep 05, 2009 at 00:12 UTC
    This means that the threaded approach is useless.

    Hm. I'm not sure that is true.

    It's unclear to me from the 3 posts in that thread whether they are talking about talking to multiple processes from different threads--as you are trying to do--or whether they are talking about talking to R.dll from multiple threads when embedding R in a C/C++ program.

    At one point the OP talks of "calling R", at another "the R process". And most of the "threads" discussion by the 2 experts seems to be talking about threading R internally--ie. within a single R process--rather than having two process instances running concurrently.

    I remember many of the dlls in OS/2 v1.x were inherently thread-unsafe, mostly because they were written in C by ex-COBOL programmers who hadn't quite gotten over the 'static data section' way of thinking. But I didn't think anyone still coding for a living was still doing stuff like that.

    By far the simplest way of verifying this would be to run something in each of two concurrent interactive sessions that takes an appreciable amount of time--a minute or two--and see if the time is overlapped or serialised. I have two Rgui sessions running now, but I don't know enough about R to come up with something that doesn't complete instantaneously :(
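    For what it's worth, the overlap test can be automated. The sketch below times two copies of a command run back-to-back versus side-by-side; the sleeping Python child is just a stand-in for a long-running rscript invocation (swap in the real command to test R itself):

```python
import subprocess
import sys
import time

def timed_run(cmds):
    """Start all commands at once, wait for all of them, return wall-clock elapsed."""
    t0 = time.time()
    procs = [subprocess.Popen(c) for c in cmds]
    for p in procs:
        p.wait()
    return time.time() - t0

# Stand-in for a long-running R job; replace with something like
# ['rscript', '-e', '...'] to test R processes for serialisation.
job = [sys.executable, '-c', 'import time; time.sleep(1)']

serial = timed_run([job]) + timed_run([job])   # one after the other
overlapped = timed_run([job, job])             # two at once

# If the two processes truly run in parallel, 'overlapped' should be close to
# half of 'serial'; if access is serialised it approaches 'serial' instead.
parallel = overlapped < 0.75 * serial
print(parallel)
```

The 0.75 threshold is arbitrary slack for process startup; the interesting signal is whether the overlapped run takes roughly half the serial time or all of it.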


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      I'll look into it, problem is that I can do it only when I'm back at work on Monday.

      Meanwhile I'll try to ask around on the R-help mailing list, maybe they will be able to clarify the issue.

      And I think I'll write a "Tarzan-be-strong-Tarzan-makes-other-hole" version of my program that uses a shared directory to distribute workload to networked clients, like I was advised earlier in this thread. I should probably be working on something useful and productive instead, but I'm a stubborn bastard and I really want to solve this problem now.

        I think that I've pretty much confirmed that R does some serialisation of resource usage across concurrent process instances. Even though the evidence is not as clear cut as I would like. In the following console trace, I start 1, 2, 3 & 4 concurrent copies of R performing a ~60 second calculation and timing it with its own built-in timer. I set the affinity of each process to 1 of my cpus to prevent cpu thrash:

        for /l %i in (1,1,1) do @start /b /affinity %i rscript -e"system.time(for(i in 1:1e4){fisher.test(matrix(c(200,300,400,500),nrow=2))})"
           user  system elapsed
          62.24    0.00   62.68

        for /l %i in (1,1,2) do @start /b /affinity %i rscript -e"system.time(for(i in 1:1e4){fisher.test(matrix(c(200,300,400,500),nrow=2))})"
           user  system elapsed
          65.49    0.00   65.60
           user  system elapsed
          65.19    0.01   66.13

        for /l %i in (1,1,3) do @start /b /affinity %i rscript -e"system.time(for(i in 1:1e4){fisher.test(matrix(c(200,300,400,500),nrow=2))})"
           user  system elapsed
          65.61    0.06   65.94
           user  system elapsed
          65.75    0.03   98.98
           user  system elapsed
          65.55    0.00   99.30

        for /l %i in (1,1,4) do @start /b /affinity %i rscript -e"system.time(for(i in 1:1e4){fisher.test(matrix(c(200,300,400,500),nrow=2))})"
           user  system elapsed
          68.83    0.00   69.81
           user  system elapsed
          70.59    0.00   72.71
           user  system elapsed
          67.30    0.03  101.99
           user  system elapsed
          67.22    0.00  102.65
        1. For 1 copy, it maxes out the appropriate cpu for ~62 seconds, and the elapsed time closely reflects the cpu time used.
        2. For 2 copies, it almost maxes out the two cpus, but both processes show an ~5% 'concurrency overhead'.
        3. With 3 copies, again 2 cpus are maxed, but the third shows less than 50% duty until the first completes, at which point it also (almost) maxes.

          The cpu times of the 3 processes all show the ~5% concurrency overhead--probably indicative of some internal polling for a resource--but the worst elapsed times show much greater overhead--close to 60%.

        4. Once we get to 4 copies, the activity traces show 2 maxed and 2 well below 50% until one completes, at which point one of the other two picks up. And same again once the second completes.

          That pretty much nails it (for me) that there is some one-at-a-time resource internal to R that concurrent processes compete for. And the more there are competing, the greater the cost of that competition.

          All of which probably reflects R's long history and its gestation in the days before multi-tasking concurrent cpu-bound processes were a realistic option.

        Note: It could be that the shared resource is something simple like the configuration file or history file or similar; and that with the right command line options to disable the use of those--perhaps R's --vanilla switch--the overhead can be avoided. I haven't explored this. It might be better to ask the guys that know rather than testing random thoughts.
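        As a sanity check on the percentages quoted above, here is the same arithmetic applied to the timings copied from the console trace (worst-case overhead per run count, relative to the single-copy baseline):

```python
# (cpu, elapsed) times in seconds, copied from the console trace above.
baseline_cpu, baseline_elapsed = 62.24, 62.68
runs = {
    2: [(65.49, 65.60), (65.19, 66.13)],
    3: [(65.61, 65.94), (65.75, 98.98), (65.55, 99.30)],
    4: [(68.83, 69.81), (70.59, 72.71), (67.30, 101.99), (67.22, 102.65)],
}

overhead = {}  # copies -> (worst cpu overhead %, worst elapsed overhead %)
for n, times in runs.items():
    overhead[n] = (
        max(100 * (cpu / baseline_cpu - 1) for cpu, _ in times),
        max(100 * (wall / baseline_elapsed - 1) for _, wall in times),
    )
    print(f"{n} copies: cpu +{overhead[n][0]:.0f}%, elapsed +{overhead[n][1]:.0f}%")
```

        The cpu overheads stay small while the worst elapsed overheads climb steeply with the number of copies, which is the signature of competition for a one-at-a-time resource rather than for cpu.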

        Whilst looking around I did come across Rmpi, which might be a better approach to your problem. Though it appears to be aimed at spreading the load of single calculations over multiple cpus or boxes, rather than running multiple concurrent calculations. You'd need to read a lot more about it than I've bothered with :)
