BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:
This is not directly Perl related, but might become so.
I'm looking for examples and/or discussion--practical or theoretical--of the problems, and hopefully solutions, involved in threading inherently recursive algorithms.
Any and all leads will be gratefully received and vigorously pursued.
Thanks.
Re: [OT]: threading recursive subroutines.
by dHarry (Abbot) on Feb 02, 2011 at 11:22 UTC

I don't do much with threads; what I use is standard stuff in the JEE environment. Java handles things a bit differently, I think (making some sacrifices for the sake of simplicity). I recently stumbled upon Speculative parallelisation. Maybe it is of use to you.

Do you have a specific type of problem in mind? I myself am thinking about reducing a complex scheduling problem by spawning threads to optimize things locally, then later combining the results into a global solution. I have indications that I might get away with such an approach, though I haven't made much progress yet.

Cheers, Harry
by BrowserUk (Patriarch) on Feb 02, 2011 at 23:18 UTC
Thank you for the Speculative parallelisation link. It is exactly the type of response I was looking for. As the authors note in their abstract: "Prior research has focused mainly on parallelising loops. Recursive procedures, which are also frequently used in real-world applications, have attracted much less attention."

Their list of references may also lead to some further interesting stuff, though all too often I hit the road block of no longer having access to the ACM and IEEE repositories :( But just often enough, the authors of such papers make them available on their own websites for free, and those are much easier to find once you have titles and keywords to look for.

"Do you have a specific type of problem in mind?"

My sample case is the Ackermann–Péter variation of the Ackermann function, chosen specifically because it doesn't lend itself to trivial recursion elimination, but can be relatively easily parallelised manually. Other examples are any number of inherently recursive algorithms: binary search, tree manipulations, sorts et al.

I'm not really sure what my goal is. Having again just had inordinate difficulty trying to parallelise a recursive algorithm, and having recognised that I went through the same dead ends and misdirections as on previous occasions with different algorithms, I got to wondering whether anyone had done any research into the subject; my attempted searches turned up a whole lot of nothing. It feels like there should be a set of steps and constraints one could follow to approach the problem, but from my searching these do not yet exist. Which of course could mean that they cannot exist. But given the lack of research on the subject, it could also mean that they just haven't been looked for; or that they are not easy to find. So, at this point, I'm just looking, thank you :)

For example: it seems to me that for a recursive algorithm--one that is not tail recursive--to be a suitable candidate for parallelisation, it must make multiple--two or more--recursive calls at each level. Many do; this is the essential signature of a whole group of divide-and-conquer algorithms.

So, I'm not sure where I am going with this, but you have my profound thanks for helping me on my way :)

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: [OT]: threading recursive subroutines.
by ikegami (Patriarch) on Feb 02, 2011 at 19:58 UTC
Good question! I've been working on it to see what the real issue is. I don't have an answer for you; I'm just documenting what I've found for everyone's benefit. First, I presume that "inherently recursive" means "cannot be rewritten to be tail-recursive". Fibonacci is an example of an inherently recursive algorithm.
At face value, it's quite easy to make it threaded:
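The posted snippet isn't reproduced here; a minimal iThreads sketch of the idea (the naive spawn-per-call structure is purely illustrative):

    use strict;
    use warnings;
    use threads;

    sub fib {
        my ($n) = @_;
        return $n if $n < 2;
        # One branch recurses in a fresh thread, the other in this one.
        my $thr   = threads->create( \&fib, $n - 1 );
        my $other = fib( $n - 2 );
        return $thr->join() + $other;
    }

    print fib( 10 ), "\n";    # 55 -- but it spawns a thread per call!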
But what if you wanted to limit the number of threads (perhaps to limit overhead)? One solution would be to have a pool of threads and to use them if available.
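Again, the posted code isn't shown here; a sketch of the idea (the $MAX_THREADS limit, the shared counter and the MaybeFuture wrapper are illustrative assumptions, not the original):

    use strict;
    use warnings;
    use threads;
    use threads::shared;

    my $MAX_THREADS = 4;           # assumed pool limit
    my $in_flight : shared = 0;    # threads currently outstanding

    # Run the closure in a new thread if the pool isn't full, else inline.
    # Either way, return a "future" whose get() yields the result.
    sub async_maybe {
        my ($code) = @_;
        my $spawned = 0;
        {
            lock $in_flight;
            if ( $in_flight < $MAX_THREADS ) { ++$in_flight; $spawned = 1 }
        }
        if ($spawned) {
            my $thr = threads->create($code);
            return bless sub {
                my $result = $thr->join();
                { lock $in_flight; --$in_flight }
                return $result;
            }, 'MaybeFuture';
        }
        my $result = $code->();    # pool exhausted: evaluate right now
        return bless sub { $result }, 'MaybeFuture';
    }
    sub MaybeFuture::get { $_[0]->() }

    sub fib {
        my ($n) = @_;
        return $n if $n < 2;
        my $future = async_maybe( sub { fib( $n - 1 ) } );
        my $other  = fib( $n - 2 );
        return $future->get() + $other;
    }

    print fib( 15 ), "\n";    # 610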
(The above assumes closures can be shared, but it can be rewritten not to use closures.) When implementing async_maybe (and the get of the object it returns), one must be extra careful to avoid the situation where a thread is left waiting to have its result collected. But what if you want a worker model (perhaps to distribute the work to other machines)? Now, that's hard. One would need some kind of a callback system.
The TODO item is hard to do cleanly. And of course, I'm still using closures, even though I think that's not an option for threads.pm threads. Rewriting to avoid closures will likely make the code even longer.
by BrowserUk (Patriarch) on Feb 03, 2011 at 03:58 UTC
Whilst I think your discussion of the problems is correct, I think your choice of algorithm, and your code for demonstrating the problems, is problematic. The latter in part because you are using Perl, which I specifically excluded; in part because you then use Coro, which IMO clouds the real issues and adds more of its own.

Firstly, Fibonacci is a poor example because it is simply so easy to achieve better performance through mechanisms other than threading. As the only purpose of threading in this context is improving performance, using threads for Fibonacci is never going to work effectively. Whilst the recursive Fibonacci algorithm doesn't lend itself to tail call elimination, it is easily written iteratively for a huge performance gain. Equally, the recursive algorithm can be sped up immensely by simply memoising it. But more importantly, the iteration/recursion can be done away with entirely using Binet's formula:
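The original sat behind a spoiler that isn't captured above; presumably it was the closed form, something like this sketch:

    use strict;
    use warnings;

    # Binet's formula: fib(n) = ( phi**n - (1-phi)**n ) / sqrt(5)
    my $phi = ( 1 + sqrt 5 ) / 2;

    sub fib_binet {
        my ($n) = @_;
        return ( $phi**$n - ( 1 - $phi )**$n ) / sqrt 5;
    }

    printf "%.0f\n", fib_binet( 30 );    # 832040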
All of these modifications make it possible to calculate Fibonacci( 1474 ) (the limit of a double's ability to store the result) more quickly than it is possible to spawn a thread--at least using the current implementation of iThreads, and possibly Coro also. I realise that you chose it only as a well-known example on which to base your discussion. But if you are going to explore the threading of recursion in Perl, there are simpler ways than your implementations. This does the job:
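BrowserUk's code isn't reproduced here; a reconstruction in the same spirit (the sub names and the futures interface are assumptions, matching the module sketched below):

    use strict;
    use warnings;
    use futures;    # the module sketched below

    # Bottom level: plain serial recursion, no threads.
    sub fibC {
        my ($n) = @_;
        return $n < 2 ? $n : fibC( $n - 1 ) + fibC( $n - 2 );
    }

    # Middle level: farm one branch out to a thread via a future.
    sub fibB {
        my ($n) = @_;
        return $n if $n < 2;
        my $f = futures->new( \&fibC, $n - 1 );
        my $r = fibC( $n - 2 );
        return $f->get + $r;
    }

    # Top level: same again; at most four execution contexts in total.
    sub fibA {
        my ($n) = @_;
        return $n if $n < 2;
        my $f = futures->new( \&fibB, $n - 1 );
        my $r = fibB( $n - 2 );
        return $f->get + $r;
    }

    print fibA( 30 ), "\n";    # 832040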
Where the futures module is:
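Again a minimal stand-in rather than the original module: here a future is simply a spawned thread whose get() joins it.

    # futures.pm -- minimal thread-backed futures (illustrative sketch)
    package futures;

    use strict;
    use warnings;
    use threads;

    sub new {
        my ( $class, $code, @args ) = @_;
        my $thr = threads->create( $code, @args );
        return bless { thr => $thr }, $class;
    }

    sub get {
        my ($self) = @_;
        return $self->{thr}->join();
    }

    1;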
Using three subs serves to highlight the level dependency, but these could be merged into one sub with conditional statements, provided there were some way to determine the number of existing threads.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
by Corion (Patriarch) on Feb 03, 2011 at 09:43 UTC
One idea that I find interesting in practice (but not from an algorithmic point of view) is Grand Central Dispatch: basically the idea of having a worker pool of threads (likely about 2x cores) that gets managed by the OS. This approach takes the burden of thread administration away from the program and stuffs it into a centralized infrastructure that ensures no more than a set limit of threads is running. It seems that Grand Central Dispatch suffers from different-but-still-ugly locking problems, but that's to be expected.

I think the idea of "futures" has merit - AnyEvent uses something like them in its AnyEvent->condvar mechanism. AnyEvent has at least some way of detecting deadlocks, but as it is basically single-threaded, detecting deadlocks is easier there, I presume. Having transparent "futures" in Perl is feasible through trickery with transparent proxy objects, as you present, or as Win32::OLE and MozRepl::RemoteObject implement in a more general fashion. I think the performance hit you incur by heavily using tied variables will again negate many of the speed gains. I'm also not sure whether having transparent futures will actually help much, because you still need to make really sure that you don't immediately fetch the result after starting a calculation.

If there were a way to detect the dependency chain of futures ahead of time, implementing a thread pool together with the futures could limit the number of threads spawned by your implementation. But I'm not sure that's possible, as each future could spawn another future it depends on, and thus quickly overrun any fixed limit. Maybe using Coro together with threads could allow switching context without starting a fresh thread once we run out of CPU threads. But mixing Coro and threads is something that's very unlikely to work...

One of my "easier" thought experiments is the idea of switching Perl to asynchronous IO. I think this is somewhat easier to analyze, as IO iterators are used more often than threads, their use does not require deep understanding, and (dead-)locking operations are unnecessary. For example, a "lazy file" could work with your/a "future" implementation by pretending to have read a line from a file, until somebody looks at the value:
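The code that followed isn't captured; a minimal sketch of the idea using a tied scalar (Lazy::Line and the file name are made up for illustration):

    package Lazy::Line;

    # A scalar that performs the readline only when first FETCHed.
    sub TIESCALAR {
        my ( $class, $fh ) = @_;
        return bless { fh => $fh }, $class;
    }

    sub FETCH {
        my ($self) = @_;
        $self->{line} = readline $self->{fh}
            unless exists $self->{line};
        return $self->{line};
    }

    sub STORE { die "Lazy::Line is read-only\n" }

    package main;

    open my $fh, '<', 'some.log' or die $!;
    tie my $line, 'Lazy::Line', $fh;
    # ... unrelated work could overlap the (notional) read here ...
    print $line;    # the actual read happens only now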
Of course, most (existing) Perl code will simply not benefit from asynchronous IO, as most code reads a line from a file and immediately inspects the value. That simply negates any benefit we gain from freeing Perl from waiting for the line to be fetched from the file. Maybe we could gain something by requesting more than the one line (or more than the one buffer) to be read from the file, but that will likely get problematic if we try to use tell and other functions for inspecting files.

My conclusion is that implicit and transparent parallelism will likely simply not work, because existing procedural programs will not make use of it. So in any case, specialization from the straightforward approach towards a more parallel approach is necessary (for Perl programs). The likely "easy" solution is callback-oriented or continuation-passing models using closures, like AnyEvent or Node.js: there you start the asynchronous call and give it a callback to execute once the operation completes.
by BrowserUk (Patriarch) on Feb 03, 2011 at 10:30 UTC
by Corion (Patriarch) on Feb 03, 2011 at 11:43 UTC
by ikegami (Patriarch) on Feb 03, 2011 at 08:53 UTC
No, that does not limit the number of workers.
It doesn't seem to be a technique you can use for distributed work, which is where I thought you were going with this.
The point was to find a technique. Or are you saying the technique will never be as efficient as the alternatives? Anyway, I saw your post about the Ackermann function, and that's a whole different ball game. I spent a lot of time tinkering with it after you mentioned it. As far as I know, one can't make y = f(x); z = f(y); parallel without refactoring by a human, yet the Ackermann function is of that form. That said, it does have interesting traits that may make parallelisation possible (or impossible). But I'm in way over my head.
by BrowserUk (Patriarch) on Feb 03, 2011 at 09:56 UTC
by BrowserUk (Patriarch) on Feb 04, 2011 at 00:07 UTC
by ikegami (Patriarch) on Feb 02, 2011 at 21:21 UTC
Turned out that using closures, while providing a simple design, actually caused the problem. The design below still needs work because it passes closures around, but it works in concept, as shown by the Coro implementation below. (It probably defeats the point to use Coro, but it's easier to prototype with.)
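The implementation itself isn't reproduced here; in its spirit, a continuation-passing sketch where every recursive call is a Coro coroutine and results are delivered through callbacks (fib_cps and the channel plumbing are assumptions, not the posted code):

    use strict;
    use warnings;
    use Coro;
    use Coro::Channel;

    # Compute fib($n) and deliver the answer by calling continuation $k.
    sub fib_cps {
        my ( $n, $k ) = @_;
        return $k->($n) if $n < 2;
        my ( $a, $b );
        my $pending = 2;
        my $done = sub { $k->( $a + $b ) if --$pending == 0 };
        async { fib_cps( $n - 1, sub { $a = shift; $done->() } ) };
        async { fib_cps( $n - 2, sub { $b = shift; $done->() } ) };
    }

    my $result = Coro::Channel->new;
    fib_cps( 15, sub { $result->put(shift) } );
    print $result->get, "\n";    # blocks this coro until delivered: 610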
Re: [OT]: threading recursive subroutines.
by chrestomanci (Priest) on Feb 02, 2011 at 13:22 UTC
I know almost nothing about functional programming, but I think you would gain some insight from looking at how functional programming languages handle threads and recursion. Consider the canonical example of qsort in Haskell:
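That example is usually written as the canonical two-liner (quoted from memory, not from the node itself):

    qsort :: Ord a => [a] -> [a]
    qsort []     = []
    qsort (p:xs) = qsort [x | x <- xs, x < p] ++ [p] ++ qsort [x | x <- xs, x >= p]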
The way I read that code, the list gets split into those elements larger than the pivot and those smaller than the pivot. Each sublist is recursively sorted, and the Haskell runtime is free to do those two sorts in parallel.
by BrowserUk (Patriarch) on Feb 03, 2011 at 00:05 UTC
First, thanks for your considered response.

I think that this much-quoted example of Haskell is actually something of a trompe l'oeil. I'll try to explain. The reason for parallelising this would be to improve performance. But the reality is that the first and best way of improving sort performance in Haskell is not to use this cute and (still) hugely, visually impressive implementation. Whilst the concise clarity and relatively easy-to-understand semantics of this list-comprehension solution are an obvious advert for Haskell and a good teaching aid, it tricks the eye into believing that it is an efficient implementation. This is far from the case.

The problem is that, given the huge disparity in speed between on-chip and off-chip memory access, memory allocation and management costs are a significant factor in any cpu-intensive algorithm. And this implementation exacerbates those costs by not just duplicating the memory allocations at every level--creating two new lists from each previous list--but also having to concatenate those sublists back into (new) composite lists. See the Wikipedia discussion for more detail. If you read to the end, you'll see that Haskell has now moved to using merge sort, in part to counter this problem.

"the Haskell runtime is free to do those two sorts in parallel."

This is the crux of the threading-recursion problem. If the runtime decided to always perform the two sorts in parallel, it would end up spawning one thread for every item in the list to be sorted. That is obviously not a practical solution. Which means that to (automatically) parallelise the algorithm, the runtime would need to decide how many levels deep into the recursion it should allow the parallelisation to go before resorting to straightforward recursion, in order to bound that parallelisation.

Let's say the runtime discovered how many threads of execution are available and could rewrite the algorithm such that, once that limit has been reached, it resorts to simple recursion. The problem still remains that the used-only-once nature of Haskell variables is going to impose significant memory management costs upon the algorithm. In all probability, sufficient costs to negate most of the benefits of the parallelisation.

Given that a quick sort can be implemented to run in-place--albeit that Haskell requires the breaking of its own rules to do so--the benefits of doing that, and thus avoiding the memory management overheads, far outweigh the benefits of parallelising the list-comprehension algorithm. And once the quick sort is performed in-place, parallelising it is both easier and produces significantly better performance-improvement multiples.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: [OT]: threading recursive subroutines.
by locked_user sundialsvc4 (Abbot) on Feb 02, 2011 at 15:20 UTC
Threading buys you only two things: Recursion, on the other hand, is a property of an algorithm. Furthermore, it is significant to observe that recursion is not parallel: the recursive iteration must complete, and deliver its result, before the outer iteration(s) may proceed. Many algorithms that can be expressed in a recursive fashion can also be expressed in a massively parallel fashion ... but traditional threading models are not “massive parallelism,” a term usually reserved for things like array (co)processors.
by BrowserUk (Patriarch) on Feb 02, 2011 at 22:37 UTC
Once again, you offer a few authoritative-sounding 'wisdoms' in place of an answer. And as usual, on close inspection, they are not only of no value in terms of answering the question asked; they are in large part fundamentally incorrect.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
by locked_user sundialsvc4 (Abbot) on Feb 03, 2011 at 01:07 UTC
You really just don’t like me, do you? ... | |
by BrowserUk (Patriarch) on Feb 03, 2011 at 02:20 UTC
by Limbic~Region (Chancellor) on Feb 10, 2011 at 19:10 UTC
Re: Ackermann dependency graph
by ikegami (Patriarch) on Feb 03, 2011 at 22:54 UTC
I built a dependency graph for Ackermann. Don't know if it helps.
╔═════╦═════╦═════╦═════╗
║ m= ║ 1 ║ 2 ║ 3 ║
╠═════╬═════╬═════╬═════╣
║ n=0 ║ 2 ║ ║ ║
╚═════╬══v══╣ ║ ║
║ 3 > 3 ║ ║
╠══v══╬══v══╣ ║
║ 4 ║ ║ ║
╠══v══╣ ║ ║
║ 5 > 5 > 5 ║
╠══v══╬══v══╬══v══╣
║ 6 ║ ║ ║
╠══v══╣ ║ ║
║ 7 > 7 ║ ║
╠══v══╬══v══╣ ║
║ 8 ║ ║ ║
╠══v══╣ ║ ║
║ 9 > 9 ║ ║
╠══v══╬══v══╣ ║
║ 10 ║ ║ ║
╠══v══╣ ║ ║
║ 11 > 11 ║ ║
╠══v══╬══v══╣ ║
║ 12 ║ ║ ║
╠══v══╣ ║ ║
║ 13 > 13 > 13 ║
╠══v══╬══v══╬══v══╣
║ 14 ║ ║ ║
╠══v══╣ ║ ║
║ 15 > 15 ║ ║
╠══v══╬══v══╣ ║
║ 16 ║ ║ ║
╠══v══╣ ║ ║
║ 17 > 17 ║ ║
╠══v══╬══v══╣ ║
║ 18 ║ ║ ║
╠══v══╣ ║ ║
║ 19 > 19 ║ ║
╠══v══╬══v══╣ ║
║ 20 ║ ║ ║
╠══v══╣ ║ ║
║ 21 > 21 ║ ║
╠══v══╬══v══╣ ║
║ 22 ║ ║ ║
╠══v══╣ ║ ║
║ 23 > 23 ║ ║
╠══v══╬══v══╣ ║
║ 24 ║ ║ ║
╠══v══╣ ║ ║
║ 25 > 25 ║ ║
╠══v══╬══v══╣ ║
║ 26 ║ ║ ║
╠══v══╣ ║ ║
║ 27 > 27 ║ ║
╠══v══╬══v══╣ ║
║ 28 ║ ║ ║
╠══v══╣ ║ ║
║ 29 > 29 > 29 ║
╚═════╩═════╩═════╝
Each box along the vertical axis is one "n" higher, so A(3,2) = A(2,13) = A(1,27) = 29.
Re: Non-recursive Ackermann
by ikegami (Patriarch) on Feb 04, 2011 at 00:31 UTC
Using the dependency graph I posted earlier, I was able to construct a non-recursive implementation of the Ackermann function. It visits the graph left to right, top to bottom.
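The posted implementation isn't reproduced here. As a stand-in, here is one standard way to make Ackermann non-recursive, driving it from an explicit stack rather than the row-by-row graph walk described above (so it will not match the @n/@A code discussed further down):

    use strict;
    use warnings;

    # A(0,n) = n+1;  A(m,0) = A(m-1,1);  A(m,n) = A(m-1, A(m,n-1))
    sub ackermann {
        my ( $m, $n ) = @_;
        my @stack = ($m);
        while (@stack) {
            $m = pop @stack;
            if    ( $m == 0 ) { ++$n }                           # apply n+1
            elsif ( $n == 0 ) { push @stack, $m - 1; $n = 1 }    # A(m-1,1)
            else              { push @stack, $m - 1, $m; --$n }  # defer m-1
        }
        return $n;
    }

    print ackermann( 2, 3 ), "\n";    # 9
    print ackermann( 3, 3 ), "\n";    # 61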
Maybe you'll have better luck with that. Some properties:
m < 4 can be special-cased: known formulas exist for A(0,n), A(1,n), A(2,n) and A(3,n). Update: Cleaned up the code a bit.
by BrowserUk (Patriarch) on Feb 04, 2011 at 09:52 UTC
"I was able to construct a non-recursive implementation of the Ackermann function."

Impressive work!

"Maybe you'll have better luck with that."

I couldn't see much opportunity for parallelisation in your first version.

"Update: Cleaned up code a bit."

That's certainly easier to follow. (*)

The 'obvious' thing would be to async the inner for loop. Except that with maximum realistic values of m being 5, starting a thread for 5 iterations is nonsense. So then my attention moves to the outer loop, with the thought that you can often split the range of a loop into subranges and run them in parallel. But firstly, we don't know the range. Secondly, even iteratively, the dependency of each subsequent iteration upon the previous would require lock-stepping the threading--either through complex locking or queuing--and, given the minimal work done in each step of each iteration versus the cost of locking and context switching, that again makes it a non-starter.

Upshot: I think you called it earlier with your z = F(x); return F(z); comment, which went over my head at the time. In the light of that, I've modified my first tentative rule from a post above to be: for a recursive algorithm--that is not tail recursive--to be a suitable candidate for parallelisation, it must make multiple, independent, recursive calls at each level. The interesting thing now is: how many recursive algorithms fit that pattern?

"m < 4 can be special-cased: known formulas exist for A(0,n), A(1,n), A(2,n) and A(3,n)."

Indeed. The following returns all the tractable cases within the machine's capability in the blink of an eye:
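Along these lines (a sketch of those closed forms; the m >= 4 cut-off is where doubles give out anyway):

    use strict;
    use warnings;

    # Closed forms for the first four rows of the Ackermann function.
    sub A {
        my ( $m, $n ) = @_;
        return $n + 1            if $m == 0;
        return $n + 2            if $m == 1;
        return 2 * $n + 3        if $m == 2;
        return 2**( $n + 3 ) - 3 if $m == 3;
        die "no closed form used for m >= 4\n";
    }

    print join( ' ', map A( $_, 10 ), 0 .. 3 ), "\n";    # 11 12 23 8189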
But that now means I need to find a more suitable test case for my experiments. An in-place, parallelised qsort, maybe?

(*) Though I wonder why you do a loop initialisation rather than my @n = ( -1 ) x $m; my @A = ( 1 ) x $m; given that m will realistically max out at 5. And whilst I'm on unimportant, purely subjective matters, why for(;;) and not while( 1 )?

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
by ikegami (Patriarch) on Feb 04, 2011 at 16:24 UTC
[ Just a few rushed little comments ]

"Impressive work!"

Thanks!

I reached the same conclusions about the inner and outer loops.

That's why I used fib at first, and called Ackermann a whole other ball game.

"The interesting thing now is: how many recursive algorithms fit that pattern?"

Any divide-and-conquer algorithm, for starters. That's where the dependency graph came in: I wanted to see if there was a split in it that could be exploited.

"Though I wonder why you do a loop initialisation rather than my @n = ( -1 ) x $m; my @A = ( 1 ) x $m;"

I didn't know what I was going to end up with when I started; the code got refactored multiple times, and I didn't micro-optimise. (Upd: Put differently, the snippet is a theoretical implementation, not a practical one. Work still needs to be done to make it practical. Well, as practical as Ackermann can be.)

"why for(;;) and not while( 1 )?"

It's what the code I learned from used; I've never used while (1). I like for (;;) because I read it as "for ever". Trivia: since the entire bound check of while (CONSTANT) gets optimised away, and since "while" and the C-style "for" use the same opcodes, while (1) and for (;;) are practically identical internally.