Having the parent read in the data and hand off each piece to the appropriate thread(s) (I'm guessing via Thread::Queue might be a good way) is the most general method that springs to my mind.
Something like this maybe?:
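(The code block that originally followed here did not survive; what follows is a reconstruction of the *pattern* described above, not the author's actual script. The worker count, sample data, and the shared `@out` array standing in for real output are all invented for illustration.)

```perl
use strict;
use warnings;
use threads;
use threads::shared;
use Thread::Queue;

my $THREADS = 4;                         # invented worker count
my $Q   = Thread::Queue->new;
my @out :shared;                         # stand-in for the real output

# Worker threads dequeue records until they see an undef terminator.
my @workers = map {
    threads->create( sub {
        while ( defined( my $line = $Q->dequeue ) ) {
            $line =~ s[a][A]g;           # the trivial "processing"
            push @out, $line;
        }
    } );
} 1 .. $THREADS;

# The parent reads the data and hands each record off to the queue.
my $data = join '', map "line $_: a quick brown fox\n", 1 .. 100;
open my $in, '<', \$data or die $!;
$Q->enqueue( $_ ) while <$in>;
$Q->enqueue( ( undef ) x $THREADS );     # one terminator per worker
$_->join for @workers;
```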
Problem: That will take ~4 hours to process a 3.5 GB file. And that's with the output redirected to nul so there is no competition for the disk head.
I'd probably do something similar except using processes and simple pipes, as I've often done.
So something like this, but using processes instead of threads perhaps?:
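(Again, the original code block was lost; this is a hedged sketch of the fork-and-pipes variant being described, not the author's script. `$WORKERS`, the temp-file output scheme, and the sample data are invented.)

```perl
use strict;
use warnings;
use File::Temp qw( tempdir );

my $WORKERS = 4;                          # invented worker count
my $dir     = tempdir( CLEANUP => 1 );    # invented: per-child output files
my ( @pids, @pipes );

for my $w ( 1 .. $WORKERS ) {
    pipe( my $rdr, my $wtr ) or die "pipe: $!";
    my $pid = fork;
    die "fork: $!" unless defined $pid;
    if ( !$pid ) {                        # child: read, process, write out
        close $wtr;
        close $_ for @pipes;              # drop inherited write ends
        open my $out, '>', "$dir/out.$w" or die $!;
        while ( my $line = <$rdr> ) {
            $line =~ s[a][A]g;            # the trivial "processing"
            print {$out} $line;
        }
        exit 0;
    }
    close $rdr;
    push @pids,  $pid;
    push @pipes, $wtr;
}

# The parent deals records out round-robin (an in-memory sample here).
my $data = join '', map "line $_: a quick brown fox\n", 1 .. 100;
open my $in, '<', \$data or die $!;
my $i = 0;
while ( my $line = <$in> ) {
    print { $pipes[ $i++ % $WORKERS ] } $line;
}
close $_ for @pipes;                      # EOF lets the children finish
waitpid $_, 0 for @pids;
```

Closing the parent's copies of the write ends is what delivers EOF to the children; forgetting the inherited copies in later children is a classic way to hang this pattern.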
This fares better, taking only ~15 minutes to process the same 3.5 GB.
But all that effort is for naught as this:
C:\test>perl -pe"s[a][A]" phrases.txt >nul
processes the same 3.5GB in exactly the same way, but in less than 2 minutes.
Now the "processing" in all these examples is pretty light, just a single pass over each record, but it serves to highlight the scale of the overhead involved in distributing the records, and the scale of processing that would be needed to make either of these distribution mechanisms viable.
The pipes mechanism is more efficient than the shared queues, and should be pretty much the same between processes as it is between threads, so I doubt there is much to be gained by going that route.
Maybe you have some mechanism in mind that will radically alter the overheads equation, but there is an awful lot of in-the-mind's-eye-expertise around here, and having spent a long time trying to find a good way to do this very thing, I'd really like to be educated by someone who's actually made it work.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
I'm not sure where to start. Your numbers seem ridiculous. Part of the reason is that your subconscious appears to be making typos to shift the numbers dramatically. Another part of the reason may be that your system is fubar somehow (or perhaps having more than 2 cores makes things run much, much slower, which I've certainly seen happen, at least in my mind's eye).
The claim that a trivial addition of Thread::Queue causes rather trivial operations to run 120x slower should certainly motivate some investigation. My investigation shows that the penalty is very close to 2x slower (rather, 2x more CPU, which translates to 'takes about the same elapsed time'), which seems reasonable based on my in-the-mind's-eye expertise.
And that's with the output redirected to nul so there is no competition for the disk head.
That seems silly to me. But, more importantly, if you'd not done that, you could've checked your work and found several mistakes, at least one of which likely would drastically change your numbers.
In any case, I didn't redirect to nul so there was all that competition for the disk head (at least in your mind's eye) and yet I see about 2x slow-down not 120x slow-down (or even the 16x..32x slow-down you'll likely see after you fix your stuff).
I don't have time to scale up to 3.5GB which means that one explanation for the off-by-a-factor-of-8x..16x discrepancy might be some extreme non-linear performance of Thread::Queue (in a case where the performance shouldn't be non-linear).
My testing does show some non-linear slow-down of Thread::Queue, which can easily be at least partially overcome in this particular case. But the trend doesn't look even close to producing the 16x over-all slow-down at 3.5GB (assuming the 'pipe' version actually runs at about the same speed as the corrected trivial 'perl -pe' version, which seems likely, though I didn't look at the 'pipe' version at all and still spent more time on this than I really should have right now).
So, if I drop the 'g' from 's/a/A/g' like you did, it makes the 'perl -pe' version run about 10x faster for me. If you got the same speed-up, that'd mean it really runs at roughly the same speed as the 'pipe' version (which means you probably used shorter lines than me and got closer to a 5x speed-up).
I doubt I can replicate your problem with Thread::Queue (based on quick tests on three systems) so you should probably investigate that some more.
The non-linear slow-down of Thread::Queue is interesting and might be worth investigating. It is troubling that throwing in a bunch of 'restarts' via exec( $^X, $0 ) actually makes this code run faster (when given huge input).
I wish I had 4 cores to play with. The Thread::Queue solutions take slightly longer than the trivial one-liner by using two threads and slightly more than twice as much CPU. But that is also what happens with:
perl -pe's/a/A/g' 1kLines1.txt > output1.tmp &
perl -pe's/a/A/g' 1kLines2.txt > output2.tmp; wait
(yes, even if I redirect both to /dev/null) so using both cores has no benefit at all for this example. If I had 4 cores, I could explore that in more interesting ways.
But the level of processing here is rather trivial which seems likely to be very much unlike the original problem so most of this likely has little bearing on it (based on my saner numbers).
checked your work and found several mistakes, at least one of which likely would drastically change your numbers.
Innuendo. You suggest "mistakes", but typically don't identify them so they cannot be countered.
There is one trivial mistake, -l, that has exactly NO effect on the performance.
My investigation shows ...
Typically, no code for others to try; no indication of your testing methodology; nothing but your words in support of your claims.
If you never post anything but 'authoritative testimony', you cannot be countered. Way to go with maintaining your mystique.
The non-linear slow-down of Thread::Queue is interesting and might be worth investigating.... I wish I had 4 cores to play with.
And right there in a nutshell is the endemic problem: Guesswork. Of course there is a non-linear slowdown.
But until you've used threads & queues in earnest a few times, the reason probably won't be obvious to you. Namely: lock-contention & context thrash.
Simplistically, the lock contention arises when one end of a given queue attempts to use it, when it is currently in use by the other end. So, when one of the worker threads is running on its "own" core reading from the queue, that queue is almost permanently locked. It only becomes unlocked for a brief period at the end of each loop cycle.
That means that when the file reading thread comes to write a record to that queue, it will, with a high probability, find that the queue is locked, and so enter a wait-state. That wait-state inevitably means that some other thread (or process) will inherit the core it was using and so a context switch occurs. The reader will not get another timeslice until at least that thread relinquishes it. And, it is entirely possible that it won't get a look in until several other threads have had their turn. And when it does, it may well find the queue still blocked and the cycle repeats.
Now, it should be obvious to even you how those delays could add up to produce the slow downs I indicated. And why all current parallelisation research is looking into lock-free data-structures and mechanisms.
To demonstrate, using this trivially modified form of the code above:
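(The modified script, junk71.pl, was not preserved in this copy of the thread. The following is a guess at its shape, reconstructed only from the runs shown below: a `-T=n` switch, one Thread::Queue per worker thread, records dealt out round-robin, and a final `warn` of the line count and lines/second. Every identifier here is an assumption.)

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;
use Time::HiRes qw( time );

# -T=n on the command line selects the number of worker threads/queues.
my $T = 4;
$T = $1 if @ARGV and $ARGV[0] =~ /^-T=(\d+)$/ and shift @ARGV;

print 'Started ', scalar localtime, "\n";

my @Qs = map Thread::Queue->new, 1 .. $T;     # one queue per worker
my @workers = map {
    my $Q = $Qs[ $_ - 1 ];
    threads->create( sub {
        while ( defined( my $line = $Q->dequeue ) ) {
            $line =~ s[a][A]g;
            print $line;
        }
    } );
} 1 .. $T;

my $start = time;
my $i = 0;
$Qs[ $i++ % $T ]->enqueue( $_ ) while <>;     # deal records round-robin
$_->enqueue( undef ) for @Qs;
$_->join for @workers;
warn "$. ", $. / ( time - $start ) if $.;     # lines read and lines/sec
print 'Ended ', scalar localtime, "\n";
```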
and a 1 million line file produced for the purpose:
perl -e"printf qq[line %6d:the quick brown fox jumps over the lazy dog\n], $_ for 1 .. 1e6" >phrases.small
C:\test>dir phrases.small
...
23/03/2011 03:19 57,000,000 phrases.small
...
Now the runs using 1, 2, 3, & 4 threads and queues:
C:\test>junk71 -T=1 phrases.small >out.txt
Started Wed Mar 23 03:19:58 2011
1000000 19484.8213140505 at C:\test\junk71.pl line 30, <> line 1000000.
Ended Wed Mar 23 03:20:51 2011
C:\test>junk71 -T=2 phrases.small >out.txt
Started Wed Mar 23 03:20:59 2011
1000000 18001.4760827208 at C:\test\junk71.pl line 30, <> line 1000000.
Ended Wed Mar 23 03:21:55 2011
C:\test>junk71 -T=3 phrases.small >out.txt
Started Wed Mar 23 03:27:32 2011
1000000 15906.3434420465 at C:\test\junk71.pl line 30, <> line 1000000.
Ended Wed Mar 23 03:28:35 2011
C:\test>junk71 -T=4 phrases.small >out.txt
Started Wed Mar 23 03:30:32 2011
1000000 15088.645826608 at C:\test\junk71.pl line 30, <> line 1000000.
Ended Wed Mar 23 03:31:38 2011
And now for a single-core, sequential run:
perl -MTime::HiRes=time -E"BEGIN{warn $t=time}" -pe"s[a][A]g; }{ warn $./(time-$t); warn time" phrases.small >out.txt
1300851145.745 at -e line 1.
977517.104726817 at -e line 2, <> line 1000000.
1300851146.77491 at -e line 2, <> line 1000000.
And now a little math.
The best of these is 977517.104726817 / 19484.8213140505 = 50.17, i.e. 50 times slower.
The worst of these is 977517.104726817 / 15088.645826608 = 64.78, i.e. 64 times slower.
The difference between 64x and my original figures, from which you derived the x120, is that these estimates are based on elapsed time, whereas my original figures were based upon actual CPU usage.
To counter your scurrilous "fubar"'d-system possibility, I've had the same tests performed (albeit with a different randomly generated 1 million line dataset) on dual-core systems running both 32-bit Perl on Windows and Linux, and those results are broadly in line with my results above. My friend & I hope to repeat our provisional tests later this week using exactly the same datasets and a wider range of systems and setups, but based on our results to date, there is nothing to support your claimed "research".
As I have no idea what your "research" consisted of, because of course you chose not to tell us, I cannot speculate about what mistakes or mis-assumptions you may have made whilst doing it; which is probably why you didn't tell us.
But, as there is little chance that three entirely different Perl installations, spanning compilers, OSs, hardware and continents, are "fubar"'d in ways that make them exhibit strikingly similar relative performance, I'm confident in stating that your "research" is more than just a little suspect, and your pronouncements based upon it little more than guesswork and innuendo.
So, the next time you allude to "research" in order to support your mind's eye expertise, try making it something worthy of that epithet.
"I didn't see any mentions of files so I wasn't going to jump to the conclusion that "huge dataset" means some single huge file or even any files at all."
Sorry, the data is in a single tab-delimited text file. I'm doing something similar to a correlation coefficient on terms assigned to genes in a large spreadsheet to generate similarity scores. For this, I'm using a kappa statistic, which involves pairwise comparisons between all genes and all annotations; it is a very large matrix.
Thanks for the suggestions though. I'll try implementing them and get back to the thread.
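(For concreteness, the pairwise Cohen's-kappa computation described above might be sketched as follows. The gene names, annotation vectors, and `kappa` helper are all invented sample material, not the poster's actual code; kappa here is (p_o - p_e)/(1 - p_e), observed versus chance agreement of two binary annotation vectors.)

```perl
use strict;
use warnings;

sub kappa {
    my ( $x, $y ) = @_;                 # two equal-length 0/1 vectors
    my $n = @$x;
    my ( $both, $neither, $x_only, $y_only ) = (0) x 4;
    for my $i ( 0 .. $n - 1 ) {
        if    ( $x->[$i] && $y->[$i] )   { $both++ }
        elsif ( !$x->[$i] && !$y->[$i] ) { $neither++ }
        elsif ( $x->[$i] )               { $x_only++ }
        else                             { $y_only++ }
    }
    my $po = ( $both + $neither ) / $n;               # observed agreement
    my $pe = ( ( $both + $x_only ) * ( $both + $y_only )
             + ( $neither + $y_only ) * ( $neither + $x_only ) ) / $n**2;
    return 1 if $pe == 1;               # degenerate: agreement is forced
    return ( $po - $pe ) / ( 1 - $pe );
}

# Invented sample: each gene's binary vector of term annotations.
my %gene = (
    geneA => [ 1, 1, 0, 1, 0, 0 ],
    geneB => [ 1, 0, 0, 1, 0, 1 ],
    geneC => [ 0, 1, 1, 0, 1, 0 ],
);

# All unordered gene pairs: the O(n^2) sweep that makes the matrix big.
my @names = sort keys %gene;
for my $i ( 0 .. $#names - 1 ) {
    for my $j ( $i + 1 .. $#names ) {
        printf "%s vs %s: kappa = %.3f\n", $names[$i], $names[$j],
            kappa( @gene{ @names[ $i, $j ] } );
    }
}
```

With the sample data above, geneA vs geneB comes out at about 0.333; it is the quadratic number of such pairs, not the per-pair cost, that dominates at scale.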