in reply to Re^11: (Innuendo and guesswork)
in thread Using kernel-space threads with Perl
Finally! Some real data to work with.
Just so we can compare like with like, here are your data & tests run on my machine:
C:\test>perl -le"print 'a'x1024 for 1..1e5" > 1kLines.txt C:\test>perl -pe"s/a/A/g}{warn join' ',$., times" 1kLines.txt > output +.cmp 100000 23.805 0.374 0 0 at -e line 1, <> line 100000. C:\test>junk71 -T=1 1kLines.txt > output.cmp Started Thu Mar 24 03:33:09 2011 Ended Thu Mar 24 03:33:34 2011 27.097 0.873 0 0 at C:\test\junk71.pl line 38, <> line 100000. C:\test>junk71 -T=2 1kLines.txt > output.cmp Started Thu Mar 24 03:33:56 2011 Ended Thu Mar 24 03:34:09 2011 27.705 1.591 0 0 at C:\test\junk71.pl line 38, <> line 100000. C:\test>junk71 -T=3 1kLines.txt > output.cmp Started Thu Mar 24 03:34:19 2011 Ended Thu Mar 24 03:34:29 2011 28.111 1.544 0 0 at C:\test\junk71.pl line 38, <> line 100000. C:\test>junk71 -T=4 1kLines.txt > output.cmp Started Thu Mar 24 03:34:35 2011 Ended Thu Mar 24 03:34:43 2011 28.126 3.042 0 0 at C:\test\junk71.pl line 38, <> line 100000.
So, even with your suspiciously unrealistic choice of dataset, using 1 worker thread requires 15% more CPU, and with 4 workers, 26% more. (And that's the original point, but I'll come back to it.)
But replacing every character of every line is ... let's just say extremely unrealistic for now. So let's see what happens when we use an only slightly more realistic dataset with a 50% duty cycle:
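(The generator for that dataset isn't reproduced here, but it can be knocked up in the same style as the first one. Something along these lines -- filename mine -- gives 1K lines in which only every other character matches the pattern, i.e. a 50% duty cycle:)

C:\test>perl -le"print 'ab'x512 for 1..1e5" > 50pcLines.txt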
Now the differences are rather more obvious: ranging from 36% more CPU with 1 worker to 93% more with 4 workers.
And how about the length of those lines? I'd bet that the vast majority of the data stored in text files around the world has lines far shorter than 1K. Even really huge datasets with really huge 'records' tend to wrap them at some terminal-friendly limit, e.g. DNA in FASTA files.
So let's do something about that also. Cut the line length to a more reasonable 100 bytes and increase the number of lines to 1e6 so the overall dataset size stays roughly the same:
C:\test>perl -le"print 'a 'x50 for 1..1e6" > 0.1kLines.txt C:\test>perl -pe"s/a/A/g}{warn join' ',$., times" 0.1kLines.txt > outp +ut.cmp 1000000 13.353 0.202 0 0 at -e line 1, <> line 1000000. C:\test>junk71 -T=1 0.1kLines.txt > output.cmp Started Thu Mar 24 04:08:03 2011 1000000 31824.8361716389 at C:\test\junk71.pl line 30, <> line 1000000 +. Ended Thu Mar 24 04:08:39 2011 51.246 6.021 0 0 at C:\test\junk71.pl line 38, <> line 1000000. C:\test>junk71 -T=4 0.1kLines.txt > output.cmp Started Thu Mar 24 04:09:02 2011 1000000 14066.0824682062 at C:\test\junk71.pl line 30, <> line 1000000 +. Ended Thu Mar 24 04:10:14 2011 84.973 71.698 0 0 at C:\test\junk71.pl line 38, <> line 1000000.
So, now with a dataset somewhat more likely to reflect the norm, we've got 1 worker taking 422% and 4 workers taking 1155% of the one-liner's CPU time. That's 4 to 11 times as long. Still a far cry from "my 120x", I hear you cry, but bear with me.
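(Checking the arithmetic against the times() figures above: the one-liner used 13.353 + 0.202 ~= 13.6s of CPU; 1 worker used 51.246 + 6.021 ~= 57.3s, roughly 4.2x; 4 workers used 84.973 + 71.698 ~= 156.7s, roughly 11.6x.)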
When I came to appraise the possibilities I looked for an existing file of roughly the right size, and found one I'd generated for some similar purpose: 5GB of random 'phrases' in typical text-file-sized lines. Perfect:
C:\test>dir phrases.txt
28/12/2010  00:09    5,033,440,760 phrases.txt

C:\test>head phrases.txt
hexagonal monstrance fronded repand trouped modelers fragged foresighted nescient epilogue athwart venging brickyards
jeweled manometer telium telegony apparitors imperfects deontic impunity totemists lire summered nighthawks leucocytes
bronchodilators fissionable collapses luminances masquer notated giaour overhanded outtraveled moneys dourines
undauntedly cockaded subventions interrogatives ya prereform tattooing tablemate thunderclouds numbats preprogram
pulque palanquin hobbled locoing faced overshot nickeled minimum pronunciations pressingly eagling lovages fatalities
churnings barned treys bombproof refillable

C:\test>perl -pe"s/a/A/g}{warn join' ',$., times" phrases.txt > output.cmp

C:\test>perl -pe"s/a/A/g}{warn join' ',$., times" phrases.txt > nul
100000000 159.542 7.878 0 0 at -e line 1, <> line 100000000.

C:\test>junk71 -T=1 phrases.txt > output.cmp
Started Thu Mar 24 04:26:07 2011
1000000 28259.7636234697 at C:\test\junk71.pl line 30, <> line 1000000.
2000000 28300.5518302086 at C:\test\junk71.pl line 30, <> line 2000000.
3000000 28301.3527797113 at C:\test\junk71.pl line 30, <> line 3000000.
4000000 28299.3505386885 at C:\test\junk71.pl line 30, <> line 4000000.
5000000 28287.2627896293 at C:\test\junk71.pl line 30, <> line 5000000.
Terminating on signal SIGINT(2)

C:\test>junk71 -T=4 phrases.txt > output.cmp
Started Thu Mar 24 04:32:15 2011
1000000 14366.7839914658 at C:\test\junk71.pl line 30, <> line 1000000.
2000000 14388.0751686352 at C:\test\junk71.pl line 30, <> line 2000000.
3000000 14387.2471354457 at C:\test\junk71.pl line 30, <> line 3000000.
4000000 14366.3711996768 at C:\test\junk71.pl line 30, <> line 4000000.
5000000 14356.1826309399 at C:\test\junk71.pl line 30, <> line 5000000.
Terminating on signal SIGINT(2)
And so I saw the one-liner complete in (well) under 2 minutes. And from the very consistent lines/sec output, I projected (100e6 / 14387) = 6950 seconds, or 115 minutes, for the 4-worker version. If you (care to) remember, my original post contained no timing code, so the timing was done rather more crudely. Hence my factor-of-two exaggeration.
But, and here is the real point: never in any of these tests, yours or mine, your dataset or mine, has either of us found a threads & queues solution that achieves a performance gain. And anything less than faster is a failure; simply not worth the effort.
Not once. Not even with your highly (some might even say suspiciously) unrealistic dataset and duty cycle. Indeed, the more threads and queues you throw at the problem, the longer it damn well takes.
I'll let others decide for themselves whose test scenario was the more realistic, in the absence of any specifics about the OP's actual data and processing--which I asked for but never received.
Bottom line: a threads & queues solution will never solve the OP's problem. And I seriously doubt--though I have neither the expertise to know nor your code to try--that a forks & pipes solution will achieve much either, assuming realistic datasets and duty cycles.
And the whole point of my posting some timing info was to force you to try to counter it. So the next time you get tempted to answer the OP's question with an untried, mind's-eye solution:
Create the threads first and then have each thread load just the data it needs (and don't share it, of course). Then there won't be extra copies of that stuff created.
And only when pressed for more detail did you 'fine tune' that piece of off-the-cuff, based-on-nothing-but-what-you'd-read-somewhere nonsense with more guesswork:
Having the parent read in the data and hand off each piece to the appropriate thread(s) (I'm guessing via Thread::Queue might be a good way) is the most general method that springs to my mind. I'd probably do something similar except using processes and simple pipes, as I've often done.
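For concreteness--and this is purely my own sketch, no such code was ever posted in this thread--a "processes and simple pipes" hand-off of the sort being described would look something like this:

    #!/usr/bin/perl
    # A sketch (not code from this thread) of a "processes and simple
    # pipes" hand-off: the parent forks N children, keeps a pipe to
    # each, and deals input lines out to them round-robin; each child
    # applies s/a/A/g to the lines it receives and writes them to
    # stdout.  Output from different children will interleave.
    use strict;
    use warnings;
    use IO::Handle;

    my $N = 4;                         # number of child processes
    my ( @pipes, @kids );

    for ( 1 .. $N ) {
        pipe( my $read, my $write ) or die "pipe: $!";
        my $pid = fork;
        die "fork: $!" unless defined $pid;

        if ( $pid == 0 ) {             # child: transform whatever the parent sends
            close $write;
            while ( my $line = <$read> ) {
                $line =~ s/a/A/g;
                print $line;
            }
            exit 0;
        }

        close $read;                   # parent keeps only the write end
        $write->autoflush( 1 );
        push @pipes, $write;
        push @kids,  $pid;
    }

    # Parent: deal input lines out to the children round-robin.
    my $i = 0;
    while ( my $line = <> ) {
        print { $pipes[ $i++ % $N ] } $line;
    }

    close $_ for @pipes;               # EOF tells the children to finish
    waitpid $_, 0 for @kids;           # and reap them

Note that every line still has to be copied from the parent down a pipe to a child, so the per-line hand-off cost the timings above are paying for doesn't go away just because the queue becomes a pipe.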
If you have neither the experience to know, nor the time to test your theories a little first, maybe you'll think twice, and so avoid wasting the OP's time pursuing impossible 'solutions'.
That's me done. Last word is yours.