in reply to threading a perl script
Contrary to popular opinion, there is some scope for performance gains through threading with this pattern of application usage.
Whether those performance gains are realisable, or worth the effort of doing so, depends entirely upon what DO STUFF FOR EACH LINE actually consists of.
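By DO STUFF FOR EACH LINE I mean the body of the usual read loop from the original post, presumably something along these lines (a skeleton, not the original poster's code):

    # The pattern under discussion: a plain line-by-line read loop.
    my $file = '1GB.dat';                        # the test file used below
    open my $fh, '<', $file or die "$file: $!";
    while ( my $line = <$fh> ) {
        # DO STUFF FOR EACH LINE goes here
    }
    close $fh;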
For example, in the following, the one-liner simply reads a 1GB/16 million line file with no further processing. This forms a baseline for how fast you can get the data from disk into memory:
    C:\test>timeit \perl64\bin\perl.exe -nle1 1GB.dat
    Took: 9.687471720 seconds

    C:\test>timeit \perl64\bin\perl.exe -nle1 1GB.dat
    Took: 9.544258000 seconds

    C:\test>timeit \perl64\bin\perl.exe -nle1 1GB.dat
    Took: 9.708372520 seconds
As you can see, on my system that baseline is consistently a little under 10 seconds.
In the following, the same file is read, but this time performing about the simplest search and action possible:
    C:\test>timeit \perl64\bin\perl.exe -nle"/\*/ and ++$c }{ print $c" 1GB.dat
    10217571
    Took: 11.682412240 seconds

    C:\test>timeit \perl64\bin\perl.exe -nle"/\*/ and ++$c }{ print $c" 1GB.dat
    10217571
    Took: 11.904963960 seconds

    C:\test>timeit \perl64\bin\perl.exe -nle"/\*/ and ++$c }{ print $c" 1GB.dat
    10217571
    Took: 12.945696440 seconds

    C:\test>timeit \perl64\bin\perl.exe -nle"/\*/ and ++$c }{ print $c" 1GB.dat
    10217571
    Took: 12.128623440 seconds
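(In case the `}{` looks odd: -n wraps the code in an implicit `while (<>) { ... }` loop, so the stray `}{` closes that loop early and leaves the final print to run once after it. Written out as a script, the one-liner is roughly:)

    # Roughly what the one-liner above does, written out as a script:
    my $c = 0;
    while ( <> ) {          # -n supplies this loop
        chomp;              # -l chomps each input line
        ++$c if /\*/;       # count lines containing a literal '*'
    }
    print "$c\n";           # the part after }{ runs once, after the loop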
The search adds a consistent 2 to 2.5 seconds of processing time over the IO baseline. In theory, if you could split that processing across 2 cores, you could reduce the runtime by around a second.
And that is a conservative estimate, because most of the time attributed to IO is time spent waiting with an idle CPU. So, if you could use that idle CPU to process the current record whilst the system fetches the next, you could (theoretically) get the elapsed time back down close to the IO baseline and save the full 2 to 2.5 seconds.
That said, achieving that using Perl's current threading mechanisms (it is the state-sharing mechanisms that are the bottleneck) is very difficult, bordering on the impossible. And it is quite difficult to achieve in any language, not just Perl.
And to achieve it requires making a very simple program considerably more complicated. So you have to ask yourself: is an (at best) 2.5-second saving on 12.5 seconds worth that effort and complication? For the simple example above, almost certainly not. For a 20% saving to be of significant value, it would have to be a pretty big file.
But if the processing of each line is substantially more complex, enough to raise the CPU-to-IO ratio significantly, and the files are large enough to make overlapping those two components worth the effort involved, then there is some scope for doing that using threading.
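A minimal sketch of that overlap, using the core threads and Thread::Queue modules (the file name and the /\*/ count are just the example from above, not the original poster's task). Note that the per-line queueing shown here carries enough locking overhead that it may well eat the gain; in practice you would batch lines into chunks:

    use strict;
    use warnings;
    use threads;
    use Thread::Queue;

    my $Q = Thread::Queue->new();

    # Worker: counts lines containing a literal '*', like the one-liner above.
    my $worker = threads->create( sub {
        my $count = 0;
        while ( defined( my $line = $Q->dequeue() ) ) {
            ++$count if $line =~ /\*/;
        }
        return $count;
    } );

    # Main thread does the IO, so reading and matching can overlap.
    open my $fh, '<', '1GB.dat' or die "1GB.dat: $!";
    while ( my $line = <$fh> ) {
        $Q->enqueue( $line );
    }
    close $fh;

    $Q->enqueue( undef );            # signal end-of-data to the worker
    print $worker->join(), "\n";     # collect and print the count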
Mind you, it would be simpler still if Perl gave us access to the asynchronous/overlapped IO facilities available in most modern OSs, but that is a different discussion.