PerlMonks
Re: Threads From Hell #2: How To Parse A Very Huge File
by BrowserUk (Patriarch) on May 23, 2015 at 21:38 UTC ( [id://1127542] )
"i started to think about how to parse a very huge file using the multithreading capabilities of Perl ... and improve the performance"

You won't. Your processing is entirely limited by how fast you can read the file from disk, and there is no way to make that faster using multitasking. Whether real threads, green threads, or processes: the limitation is the speed of the drive, not the single thread or process doing the reading.

As your figures from the cute but pointless MCE::Grep show, using multiple cores to issue the reads simply means that it takes longer than doing the same thing with a single thread/process. Nearly 10 times longer.

Here are a few numbers. wc -l does nothing but read lines and count them, i.e. the minimum of processing, so it reflects pure IO speed:
1:49.29 for 10,737,418,241 = 98,247,033/s. Let's (inaccurately) call that 98MB/s.

Now let's see how long (worst case: not found) it takes to search 98MB for a 4-char string:
So, the IO rate is ~98MB/s, and the time taken to search 98MB is 0.2s. If you could spread the searching over 4 cores (without overhead) you could reduce the latter to 0.05s. But you cannot reduce the IO time, so your best possible outcome would be 1.05s/98MB rather than 1.2s/98MB. And if you incur any overhead at all, that 0.15s saving just melts away. Incur a lot of overhead -- as MCE::Grep does -- and your overall time grows by a factor of 10.

There are 2 possibilities to speed your problem up (by a meaningful amount):
The bottom line is that for your example problem of searching a single huge file for a simple string, threading (or any other form of multi-processing) simply has nothing to offer in terms of performance gain.

Change any one of the parameters of your task -- multiple data sources; more complex data processing requirements; the need to make multiple passes over the data -- and there's a chance that multi-threading can provide some gains. But for the single-source, single-pass, simple-search application that you've described, there are no gains to be had from multi-tasking, regardless of whether you try kernel threads, green threads, or processes; and regardless of what language you use or what platform you run it on.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
I'm with torvalds on this
In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
In Section: Seekers of Perl Wisdom