Re: Threads From Hell #2: How To Parse A Very Huge File

by BrowserUk (Patriarch)
on May 23, 2015 at 21:38 UTC ( [id://1127542] )


in reply to Threads From Hell #2: How To Search A Very Huge File [SOLVED]

i started to think about how to parse a very huge file using the multithreading capabilities of Perl....and improve the performance

You won't.

Your processing is entirely limited by how fast you can read the file from disk. And there is no way to make that faster using multitasking!

Whether real threads; or green threads; or processes; the limitation is the speed of the drive, not the single thread or process doing the reading.

As your figures from the cute but pointless MCE::Grep show, using multiple cores to issue the reads simply means that it takes longer than doing the same thing with a single thread/process. Nearly 10 times longer.
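
For concreteness, the single-threaded baseline being compared against is nothing more exotic than a plain read-and-match loop. A minimal sketch, with the filename and pattern standing in for whatever you're actually searching:

#!/usr/bin/perl
# Single-threaded baseline: one process reads and searches.
# Throughput is bounded by the drive, not the CPU.
use strict;
use warnings;

open my $fh, '<', 'big.csv' or die "big.csv: $!";
my $n = 0;
m[karl] and ++$n while <$fh>;
print "$n\n";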

Here are a few numbers:

wc -l does nothing but read lines and count them. I.e. the minimum of processing, so it reflects pure IO speed:

[21:22:49.82] C:\test>wc -l big.csv
167772159 big.csv

[21:24:39.11] C:\test>dir big.csv
07/02/2015  13:27    10,737,418,241 big.csv

1:49.29 for 10,737,418,241 bytes = 98,247,033 bytes/s. Let's (inaccurately) call that 98MB/s.
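
If you'd rather take the raw read rate from Perl than from wc, a block-read timing loop gives much the same number. A minimal sketch, assuming the same big.csv and an arbitrary 1MB block size:

#!/usr/bin/perl
# Time a raw sequential read of the file in 1MB blocks.
use strict;
use warnings;
use Time::HiRes qw( time );

open my $fh, '<:raw', 'big.csv' or die "big.csv: $!";
my( $buf, $bytes ) = ( '', 0 );
my $t = time;
while( my $read = sysread( $fh, $buf, 1024**2 ) ) {
    $bytes += $read;
}
$t = time() - $t;
printf "%d bytes in %.2fs = %.0f bytes/s\n", $bytes, $t, $bytes / $t;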

Now let's see how long (worst case: not found) it takes to search 98MB for a 4-char string:

$s = 'kar'; $s x= 32749011;;
print length $s;;
98247033

$t = time; $s =~ m[karl] and ++$n for 1 .. 1; printf "%.9f\n", time() - $t;;
0.192370176

$t = time; $s =~ m[karl] and ++$n for 1 .. 10; printf "%.9f\n", ( time() - $t ) / 10;;
0.1929563999

$t = time; $s =~ m[karl] and ++$n for 1 .. 100; printf "%.9f\n", ( time() - $t ) / 100;;
0.192800162
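
Those timings rely on a sub-second time(), presumably Time::HiRes. A standalone equivalent of the worst-case measurement, for anyone who wants to reproduce it:

#!/usr/bin/perl
# Worst case (not found): scan ~98MB for a 4-char string, 100 times.
use strict;
use warnings;
use Time::HiRes qw( time );

my $s = 'kar' x 32749011;    # 98,247,033 bytes; 'karl' never occurs
my $n = 0;
my $t = time;
$s =~ m[karl] and ++$n for 1 .. 100;
printf "%.9f s/search\n", ( time() - $t ) / 100;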

So, the IO rate is ~98MB/s, and the time taken to search 98MB is 0.2s. If you could spread the searching over 4 cores (without overhead) you could reduce the latter to 0.05s.

But you cannot reduce the IO time, so your best possible outcome would be 1.05s/98MB rather than 1.2s/98MB.

And if you incur any overhead at all, that 0.15 seconds saving just melts away. Incur a lot of overhead -- as MCE::Grep does -- and your overall time grows by a factor of 10.

There are two possibilities for speeding your problem up by a meaningful amount:

  • Make the IO faster:

    Possibilities include using an SSD, or spreading the file across multiple devices (via SAN/NAS etc.).

    This is the same file as above, but from an SSD rather than disk:

    [21:54:00.65] S:\>wc -l big.csv
    167772159 big.csv

    [21:55:19.75] S:\>

    So, 1:19.10 for 10,737,418,241 bytes = 135,744,857 bytes/s; call that 35% faster. A vast improvement over the 0.15s/98MB saving that multi-threading could offer; but hardly anything to write home about.

    Now the limitation is my SATA 2 interface card. A SATA 3 card would buy a little more, maybe a 50% gain in overall throughput, but again not a huge amount.

    The next possibility is to use PCIe-attached SSDs, which I can't demonstrate here, but which I've used to effect another doubling of throughput. I.e. circa 300MB/s. But that comes with all kinds of problems. First, you've got to get the data onto it, which means copying it from somewhere, usually a disk; and unless you are reusing the same data for many runs, that completely throws away any gains.

  • If you were actually parsing -- rather than just searching -- and the parsing was very complex -- think huge nested XML -- then you might be able to cut your overall processing time by overlapping the parsing with IO waits.

    That is, if you could arrange to issue the read of the next record before parsing the current one, then you can utilise the IO wait time to good effect (see the sketch after this list).

    With your example, the best that would achieve is folding the 0.2s search time into the 1s of IO time, thus saving a maximum of 0.2s/98MB overall.

    But the transfer of the data between the thread that reads it and the thread that processes it would have to be completely costless; and -- unfortunately and unnecessarily -- that is not the case for shared memory under threads::shared.
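
Here's that overlap as a minimal sketch, using threads and Thread::Queue; the filename and the body of the processing loop are placeholders:

#!/usr/bin/perl
# Overlap IO with processing: a reader thread pulls lines from disk
# while the main thread searches/parses the ones already read.
use strict;
use warnings;
use threads;
use Thread::Queue;

my $Q = Thread::Queue->new;

my $reader = threads->create( sub {
    open my $fh, '<', 'big.csv' or die "big.csv: $!";
    $Q->enqueue( $_ ) while <$fh>;   # hand each record to the parser
    $Q->enqueue( undef );            # signal end-of-file
} );

my $n = 0;
while( defined( my $rec = $Q->dequeue ) ) {
    ++$n if $rec =~ m[karl];         # stand-in for the real parsing
}
print "$n\n";
$reader->join;

Note that every enqueue/dequeue copies the record across the thread boundary; that copying is exactly the nonzero transfer cost just described, so the achievable gain tops out at the 0.2s/98MB of search time and evaporates as soon as the queueing costs more than that.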

The bottom line is that for your example problem of searching a single huge file for a simple string, threading (or any other form of multi-processing) simply has nothing to offer in terms of performance gain.

Change any one of the parameters of your task -- multiple data sources; more complex data processing requirements; the need to make multiple passes over the data -- and there's a chance that multi-threading can provide some gains.

But for the single-source, single-pass, simple-search application that you've described, there are no gains to be had from multi-tasking; regardless of whether you try kernel threads, green threads, or processes; regardless of what language you use; or what platform you run it on.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
