karlgoethebier has asked for the wisdom of the Perl Monks concerning the following question:
Hi all,
for learning purposes I started to think about how to parse/search a very huge file using the multithreading capabilities of Perl.
As I like trivial examples, I started out with something simple and first created a huge file:
karls-mac-mini:monks karl$ ls -hl very_huge.file
-rw-r--r-- 1 karl karl 2,0G 23 Mai 19:38 very_huge.file
karls-mac-mini:monks karl$ tail very_huge.file
Lorem ipsum kizuaheli
Lorem ipsum kizuaheli
Lorem ipsum kizuaheli
Lorem ipsum kizuaheli
Lorem ipsum kizuaheli
Lorem ipsum kizuaheli
Lorem ipsum kizuaheli
Lorem ipsum kizuaheli
Lorem ipsum kizuaheli
nose cuke karl
karls-mac-mini:monks karl$ wc -l very_huge.file
100000001 very_huge.file
By RTFM I figured out the following using MCE::Grep:
#!/usr/bin/env perl
use strict;
use warnings;
use MCE::Grep;
use Data::Dump;
use Time::HiRes qw (time);
MCE::Grep::init( { max_workers => 4 } );
my $start = time;
open( my $fh, '<', 'very_huge.file' );
my @result = mce_grep { /karl/ } $fh;
close $fh;
printf "Took %.3f seconds\n", time - $start;
dd \@result;
__END__
karls-mac-mini:monks karl$ ./huge.pl
Took 29.690 seconds
["nose cuke karl\n"]
Good old grep easily performs much better:
karls-mac-mini:monks karl$ time grep karl very_huge.file
nose cuke karl
real 0m2.563s
user 0m2.176s
sys 0m0.309s
I don't know if this trivial exercise is embarrassingly parallel, but I'm wondering how to:
- do this "by hand" (without using MCE::Grep; a naive sketch of what I mean follows after this list)
- ...and improve the performance
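To make clearer what I mean by "by hand", here is the kind of naive, untested sketch I have in mind: plain fork, the file split into byte ranges, each child seeking into its own range. The worker count and pattern are just placeholders, and output order across workers is not preserved.
#!/usr/bin/env perl
# Naive sketch only: fork N children, give each a byte range of the file,
# and let each child grep the lines in its own range.
use strict;
use warnings;
my $file    = 'very_huge.file';
my $workers = 4;
my $size    = -s $file;
my $chunk   = int( $size / $workers ) + 1;
my @pids;
for my $i ( 0 .. $workers - 1 ) {
    my $pid = fork() // die "fork failed: $!";
    if ( $pid == 0 ) {    # child
        open my $fh, '<', $file or die "open: $!";
        my $start = $i * $chunk;
        my $end   = $start + $chunk;
        seek $fh, $start, 0;
        <$fh> if $start > 0;    # drop the partial line; the previous worker owns it
        while ( my $line = <$fh> ) {
            print $line if $line =~ /karl/;
            last if tell($fh) > $end;    # stop once past the end of this range
        }
        exit 0;
    }
    push @pids, $pid;
}
waitpid $_, 0 for @pids;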
Thank you very much for any hint and best regards,
Update:
Edit: Struck out nonsense.
Ouch! Perhaps more RTFM would have helped:
PID    Process Name   User   % CPU   Physical Mem   Virtual Mem
1065   perl           karl   12,7    10,3 MB        2,33 GB
1068   perl           karl   83,7    3,9 MB         2,33 GB
1069   perl           karl   84,6    3,9 MB         2,33 GB
1070   perl           karl   83,5    3,9 MB         2,33 GB
1071   perl           karl   84,0    3,9 MB         2,33 GB
Edit 2: Renamed the thread
Update 3: Many thanks to marioroy and BrowserUk for their patience and their contributions to this interesting thread.
Karl
«The Crux of the Biscuit is the Apostrophe»
Re: Threads From Hell #2: How To Parse A Very Huge File
by BrowserUk (Patriarch) on May 23, 2015 at 21:38 UTC
I started to think about how to parse a very huge file using the multithreading capabilities of Perl ... and improve the performance
You won't.
Your processing is entirely limited by how fast you can read the file from disk. And there is no way to make that faster using multitasking!
Whether real threads; or green threads; or processes; the limitation is the speed of the drive, not the single thread or process doing the reading.
As your figures from the cute but pointless MCE::Grep show, using multiple cores to issue the reads simply means that it takes longer than doing the same thing with a single thread/process. Nearly 10 times longer.
Here are a few numbers:
wc -l does nothing but read lines and count them, i.e. the minimum of processing, so it reflects pure IO speed:
[21:22:49.82] C:\test>wc -l big.csv
167772159 big.csv
[21:24:39.11] C:\test>dir big.csv
07/02/2015 13:27 10,737,418,241 big.csv
1:49.29 for 10,737,418,241 bytes = 98,247,033 bytes/s. Let's (inaccurately) call that 98MB/s.
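If you want that same pure-IO baseline from inside Perl rather than from wc, a minimal read-and-count loop does it (a sketch only; whatever this takes on your drive is the floor that no amount of threading will lower):
use strict;
use warnings;
use Time::HiRes qw( time );

# Minimal sketch: sequential read, counting lines and nothing else.
my $start = time;
my $lines = 0;
open my $fh, '<', 'big.csv' or die "open: $!";
$lines++ while <$fh>;
close $fh;
printf "%d lines in %.3f seconds\n", $lines, time - $start;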
Now let's see how long (worst case: not found) it takes to search 98MB for a 4-char string:
$s = 'kar'; $s x= 32749011;;
print length $s;;
98247033
$t = time; $s =~ m[karl] and ++$n for 1 .. 1;   printf "%.9f\n", time() - $t;;
0.192370176
$t = time; $s =~ m[karl] and ++$n for 1 .. 10;  printf "%.9f\n", ( time() - $t ) / 10;;
0.1929563999
$t = time; $s =~ m[karl] and ++$n for 1 .. 100; printf "%.9f\n", ( time() - $t ) / 100;;
0.192800162
So, the IO rate is ~98MB/s, and the time taken to search 98MB is 0.2s. If you could spread the searching over 4 cores (without overhead) you could reduce the latter to 0.05s.
But you cannot reduce the IO time, so your best possible outcome would be 1.05s/98MB rather than 1.2s/98MB.
And if you incur any overhead at all, that 0.15 seconds saving just melts away. Incur a lot of overhead -- as MCE::Grep does -- and your overall time grows by a factor of 10.
There are 2 possibilities to speed your problem up (by a meaningful amount):
- Make the IO faster:
Possibilities include using an SSD; or spreading the file across multiple devices (per SAN/NAS etc.).
This is the same file as above, but from an SSD rather than disk:
[21:54:00.65] S:\>wc -l big.csv
167772159 big.csv
[21:55:19.75] S:\>
So, 1:19.10 for 10,737,418,241 bytes = 135,744,857 bytes/s; call that 35% faster. A vast improvement over the 0.15s/98MB that came from multi-threading; but hardly anything to write home about.
Now the limitation is my SATA 2 interface card. A SATA 3 card would buy a little more, maybe 50% overall throughput, but again not a huge amount.
The next possibility is to use PCIe attached SSDs, which I can't demonstrate, but which I've used to effect another doubling of throughput, i.e. circa 300MB/s. But it comes with all kinds of problems. First, you've got to get the data on to it, which means copying it from somewhere, usually a disk, which, unless you are reusing the same data for many runs, completely throws away any gains.
- If you were actually parsing -- rather than just searching -- and the parsing was very complex -- think huge nested XML -- then you might be able to cut your overall processing time by overlapping the parsing with IO waits.
That is, if you could arrange to issue the read of the next record before parsing the current one, then you can utilise the IO wait time to good effect (a rough sketch of that pattern follows after this list).
With your example, the best that would achieve, is folding the 0.2s search time into the 1s processing time, thus saving a maximum of 0.2s/98MB overall.
But the transfer of the data between the thread that reads it, and the thread that processes it, would have to be completely costless; and -- unfortunately and unnecessarily -- that is not the case for shared memory under threads::shared.
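A rough sketch of that read-ahead pattern, using threads and Thread::Queue (the filename and pattern just echo the example above; note that the per-record queue traffic is exactly the transfer cost I'm talking about, so this only pays off when the per-record processing is genuinely expensive):
use strict;
use warnings;
use threads;
use Thread::Queue;

my $q = Thread::Queue->new();

# Reader thread: does nothing but read and enqueue records,
# so the parser below can work while the next read is in flight.
my $reader = threads->create( sub {
    open my $fh, '<', 'very_huge.file' or die "open: $!";
    while ( my $line = <$fh> ) {
        $q->enqueue($line);
    }
    close $fh;
    $q->end();    # signal: no more records
} );

# Parser (main thread): dequeue returns undef once the queue is ended and drained.
while ( defined( my $record = $q->dequeue() ) ) {
    print $record if $record =~ /karl/;    # stands in for the real, expensive parsing
}
$reader->join();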
The bottom line is that for your example problem of searching a single huge file for a simple string, threading (or any other form of multi-processing) simply has nothing to offer in terms of performance gain.
Change any one of the parameters of your task -- multiple data sources; more complex data processing requirements; the need to make multiple passes over the data -- and there's a chance that multi-threading can provide some gains.
But for the single source, single pass, simple search application that you've described, there are no gains to be had from multi-tasking, regardless of whether you try kernel threads, green threads, or processes; and regardless of what language you use or what platform you run it on.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Re: Threads From Hell #2: How To Parse A Very Huge File
by marioroy (Parson) on May 24, 2015 at 03:48 UTC
The assumptions and testing are valid. However, MCE is quite fast at this. MCE has bin/mce_grep (a parallel wrapper for the grep binary) and examples/egrep.pl (100% Perl code). Both run faster than the grep command.
$ time grep karl very_huge.file
nose cuke karl
real 0m2.127s
user 0m1.845s
sys 0m0.283s
$ time ./MCE-1.608/bin/mce_grep karl very_huge.file
nose cuke karl
real 0m1.061s
user 0m2.176s
sys 0m1.616s
$ time ./MCE-1.608/examples/egrep.pl karl very_huge.file
nose cuke karl
real 0m0.690s
user 0m2.165s
sys 0m0.362s
MCE::Grep has an alternative mode: appending the "_f" suffix makes it read the file directly. That runs in 8.5 seconds; the remaining overhead comes from calling the code block once per line. Thus, use the egrep.pl residing inside the examples directory instead.
# open( my $fh, '<', 'very_huge.file' );
# my @result = mce_grep { /karl/ } $fh;
# close $fh;
my @result = mce_grep_f { /karl/ } 'very_huge.file';
The following code snippet parses the 2 GiB file in 1 second.
use MCE::Loop;
use Time::HiRes qw( time );

MCE::Loop::init( { max_workers => 4, use_slurpio => 1 } );

my $start   = time;
my $pattern = 'karl';

my @result = mce_loop_f {
    my ( $mce, $slurp_ref, $chunk_id ) = @_;

    ## Quickly determine if a match is found.
    ## Basically, only process the slurped chunk if true.
    if ( $$slurp_ref =~ /$pattern/im ) {
        my @matches;
        open my $MEM_FH, '<', $slurp_ref;
        binmode $MEM_FH, ':raw';
        while (<$MEM_FH>) { push @matches, $_ if /$pattern/; }
        close $MEM_FH;
        MCE->gather(@matches);
    }
} 'very_huge.file';

print join('', @result);
printf "Took %.3f seconds\n", time - $start;
$ time grep karl very_huge.file
nose cuke karl
real: 2.127s == user: 1.845s + sys: 0.283s ## CLOSE ENOUGH!
$ time ./MCE-1.608/bin/mce_grep karl very_huge.file
nose cuke karl
real: 1.061s != user: 2.176s + sys: 1.616s  ## NOT EVEN CLOSE to: 3.792s
$ time ./MCE-1.608/examples/egrep.pl karl very_huge.file
nose cuke karl
real: 0.690s != user: 2.165s + sys: 0.362s  ## NOR IS THIS CLOSE to: 2.527s
Looks dodgy to me.
And then the claim that: The following code snippet parses the 2 GiB file in 1 second.
Let's examine that. The code slurps the entire 2GiB file into memory and then starts 4 workers that each get a reference to (1/4 of???) the slurped data, and then:
- Run a regex against their chunk of the file to see if it contains the sought string;
- And if it does, open their referenced chunk of the file as a memory file, and then re-run the regex on a line-by-line basis in order to get the whole lines that contain the sought string;
- And then they return those lines to the parent process (via a pipe??) for it to print out.
So the regex is run against 1.25 times as much data as is contained in the file, and takes "less than one second"; which is less than the 2.127 seconds the real grep takes, despite grep only processing the file's data once?
Have you heard the song: Funny numbers?
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Re: Threads From Hell #2: How To Parse A Very Huge File
by marioroy (Parson) on May 24, 2015 at 04:09 UTC
## wc -l very_huge.file
$ time wc -l very_huge.file
100000001 very_huge.file
real 0m0.781s
user 0m0.521s
sys 0m0.260s
$ time ./MCE-1.608/examples/wc.pl -l very_huge.file
100000001 very_huge.file
real 0m0.612s
user 0m0.633s
sys 0m1.522s
## wc very_huge.file
$ time wc very_huge.file
100000001 300000003 2200000015 very_huge.file
real 0m21.621s
user 0m21.341s
sys 0m0.289s
$ time ./MCE-1.608/examples/wc.pl very_huge.file
100000001 300000003 2200000015 very_huge.file
real 0m6.210s
user 0m23.639s
sys 0m1.034s
## LANG=C wc very_huge.file
$ time LANG=C wc very_huge.file
100000001 300000003 2200000015 very_huge.file
real 0m6.860s
user 0m6.519s
sys 0m0.344s
$ time LANG=C ./MCE-1.608/examples/wc.pl very_huge.file
100000001 300000003 2200000015 very_huge.file
real 0m1.973s
user 0m6.832s
sys 0m0.904s
$ time wc -l very_huge.file
100000001 very_huge.file
real 0m0.781s
user 0m0.521s
sys 0m0.260s
So, user code time: 0.521 + system code time: 0.260 = real time: 0.781. Makes sense!
But then:
$ time ./MCE-1.608/examples/wc.pl -l very_huge.file
100000001 very_huge.file
real 0m0.612s
user 0m0.633s
sys 0m1.522s
User code time: 0.633 + system code time: 1.522 = 2.155s != real time: 0.612s. That doesn't!
The other examples seem to indicate -- although the "timing information" is equally futzed -- that if you force the real wc to run more slowly, then your fake wc slows down by less?
Is the lesson of your post that if you force wc to count characters -- which you can determine instantly from ls/dir -- and words -- which nobody is ever interested in -- whilst also forcing it to use an 'unnatural' collating sequence that is equally of no interest, then your MCE::Grep can do what nobody wants done less slowly than the system-provided binary does what nobody wants done?
Great advert.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"...User code time... + system code time... = real time..."
Yes:
From the docs:
"The time utility executes and times the specified utility. After the
utility finishes, time writes to the standard error stream, (in seconds):
the total time elapsed, the time used to execute the utility process and
the time consumed by system overhead."
Some observations:
karls-mac-mini:monks karl$ ls -hl very_huge10GB.file
-rw-r--r-- 1 karl karl 10G 25 Mai 00:53 very_huge10GB.file
karls-mac-mini:monks karl$ time grep karl very_huge10GB.file
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
real 2m42.126s
user 0m20.437s
sys 0m5.645s
karls-mac-mini:monks karl$ ./mce_loop.pl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
Took 150.555 seconds
#!/usr/bin/env perl
use Time::HiRes qw( time );
use feature qw(say);
my $start = time;
say qx (grep karl very_huge10GB.file);
printf "Took %.3f seconds\n", time - $start;
__END__
karls-mac-mini:monks karl$ ./wrap.pl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
Took 157.265 seconds
For the grep example, 60+60+42 = 162s, which is 2m42s. But user+sys (20+5) is 0m25s. What do I miss?
Perhaps it's too late tonight. Or too early in the morning?
Best regards, Karl
«The Crux of the Biscuit is the Apostrophe»
MCE::Grep is not the tool for this: calling a code block once per line is the reason for the overhead. The testing was done on a CentOS VM. However, the MCE::Loop example showed a way that allows Perl to run faster.
Surely, one is not likely to force a character count. I was mainly comparing the wc binary against the Perl script.
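A minimal illustration of that per-call overhead, independent of MCE (a sketch only; the actual numbers depend on the machine): invoking a code ref once per line costs measurably more than scanning the same lines inline, which is why processing whole slurped chunks wins.
use strict;
use warnings;
use Time::HiRes qw( time );

# One million sample lines, none of which match, so we time pure scanning.
my @lines = ("Lorem ipsum kizuaheli\n") x 1_000_000;
my $code  = sub { $_[0] =~ /karl/ };

my $t = time;
my $n = 0;
for my $line (@lines) { $n++ if $code->($line) }    # one sub call per line
printf "per-line code block: %.3fs\n", time - $t;

$t = time;
$n = 0;
for my $line (@lines) { $n++ if $line =~ /karl/ }   # same match, inlined
printf "inline match:        %.3fs\n", time - $t;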
Re: Threads From Hell #2: How To Parse A Very Huge File
by BrowserUk (Patriarch) on May 23, 2015 at 23:22 UTC
"...explain what that means..."
I'll try it: A disastrous way to express my surprise to see five processes with one thread each instead of one process with five threads?
Please don't worry too much about this nonsense, I'll strike it out.
Best regards, Karl
«The Crux of the Biscuit is the Apostrophe»