Re^7: Threads From Hell #2: How To Parse A Very Huge File

The OP seemed interested in whether parallelism is possible for such a task. Please disregard my posts if I have misunderstood. In the spirit of parallelism, I tested a 20 GiB file under the host OS (a laptop with 16 GiB of RAM), comparing the grep command, bin/mce_grep, examples/egrep.pl, and the script using MCE::Loop.

Recap: bin/mce_grep is a parallel wrapper for the grep command; examples/egrep.pl is 100% Perl code.

I am getting the impression that you're not liking MCE. If that is the case, then I should refrain from posting here. Have you not tried MCE against your 10 GiB file; e.g. bin/mce_grep or examples/egrep.pl?

$ ls -lh very_huge.file
-rw-r--r--  1 mario  staff    20G May 24 14:53 very_huge.file

## grep command

$ time grep karl very_huge.file
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl

real    6m47.048s    ( 407 seconds )
user    6m42.372s
sys     0m 4.669s

## bin/mce_grep

$ time ./MCE-1.608/bin/mce_grep karl very_huge.file
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl

real    2m17.003s    ( 137 seconds )
user   17m 9.223s
sys     0m33.223s

## examples/egrep.pl

$ time ./MCE-1.608/examples/egrep.pl karl very_huge.file
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl

real    0m26.447s
user    0m22.527s
sys     0m 8.459s

## MCE::Loop script

$ time ./mce_loop_script.pl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
Took 25.650 seconds

real    0m25.764s
user    0m42.494s
sys     0m 7.264s

Below, the script using MCE::Loop.

use strict;
use warnings;

use MCE::Loop;
use Time::HiRes qw( time );

MCE::Loop::init( { max_workers => 4, use_slurpio => 1 } );

my $start = time;
my $pattern = 'karl';

my @result = mce_loop_f {
   my ($mce, $slurp_ref, $chunk_id) = @_;

   ## Quickly determine if a match is found.
   ## Basically, only process slurped chunk if true.

   if ($$slurp_ref =~ /$pattern/im) {
      my @matches;

      open my $MEM_FH, '<', $slurp_ref or die "Cannot open in-memory filehandle: $!";
      binmode $MEM_FH, ':raw';
      while (<$MEM_FH>) { push @matches, $_ if (/$pattern/); }
      close   $MEM_FH;

      MCE->gather(@matches);
   }

} 'very_huge.file';

print join('', @result);

printf "Took %.3f seconds\n", time - $start;
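The core trick in the script above (scan the whole chunk once, and only split it into lines when that scan finds something) can be tried without MCE. The following is a minimal sketch in plain Perl; the helper name grep_chunk and the sample chunk are hypothetical and exist only for illustration, and the pre-check here uses the same pattern and flags as the per-line match.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of MCE): return the lines of a chunk
# that match $pattern, skipping line-by-line work when nothing matches.
sub grep_chunk {
    my ($chunk_ref, $pattern) = @_;
    my @matches;

    # Cheap pre-check: a single regex scan over the whole chunk.
    return @matches unless $$chunk_ref =~ /$pattern/;

    # Only now read the chunk line by line through an in-memory filehandle.
    open my $mem_fh, '<', $chunk_ref
        or die "Cannot open in-memory filehandle: $!";
    while (my $line = <$mem_fh>) {
        push @matches, $line if $line =~ /$pattern/;
    }
    close $mem_fh;

    return @matches;
}

# Made-up sample chunk standing in for one slurped block of the file.
my $chunk = "foo bar baz\nnose cuke karl\nnothing here\n";
print grep_chunk(\$chunk, qr/karl/);
```

When matches are rare, as in the test file, most chunks fail the pre-check and are never split into lines at all.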

I have taken the time to answer the OP's request -- free time. It is not worth it anymore at this site, especially when you (being at the Pope level) seem to disapprove of MCE.

Best regards to all, -mario

Re^8: Threads From Hell #2: How To Parse A Very Huge File by BrowserUk (Patriarch) on May 24, 2015 at 20:15 UTC
"I tested a 20 GiB file under the host OS"

Okay. Let's do a little math:

grep: 21474836480 / 407 * 8 = 422109800 == 422 Mbits/s. That is very fast. Way faster than my brand new disk and SSD; and equals the performance of the PCIe SSDs one of my clients recently fitted to their servers. Very fast, but believable.

mce_grep: 21474836480 / 137 * 8 = 1254005048 == 1.2 Gbits/s. That is faster than any single device or interface that I have heard of.

egrep.pl: 21474836480 / 26.447 * 8 = 6495961426 == 6.5 Gbits/s. That's getting up there with the bandwidth of the PCI Express 3.1 specifications (8 GT/s); but as yet there are no devices available that support that!

mce_loop_script: 21474836480 / 25.650 * 8 = 6697804750 == 6.7 Gbits/s. That would give the Intel QuickPath Interconnect processor internal bus a run for its money on some of the low-powered, low clock-speed processors.

Sorry, but unless you have this file distributed across multiple spindles attached via multiple 16-lane PCIe cards; or maybe you're using a system that has 32 GB of RAM and you're pre-caching the file there as you were earlier; those numbers just don't add up.

"especially when you (being at the Pope level) seem to disprove of MCE."

I don't disapprove of MCE. I can see that for tasks where the IO is a small part of the overall processing time -- example: fuzzy searching for many substrings against huge DNA sequences -- MCE provides a much needed solution for distributing the processing against a common dataset, something that threads (because of the slowness and gratuitous memory usage of threads::shared) simply don't have a good solution to. For those types of processing, MCE is a breath of fresh air, and I applaud you for it.

But the numbers you are posting for this single file, single pass, simple search application seem to defy the laws of physics.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
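The throughput arithmetic in the post above can be reproduced with a few lines of Perl. This is only a sketch: the helper name throughput is hypothetical, the file size is taken as exactly 20 GiB, and the times are the wall-clock figures reported earlier in the thread. It prints each rate both in decimal gigabits per second and in GiB per second, so the figures can be compared with device ratings quoted in either unit.

```perl
use strict;
use warnings;

# File size and wall-clock times as quoted in the thread.
my $bytes = 20 * 1024**3;    # 20 GiB = 21474836480 bytes

my @runs = (
    [ 'grep',            407    ],
    [ 'mce_grep',        137    ],
    [ 'egrep.pl',        26.447 ],
    [ 'mce_loop_script', 25.650 ],
);

# Hypothetical helper: convert a byte count and elapsed seconds into
# decimal gigabits per second and binary gibibytes per second.
sub throughput {
    my ($size, $secs) = @_;
    my $gbit_s = $size * 8 / $secs / 1e9;
    my $gib_s  = $size / $secs / 1024**3;
    return ($gbit_s, $gib_s);
}

for my $run (@runs) {
    my ($name, $secs) = @$run;
    my ($gbit_s, $gib_s) = throughput($bytes, $secs);
    printf "%-16s %8.3f s  %6.2f Gbit/s  %5.2f GiB/s\n",
        $name, $secs, $gbit_s, $gib_s;
}
```

For the MCE::Loop run, 6.7 Gbit/s works out to roughly 0.78 GiB per second.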
Re^9: Threads From Hell #2: How To Parse A Very Huge File by marioroy (Prior) on May 24, 2015 at 21:44 UTC
Correction: 1 TB module. The late 2013 MacBook Pro model fitted with a PCIe SSD using 4 lanes, particularly the 1 TB module, is capable of 0.9 ~ 1.0 GiB per second. I ran on my laptop, which is maxed out at 16 GB of memory. Surely, MCE will not run faster than the underlying hardware. There is no reason for it not to run as fast as the hardware allows, either. Although not reaching 0.9 ~ 1.0 GiB per second, it did run at 0.8 GiB per second, with IO being the bottleneck.
Re^10: Threads From Hell #2: How To Parse A Very Huge File by BrowserUk (Patriarch) on May 24, 2015 at 23:09 UTC
"PCIe SSD using 4 lanes, particularly the 1 TB module, is capable of 0.9 ~ 1.0 GiB per second"

Impressive hardware.
Re^9: Threads From Hell #2: How To Parse A Very Huge File by marioroy (Prior) on May 24, 2015 at 22:31 UTC
MCE has the parallel_io option, which is not enabled by default. This seems like a good time to try it. By the way, I do not recommend enabling this option when reading from an NFS server or from mechanical drives. IO reads for an input file are normally sequential, with only one worker reading at a time.

MCE::Loop::init { max_workers => 4, parallel_io => 1, use_slurpio => 1 };

$ time ./mce_loop_script.pl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
Took 23.541 seconds

real    0m23.670s
user    0m51.046s
sys     0m 7.150s

That chunks at 0.85 GB per second and is not able to go any faster.