Re^4: Threads From Hell #2: How To Parse A Very Huge File

by BrowserUk (Patriarch)
on May 24, 2015 at 14:22 UTC


in reply to Re^3: Threads From Hell #2: How To Parse A Very Huge File
in thread Threads From Hell #2: How To Search A Very Huge File [SOLVED]

the MCE::Loop example

I don't see any mention of MCE::Loop in either of your posts?

The testing was done on a CentOS VM

I see. And on what hardware can you read 2 GiB at 16 Gbit/s? (2 GiB is 16 Gbit, so that rate implies the whole file was read in about one second.)


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

Re^5: Threads From Hell #2: How To Parse A Very Huge File
by marioroy (Prior) on May 24, 2015 at 16:13 UTC

    The testing was done on a late-2013 MacBook Pro (Haswell Core i7) at 2.6 GHz with 1600 MHz memory. I am running Parallels Desktop 9.0. The grep/wc commands and the Perl scripts were reading a file that was likely residing in the OS-level file cache from repeated testing.

      the file likely residing in OS level file cache from repeated testing.

      Indeed.

      That's why I used a 10 GB file for my testing. I've only got 8 GB of RAM, so there's no way for the file to be read from cache on subsequent tests.
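      Something along these lines will produce such a file (a minimal sketch; the file name 'big.test' and the line contents are purely illustrative, the only point being that the file is larger than RAM and contains a few lines matching the search term):

        use strict;
        use warnings;

        ## Minimal sketch: write ~10 GB of short filler lines, with a rare line
        ## containing the search term, so the file cannot fit in 8 GB of RAM.
        ## (The file name 'big.test' and the line contents are illustrative only.)
        open my $out, '>', 'big.test' or die "open: $!";
        my $filler = "nothing to see here, move along\n";    ## 32 bytes per line
        for my $i ( 1 .. 312_500_000 ) {                      ## ~10 GB in total
            print {$out} ( $i % 31_250_000 ? $filler : "a line containing karl\n" );
        }
        close $out;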

      In the real world, where the file being searched is coming off a disk or SSD, there is no benefit to multi-tasking grep.
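      As a rough back-of-envelope (figures assumed for illustration, not measured): a spinning disk delivering ~150 MB/s takes over a minute just to hand a 10 GB file to the reader, and even a SATA SSD at ~500 MB/s needs around 20 seconds; a single grep can normally consume data faster than either can supply it, so additional workers just queue up behind the same device.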

      Even in the extremely rare case of grepping the same file multiple times, although your numbers show a reduction in elapsed time, the CPU usage is actually about 19% higher (2.527 / 2.127 ≈ 1.19).

      If the user is (for want of a better term) an end-user, who types the command and hits enter, the saving of a second or so is probably less than the time it took him to decide what to type and to type it; and certainly less than the time he will take to decide what to do with the information it produces.

      On the other hand, if the user is a sysadmin trying to balance the needs of many processes across a farm of servers, using that extra 19% of CPU resource is probably a bad thing.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
      In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

        The OP seemed interested in whether parallelism is possible for such a task. Please disregard my posts if I have misunderstood. In the spirit of parallelism, I tested a 20 GiB file under the host OS (a laptop with 16 GiB of RAM), comparing the grep command, bin/mce_grep, examples/egrep.pl, and the script using MCE::Loop.

        Recap: bin/mce_grep is a parallel wrapper for the grep command; examples/egrep.pl is 100% Perl code.

        I am getting the impression that you do not like MCE. If that is the case, then I should refrain from posting here. Have you not tried MCE against your 10 GiB file, e.g. bin/mce_grep or examples/egrep.pl?

        $ ls -lh very_huge.file
        -rw-r--r-- 1 mario staff 20G May 24 14:53 very_huge.file

        ## grep command
        $ time grep karl very_huge.file
        nose cuke karl
        nose cuke karl
        nose cuke karl
        nose cuke karl
        nose cuke karl
        nose cuke karl
        nose cuke karl
        nose cuke karl
        nose cuke karl
        nose cuke karl

        real    6m47.048s   ( 407 seconds )
        user    6m42.372s
        sys     0m 4.669s

        ## bin/mce_grep
        $ time ./MCE-1.608/bin/mce_grep karl very_huge.file
        nose cuke karl
        nose cuke karl
        nose cuke karl
        nose cuke karl
        nose cuke karl
        nose cuke karl
        nose cuke karl
        nose cuke karl
        nose cuke karl
        nose cuke karl

        real    2m17.003s   ( 137 seconds )
        user    17m 9.223s
        sys     0m33.223s

        ## examples/egrep.pl
        $ time ./MCE-1.608/examples/egrep.pl karl very_huge.file
        nose cuke karl
        nose cuke karl
        nose cuke karl
        nose cuke karl
        nose cuke karl
        nose cuke karl
        nose cuke karl
        nose cuke karl
        nose cuke karl
        nose cuke karl

        real    0m26.447s
        user    0m22.527s
        sys     0m 8.459s

        ## MCE::Loop script
        $ time ./mce_loop_script.pl
        nose cuke karl
        nose cuke karl
        nose cuke karl
        nose cuke karl
        nose cuke karl
        nose cuke karl
        nose cuke karl
        nose cuke karl
        nose cuke karl
        nose cuke karl
        Took 25.650 seconds

        real    0m25.764s
        user    0m42.494s
        sys     0m 7.264s
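        By my arithmetic from the figures above, the pure-Perl examples/egrep.pl finishes in roughly one fifteenth of grep's elapsed time (407 s / 26.4 s ≈ 15.4), and the MCE::Loop script is marginally faster still at 25.8 s, while bin/mce_grep spends far more aggregate CPU (over 17 minutes of user time) for its 137 s elapsed.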

        Below is the script using MCE::Loop.

        use MCE::Loop;
        use Time::HiRes qw( time );

        MCE::Loop::init( { max_workers => 4, use_slurpio => 1 } );

        my $start   = time;
        my $pattern = 'karl';

        my @result = mce_loop_f {
            my ($mce, $slurp_ref, $chunk_id) = @_;

            ## Quickly determine if a match is found.
            ## Basically, only process slurped chunk if true.
            if ($$slurp_ref =~ /$pattern/im) {
                my @matches;
                open my $MEM_FH, '<', $slurp_ref;
                binmode $MEM_FH, ':raw';
                while (<$MEM_FH>) {
                    push @matches, $_ if (/$pattern/);
                }
                close $MEM_FH;
                MCE->gather(@matches);
            }

        } 'very_huge.file';

        print join('', @result);
        printf "Took %.3f seconds\n", time - $start;
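        For a sanity check, a plain single-process version of the same search (a minimal sketch, not something I timed above) looks like this; it should print the same matches as the MCE::Loop script, just without the parallel chunking:

        use strict;
        use warnings;
        use Time::HiRes qw( time );

        ## Single-process baseline: scan very_huge.file line by line and print
        ## every line containing the pattern, timing the whole pass.
        my $start   = time;
        my $pattern = 'karl';

        open my $fh, '<', 'very_huge.file' or die "open: $!";
        while ( my $line = <$fh> ) {
            print $line if $line =~ /$pattern/;
        }
        close $fh;

        printf "Took %.3f seconds\n", time - $start;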

        I have taken the time to answer the OP's request -- free time. It is not worth it anymore at this site, especially when you (being at the Pope level) seem to disapprove of MCE.

        Best regards to all, -mario
