ccelt09 has asked for the wisdom of the Perl Monks concerning the following question:

This program counts the occurrences of a single number in a very long string, 200e6 characters long. The first run counted from the start up to character 2.7e6 and returned its output in about twelve seconds. I figured running from 2.7e6 to the end would take no more than 12 x 100 seconds ~ 20 minutes. It has actually been running for over two hours now and seems to have flatlined. This seems unnecessarily long given the scope of the program. Any thoughts on improving this code?

Can it be done without storing the input file as a scalar or converting it to an array?

#!/usr/bin/perl -w
use strict;
use warnings;
#---------------
my $total_filtered = 0;
my $input_file = "$input_dir"."input".".txt";
open(CGS, "<$input_file") or die "can't open $input_file\n";
my $cgs = <CGS>;
my $substring = substr($cgs, 2700000);
my @array = split(//,$substring);
foreach my $line (@array){
    if($line =~ /0/){
        $total_filtered++;
    }
}
print "$total_filtered\n";

Replies are listed 'Best First'.
Re: Improving Efficiency
by BrowserUk (Patriarch) on Aug 31, 2013 at 12:32 UTC

    That is a very slow way to count characters in a string.

    This finds 20 million '0's in a string of 200 million characters in a little over 1/3rd of a second:

    C:\test>p1
    $s = '0123456789' x 20e6;;
    print length $s;;
    200000000

    say time; printf "found %d zeros\n", $s =~ tr[0][]; say time;;
    1377952426.42676
    found 20000000 zeros
    1377952426.78727
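    (A rough standalone sketch of the same measurement, for anyone who wants to reproduce it outside an interactive shell; the 20e6 multiplier is just the test size used above:)

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Time::HiRes 'time';

    # Build a 200-million-character test string and count the '0's with tr.
    my $s     = '0123456789' x 20e6;
    my $start = time;
    my $zeros = ( $s =~ tr[0][] );     # tr returns the number of matched characters
    printf "found %d zeros in %.3f seconds\n", $zeros, time - $start;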

    Update: The probable reason your code is "flatlining" is that creating an array to hold 200 million individual characters probably requires far more memory (~6.4GB on a 64-bit system) than your system actually has, and so your computer is "thrashing".
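    (To sanity-check that estimate on your own build, Devel::Size, if installed, can report the footprint of a smaller array which you can then scale up; the 1e6-element sample size below is arbitrary:)

    use strict;
    use warnings;
    use Devel::Size 'total_size';

    # Memory used by one million single-character array elements,
    # extrapolated to the 200 million elements the original code builds.
    my @sample = split //, '0123456789' x 1e5;    # 1,000,000 one-character strings
    my $bytes  = total_size( \@sample );
    printf "1e6 chars as an array: %d bytes (~%.1f GB for 200e6)\n",
        $bytes, $bytes * 200 / 2**30;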


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Improving Efficiency
by hdb (Monsignor) on Aug 31, 2013 at 12:35 UTC

    If you set $/ to a reference to an integer, <...> reads that many characters at a time. Then use tr as in BrowserUk's post to count the zeroes.

    use strict;
    use warnings;

    my $total_filtered = 0;
    open my $cgs, "<", "count.txt";
    $/ = \10000;    # blocksize
    $total_filtered += tr/0/0/ while <$cgs>;
    print "Found $total_filtered zeroes.\n";

    UPDATE: Changed 10000 to \10000 as a reference is needed. Thanks to Anonymous monk below!

      ... works when counting single characters but could fail to count a multi-character sequence that falls right along the read-chunk boundary (such that one-half is in one read and the other half is in the other). ... could also present more-serious problems if the characters being sought (or any characters) are multibyte.
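      One common workaround, sketched below with a hypothetical two-character target and the file name from hdb's example, is to carry the last length(pattern)-1 characters of each chunk over to the next read, so a sequence split across the boundary is still seen (patterns that can overlap themselves, such as '00', need extra bookkeeping to avoid double counting):

      use strict;
      use warnings;

      my $pattern = '01';                  # hypothetical multi-character target
      my $overlap = length($pattern) - 1;
      my $count   = 0;
      my $tail    = '';

      open my $fh, '<', 'count.txt' or die "can't open count.txt: $!";
      local $/ = \10_000;                  # blocksize, as in hdb's example
      while ( my $chunk = <$fh> ) {
          my $buf = $tail . $chunk;        # re-attach the end of the previous chunk
          $count++ while $buf =~ /\Q$pattern\E/g;
          $tail = substr $buf, -$overlap;  # keep the boundary characters for next time
      }
      print "Found $count occurrences of '$pattern'.\n";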
Re: Improving Efficiency
by Corion (Patriarch) on Aug 31, 2013 at 13:11 UTC

    I wanted a situation where I could give File::Map a run, and it seems it didn't fare too badly. The weird thing is how badly the readline approach fared against it, and how close File::Map comes to the in-memory baseline:

    Update: Fixed the local $/= 10000; bug, as pointed out by Anonymous Monk. The running times were also updated.

    #!perl -w
    use strict;
    use 5.010;
    use Benchmark qw(:all);
    use File::Map 'map_file';

    my $testfile= "$0.testdata";
    my $data= '0123456789' x 20e6;

    # Create the test file. This likely means that it is still hot in the cache...
    open my $fh, '>', $testfile
        or die "Couldn't create '$testfile': $!";
    print {$fh} $data;
    undef $fh;

    sub tr_in_memory {
        (my $fn)= @_;
        my $count= ($data =~ tr[0][0]);
        $count
    };

    sub tr_map_file {
        (my $fn)= @_;
        map_file my($content), $testfile;
        my $count= ($content =~ tr[0][0]);
        $count
    };

    sub tr_via_readline_10_000 {
        my $total_filtered = 0;
        open my $cgs, "<", $testfile;
        local $/ = \10_000; # blocksize
        $total_filtered += tr/0/0/ while <$cgs>;
    };

    sub tr_via_readline_100_000 {
        my $total_filtered = 0;
        open my $cgs, "<", $testfile;
        local $/ = \100_000; # blocksize
        $total_filtered += tr/0/0/ while <$cgs>;
    };

    sub tr_via_readline_1_000_000 {
        my $total_filtered = 0;
        open my $cgs, "<", $testfile;
        local $/ = \1_000_000; # blocksize
        $total_filtered += tr/0/0/ while <$cgs>;
    };

    say sprintf "Running with a dataset of %d", length $data;

    cmpthese( 30, {
        'tr_map_file'          => \&tr_map_file,
        'tr_in_memory'         => \&tr_in_memory,
        'tr_via_readline 10k'  => \&tr_via_readline_10_000,
        'tr_via_readline 100k' => \&tr_via_readline_100_000,
        'tr_via_readline 1m'   => \&tr_via_readline_1_000_000,
    } );

    Results on my machine (64-bit Windows 7, 64 bit Strawberry Perl 5.18, i7 at 3.50GHz, SSD as storage, 32GB RAM with 4GB in use, so the file was likely hot in the cache anyway):

    X:\>perl -w tmp.pl
    Running with a dataset of 200000000
                           Rate tr_via_readline 10k tr_via_readline 100k tr_via_readline 1m tr_map_file tr_in_memory
    tr_via_readline 10k  2.18/s                  --                 -15%               -17%        -71%         -76%
    tr_via_readline 100k 2.55/s                 17%                   --                -2%        -66%         -72%
    tr_via_readline 1m   2.61/s                 20%                   2%                 --        -66%         -72%
    tr_map_file          7.60/s                249%                 198%               191%          --         -18%
    tr_in_memory         9.25/s                325%                 263%               254%         22%           --

    To see if I could get some cache thrashing, I upped the dataset multiplier to 20e7 (a 2e9-character string), which didn't change the results that much, with perl.exe taking between 4 and 10 GB of RAM:

                         s/iter tr_via_readline 10k tr_via_readline 100k tr_via_readline 1m tr_map_file tr_in_memory
    tr_via_readline 10k    4.61                  --                 -15%               -17%        -65%         -77%
    tr_via_readline 100k   3.91                 18%                   --                -2%        -59%         -72%
    tr_via_readline 1m     3.83                 20%                   2%                 --        -58%         -72%
    tr_map_file            1.60                189%                 145%               140%          --         -32%
    tr_in_memory           1.08                327%                 262%               255%         48%           --

    Here it seems that the file is no longer as readily available from the cache, as File::Map slows down markedly compared to the in-memory baseline.

      I believe you have a typo: the blocksize must be a reference, so it needs a \
      local $/ = 10000; # blocksize

        Correct. Thanks for spotting it. I have changed it in my post.

Re: Improving Efficiency
by xiaoyafeng (Deacon) on Sep 01, 2013 at 14:20 UTC
    Please refer to this node to understand how Perl manages memory. If you've got enough memory, you can try File::Map as Corion pointed out.
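    A minimal File::Map version of the original counting task might look like the sketch below (the input file name is assumed; see Corion's benchmark above for timings):

    use strict;
    use warnings;
    use File::Map 'map_file';

    # Map the file into the process address space instead of slurping it;
    # the OS pages it in on demand, so no 200 MB copy is made up front.
    map_file my $content, 'input.txt';
    my $zeros = ( $content =~ tr[0][0] );
    print "Found $zeros zeroes.\n";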




    I am trying to improve my English skills, if you see a mistake please feel free to reply or /msg me a correction