I wanted a situation where I could give File::Map a run, and it seems it didn't fare too badly. The weird thing is how badly the readline approach fared against it, and how close File::Map comes to the in-memory baseline:

Update: Fixed the local $/= 10000; bug, as pointed out by Anonymous Monk. The running times were also updated.

#!perl -w
use strict;
use 5.010;
use Benchmark qw(:all);
use File::Map 'map_file';

my $testfile= "$0.testdata";
my $data= '0123456789' x 20e6;

# Create the test file. This likely means that it is still hot in the cache...
open my $fh, '>', $testfile
    or die "Couldn't create '$testfile': $!";
print {$fh} $data;
undef $fh;

sub tr_in_memory {
    (my $fn)= @_;
    my $count= ($data =~ tr[0][0]);
    $count
};

sub tr_map_file {
    (my $fn)= @_;
    map_file my($content), $testfile;
    my $count= ($content =~ tr[0][0]);
    $count
};

sub tr_via_readline_10_000 {
    my $total_filtered = 0;
    open my $cgs, "<", $testfile;
    local $/ = \10_000; # blocksize
    $total_filtered += tr/0/0/ while <$cgs>;
};

sub tr_via_readline_100_000 {
    my $total_filtered = 0;
    open my $cgs, "<", $testfile;
    local $/ = \100_000; # blocksize
    $total_filtered += tr/0/0/ while <$cgs>;
};

sub tr_via_readline_1_000_000 {
    my $total_filtered = 0;
    open my $cgs, "<", $testfile;
    local $/ = \1_000_000; # blocksize
    $total_filtered += tr/0/0/ while <$cgs>;
};

say sprintf "Running with a dataset of %d", length $data;
cmpthese( 30, {
    'tr_map_file'          => \&tr_map_file,
    'tr_in_memory'         => \&tr_in_memory,
    'tr_via_readline 10k'  => \&tr_via_readline_10_000,
    'tr_via_readline 100k' => \&tr_via_readline_100_000,
    'tr_via_readline 1m'   => \&tr_via_readline_1_000_000,
} );
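As a sanity check on the readline variants, here is a minimal sketch of the fixed-size record idiom on its own. Setting $/ to a reference to a number makes readline return records of that many bytes, whereas assigning the plain number 10000 would make Perl use the string "10000" as the record separator. The dataset size, the 4096-byte block size and the helper name count_zeros_chunked are assumptions chosen for illustration; they are not part of the benchmark.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Small stand-in dataset (hypothetical size, far smaller than the benchmark's).
my $data = '0123456789' x 1_000;

# Write it to a temporary file that is removed when the script exits.
my ($out, $filename) = tempfile(UNLINK => 1);
print {$out} $data;
close $out or die "Couldn't close '$filename': $!";

# Count the zeros in a file by reading fixed-size records of $blocksize bytes.
sub count_zeros_chunked {
    my ($file, $blocksize) = @_;
    open my $in, '<', $file or die "Couldn't read '$file': $!";
    local $/ = \$blocksize;    # reference to a number: read records of that many bytes
    my $total = 0;
    $total += tr/0// while <$in>;    # tr with an empty replacement only counts
    return $total;
}

my $in_memory = ($data =~ tr/0//);
my $chunked   = count_zeros_chunked($filename, 4096);
print "in memory: $in_memory, chunked: $chunked\n";
```

Both counts should agree regardless of the block size, since tr counts single characters and no match can straddle a record boundary.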

Results on my machine (64-bit Windows 7, 64-bit Strawberry Perl 5.18, i7 at 3.50GHz, SSD as storage, 32GB RAM with 4GB in use, so the file was likely hot in the cache anyway):

X:\>perl -w tmp.pl
Running with a dataset of 200000000
                       Rate tr_via_readline 10k tr_via_readline 100k tr_via_readline 1m tr_map_file tr_in_memory
tr_via_readline 10k  2.18/s                  --                 -15%               -17%        -71%         -76%
tr_via_readline 100k 2.55/s                 17%                   --                -2%        -66%         -72%
tr_via_readline 1m   2.61/s                 20%                   2%                 --        -66%         -72%
tr_map_file          7.60/s                249%                 198%               191%          --         -18%
tr_in_memory         9.25/s                325%                 263%               254%         22%           --

To see if I could get some cache thrashing, I upped the repeat count to 20e7 (a dataset of about 2 GB). That didn't change the results much, with Perl.exe taking between 4 and 10 GB RAM:

                     s/iter tr_via_readline 10k tr_via_readline 100k tr_via_readline 1m tr_map_file tr_in_memory
tr_via_readline 10k    4.61                  --                 -15%               -17%        -65%         -77%
tr_via_readline 100k   3.91                 18%                   --                -2%        -59%         -72%
tr_via_readline 1m     3.83                 20%                   2%                 --        -58%         -72%
tr_map_file            1.60                189%                 145%               140%          --         -32%
tr_in_memory           1.08                327%                 262%               255%         48%           --

Here it seems that the file isn't as readily available to File::Map anymore, as File::Map slows down noticeably compared to the in-memory baseline.
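To put a number on that slowdown, the gap between File::Map and the in-memory baseline can be recomputed from the figures in the two result tables. The sketch below is only arithmetic on those published numbers; the helper name percent_slower and the hash layout are assumptions for illustration, and the first run's rates are converted to seconds per iteration so both runs use the same unit.

```perl
use strict;
use warnings;

# Figures copied from the two result tables, expressed as seconds per iteration.
my %run = (
    '20e6' => { map_file => 1 / 7.60, in_memory => 1 / 9.25 },   # given as rates (per second)
    '20e7' => { map_file => 1.60,     in_memory => 1.08     },   # given as s/iter
);

# By how many percent is the throughput of $secs below that of $baseline_secs?
sub percent_slower {
    my ($secs, $baseline_secs) = @_;
    return 100 * (1 - $baseline_secs / $secs);
}

for my $size (sort keys %run) {
    printf "x %s: File::Map is %.1f%% slower than in-memory\n",
        $size, percent_slower(@{ $run{$size} }{qw(map_file in_memory)});
}
```

The results line up with the -18% and -32% entries in the tr_map_file rows, so the relative gap roughly doubles with the larger dataset.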


In reply to Re: Improving Efficiency by Corion
in thread Improving Efficiency by ccelt09
