ccelt09 has asked for the wisdom of the Perl Monks concerning the following question:

This program counts the occurrences of a single number in a very long string, 200e6 characters long. The first run counted from the start up to character 2.7e6 and returned its output in about twelve seconds. I figured running from 2.7e6 to the end would take no more than 12 x 100 seconds ~ 20 minutes. It has actually been running for over two hours now and seems to have flatlined. This seems unnecessarily long given the scope of the program. Any thoughts on improving this code?

Can it be done without storing the input file as a scalar or converting it to an array?

#!/usr/bin/perl -w
use strict;
use warnings;
#---------------
my $total_filtered = 0;
my $input_file = "$input_dir"."input".".txt";
open(CGS, "<$input_file") or die "can't open $input_file\n";
my $cgs = <CGS>;
my $substring = substr($cgs, 2700000);
my @array = split(//,$substring);
foreach my $line (@array){
    if($line =~ /0/){
        $total_filtered++;
    }
}
print "$total_filtered\n";

Replies are listed 'Best First'.
Re: Improving Efficiency
by BrowserUk (Patriarch) on Aug 31, 2013 at 12:32 UTC

    That is a very slow way to count characters in a string.

    This finds 20 million '0's in a string of 200 million characters in a little over 1/3rd of a second:

    C:\test>p1
    $s = '0123456789' x 20e6;;
    print length $s;;
    200000000

    say time; printf "found %d zeros\n", $s =~ tr[0][]; say time;;
    1377952426.42676
    found 20000000 zeros
    1377952426.78727
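    (A rough standalone sketch of the same measurement, for anyone who wants to reproduce it outside an interactive shell; the 20e6 multiplier is just the test size used above:)

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Time::HiRes 'time';

    # Build a 200-million-character test string and count the '0's with tr.
    my $s     = '0123456789' x 20e6;
    my $start = time;
    my $zeros = ( $s =~ tr[0][] );     # tr returns the number of matched characters
    printf "found %d zeros in %.3f seconds\n", $zeros, time - $start;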

    Update: The probable reason your code is "flatlining" is that creating an array to hold 200 million individual characters probably requires far more memory (~6.4GB on a 64-bit system) than your system actually has, and so your computer is "thrashing".
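    (To sanity-check that estimate on your own build, Devel::Size, if installed, can report the footprint of a smaller array which you can then scale up; the 1e6-element sample size below is arbitrary:)

    use strict;
    use warnings;
    use Devel::Size 'total_size';

    # Memory used by one million single-character array elements,
    # extrapolated to the 200 million elements the original code builds.
    my @sample = split //, '0123456789' x 1e5;    # 1,000,000 one-character strings
    my $bytes  = total_size( \@sample );
    printf "1e6 chars as an array: %d bytes (~%.1f GB for 200e6)\n",
        $bytes, $bytes * 200 / 2**30;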


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Improving Efficiency
by hdb (Monsignor) on Aug 31, 2013 at 12:35 UTC

    If you set $/ to a reference to an integer, <...> reads that many characters at a time. Then use tr as in BrowserUk's post to count the zeroes.

    use strict;
    use warnings;

    my $total_filtered = 0;
    open my $cgs, "<", "count.txt";
    $/ = \10000;    # blocksize
    $total_filtered += tr/0/0/ while <$cgs>;
    print "Found $total_filtered zeroes.\n";

    UPDATE: Changed 10000 to \10000 as a reference is needed. Thanks to Anonymous monk below!

      ... works when counting single characters but could fail to count a multi-character sequence that falls right along the read-chunk boundary (such that one-half is in one read and the other half is in the other). ... could also present more-serious problems if the characters being sought (or any characters) are multibyte.
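      One common workaround, sketched below with a hypothetical two-character target and the file name from hdb's example, is to carry the last length(pattern)-1 characters of each chunk over to the next read, so a sequence split across the boundary is still seen (patterns that can overlap themselves, such as '00', need extra bookkeeping to avoid double counting):

      use strict;
      use warnings;

      my $pattern = '01';                  # hypothetical multi-character target
      my $overlap = length($pattern) - 1;
      my $count   = 0;
      my $tail    = '';

      open my $fh, '<', 'count.txt' or die "can't open count.txt: $!";
      local $/ = \10_000;                  # blocksize, as in hdb's example
      while ( my $chunk = <$fh> ) {
          my $buf = $tail . $chunk;        # re-attach the end of the previous chunk
          $count++ while $buf =~ /\Q$pattern\E/g;
          $tail = substr $buf, -$overlap;  # keep the boundary characters for next time
      }
      print "Found $count occurrences of '$pattern'.\n";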
Re: Improving Efficiency
by Corion (Patriarch) on Aug 31, 2013 at 13:11 UTC

    I wanted a situation where I could give File::Map a run, and it seems it didn't fare too badly. The weird thing is how badly the readline approach fared against it, and how close File::Map comes to the in-memory baseline:

    Update: Fixed the local $/= 10000; bug, as pointed out by Anonymous Monk. The running times were also updated.

    #!perl -w
    use strict;
    use 5.010;
    use Benchmark qw(:all);
    use File::Map 'map_file';

    my $testfile= "$0.testdata";
    my $data= '0123456789' x 20e6;

    # Create the test file. This likely means that it is still hot in the cache...
    open my $fh, '>', $testfile
        or die "Couldn't create '$testfile': $!";
    print {$fh} $data;
    undef $fh;

    sub tr_in_memory {
        (my $fn)= @_;
        my $count= ($data =~ tr[0][0]);
        $count
    };

    sub tr_map_file {
        (my $fn)= @_;
        map_file my($content), $testfile;
        my $count= ($content =~ tr[0][0]);
        $count
    };

    sub tr_via_readline_10_000 {
        my $total_filtered = 0;
        open my $cgs, "<", $testfile;
        local $/ = \10_000; # blocksize
        $total_filtered += tr/0/0/ while <$cgs>;
    };

    sub tr_via_readline_100_000 {
        my $total_filtered = 0;
        open my $cgs, "<", $testfile;
        local $/ = \100_000; # blocksize
        $total_filtered += tr/0/0/ while <$cgs>;
    };

    sub tr_via_readline_1_000_000 {
        my $total_filtered = 0;
        open my $cgs, "<", $testfile;
        local $/ = \1_000_000; # blocksize
        $total_filtered += tr/0/0/ while <$cgs>;
    };

    say sprintf "Running with a dataset of %d", length $data;

    cmpthese( 30, {
        'tr_map_file'          => \&tr_map_file,
        'tr_in_memory'         => \&tr_in_memory,
        'tr_via_readline 10k'  => \&tr_via_readline_10_000,
        'tr_via_readline 100k' => \&tr_via_readline_100_000,
        'tr_via_readline 1m'   => \&tr_via_readline_1_000_000,
    } );

    Results on my machine (64-bit Windows 7, 64 bit Strawberry Perl 5.18, i7 at 3.50GHz, SSD as storage, 32GB RAM with 4GB in use, so the file was likely hot in the cache anyway):

    X:\>perl -w tmp.pl
    Running with a dataset of 200000000
                           Rate tr_via_readline 10k tr_via_readline 100k tr_via_readline 1m tr_map_file tr_in_memory
    tr_via_readline 10k  2.18/s                  --                 -15%               -17%        -71%         -76%
    tr_via_readline 100k 2.55/s                 17%                   --                -2%        -66%         -72%
    tr_via_readline 1m   2.61/s                 20%                   2%                 --        -66%         -72%
    tr_map_file          7.60/s                249%                 198%               191%          --         -18%
    tr_in_memory         9.25/s                325%                 263%               254%         22%           --

    To see if I could get some cache thrashing, I upped the dataset multiplier to 20e7 (a 2e9-character string), which didn't change the results that much, with perl.exe taking between 4 and 10 GB of RAM:

                         s/iter tr_via_readline 10k tr_via_readline 100k tr_via_readline 1m tr_map_file tr_in_memory
    tr_via_readline 10k    4.61                  --                 -15%               -17%        -65%         -77%
    tr_via_readline 100k   3.91                 18%                   --                -2%        -59%         -72%
    tr_via_readline 1m     3.83                 20%                   2%                 --        -58%         -72%
    tr_map_file            1.60                189%                 145%               140%          --         -32%
    tr_in_memory           1.08                327%                 262%               255%         48%           --

    Here it seems that the file is no longer as readily available from the cache, as File::Map slows down markedly compared to the in-memory baseline.

      I believe you have a typo: the blocksize must be a reference, so it needs a \
      local $/ = 10000; # blocksize

        Correct. Thanks for spotting it. I have changed it in my post.

Re: Improving Efficiency
by xiaoyafeng (Deacon) on Sep 01, 2013 at 14:20 UTC
    Please refer to this node to understand how Perl manages memory. If you've got enough memory, you can try File::Map as Corion pointed out.
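    A minimal File::Map version of the original counting task might look like the sketch below (the input file name is assumed; see Corion's benchmark above for timings):

    use strict;
    use warnings;
    use File::Map 'map_file';

    # Map the file into the process address space instead of slurping it;
    # the OS pages it in on demand, so no 200 MB copy is made up front.
    map_file my $content, 'input.txt';
    my $zeros = ( $content =~ tr[0][0] );
    print "Found $zeros zeroes.\n";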




    I am trying to improve my English skills, if you see a mistake please feel free to reply or /msg me a correction