Your OP code is slow:

my %freq; for my $i ( 1 .. @data ) { $freq{ $1 }[ $-[0] ] += 1/@data while $data[ $i-1 ] =~ /(.)/g; }

For several reasons (in roughly scale of penalty):

  1. It starts the regex engine for every line.
    1. Just starting the regex engine, even with a simple pattern, is expensive.
    2. I'm not sure whether that simple pattern get re-parsed each time; or if the parsing is cached, but its still parsing.
    3. It uses capture brackets, which means that not only does $1 have to be populated, but also @+, @-, $&, $`, $' and possibly %+ ...
  2. It performs the same invariant calculation (1/@data) for every line.

    Moving that out of the loop and adding the invariant is simple and effective.

  3. It does a completely unnecessary calculation ($i-1) for every line.

    Changing the loop parameters to for my $i ( 0 .. $#data ) avoids that repeated subtraction.

  4. It indexes into @data for every line.

    Why generate a list of integers, and then use those integers to index into the array, when you can access the array elements directly?

Most of those can be avoided like this:

my $inc = 1 / @data; ## do the i +nvariant once: for ( @data ) { ## $_ alias +es each line in turn avoids both the $i-1 and indexing the array $freq{ $1 }[ $-[0] ] += $inc while /(.)/g; ## the rege +x engine operates directly on the $_ alias. }

But that still incurs the penalty of the regex engine to get the chars 1 at a time. A naive alternative is ... for split //; but that still involves the regex engine, albeit in a faster mode.

The regex can be avoided completely using ... for unpack '(a1)*', $_;, but that template still requires parsing, albeit a simpler language; but the big penalty here is the construction of a long list of single chars, each in its own scalar that each require over 100 bytes of memory. That is a costly process.

Contrast that with the substr version $freq{ substr $data, $_, 1 }[ $_ ] += $inc for 0 .. length( $d )-1;. That numerical iterator costs only ~20 bytes regardless of the length of the string.

My other version trades the (fairly expensive) call to substr and the iterator, for a very cheap operation chop and an increment. And a little obscurity.

It can however be made to be the fastest of the lot with a couple of simple changes:

for( @data ) { ## use $_ to alias the +lines my $p = -1; ## Allow pre-increment +which is more efficient $freq{ chop() }[ ++$p ] += $inc while $_; ## implicit arg for cho +p; avoid call to length. }

That might see you over the 6X mark. Or not.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
In the absence of evidence, opinion is indistinguishable from prejudice. Not understood.

In reply to Re^3: Weighted frequency of characters in an array of strings by BrowserUk
in thread Weighted frequency of characters in an array of strings by K_Edw

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.