PerlMonks  

Re^2: Weighted frequency of characters in an array of strings

by K_Edw (Beadle)
on Jun 08, 2016 at 08:37 UTC ( [id://1165133] )


in reply to Re: Weighted frequency of characters in an array of strings
in thread Weighted frequency of characters in an array of strings

Thanks! This is significantly faster (and also easier to read, in my opinion) when run on a real dataset:

Execution Time (Original): 293 seconds

Execution Time (New): 53 seconds

5.52x speed improvement.

Would you be able to explain why this approach is much faster (or how the old one was so slow)?


Replies are listed 'Best First'.
Re^3: Weighted frequency of characters in an array of strings
by BrowserUk (Patriarch) on Jun 08, 2016 at 09:59 UTC

    Your OP code is slow:

    my %freq;
    for my $i ( 1 .. @data ) {
        $freq{ $1 }[ $-[0] ] += 1/@data while $data[ $i-1 ] =~ /(.)/g;
    }

    For several reasons (in roughly descending order of penalty):

    1. It starts the regex engine for every line.
      1. Just starting the regex engine, even with a simple pattern, is expensive.
      2. I'm not sure whether that simple pattern gets re-parsed each time, or if the parsing is cached; but it's still parsing.
      3. It uses capture brackets, which means that not only does $1 have to be populated, but also @+, @-, $&, $`, $' and possibly %+ ...
    2. It performs the same invariant calculation (1/@data) for every line.

      Moving that calculation out of the loop and adding the precomputed invariant inside it is simple and effective.

    3. It does a completely unnecessary calculation ($i-1) for every line.

      Changing the loop parameters to for my $i ( 0 .. $#data ) avoids that repeated subtraction.

    4. It indexes into @data for every line.

      Why generate a list of integers, and then use those integers to index into the array, when you can access the array elements directly?

    Most of those can be avoided like this:

    my $inc = 1 / @data;  ## do the invariant once
    for ( @data ) {       ## $_ aliases each line in turn; avoids both the $i-1 and indexing the array
        $freq{ $1 }[ $-[0] ] += $inc while /(.)/g;  ## the regex engine operates directly on the $_ alias
    }

    But that still incurs the penalty of the regex engine to get the chars 1 at a time. A naive alternative is ... for split //; but that still involves the regex engine, albeit in a faster mode.
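    For comparison, the split variant mentioned above might look like this. This is a sketch on a tiny made-up dataset; @data and its values are placeholders, not the thread's real 200,000-line input:

    ```perl
    use strict;
    use warnings;

    my @data = ( 'ab', 'ac' );   # toy stand-in for the real dataset
    my $inc  = 1 / @data;        # hoisted invariant, as above

    my %freq;
    for my $line ( @data ) {
        my $p = 0;               # column position within the line
        # split // invokes the regex engine, but in a faster mode than /(.)/g
        $freq{ $_ }[ $p++ ] += $inc for split //, $line;
    }
    # 'a' sits at column 0 in both lines, so its weight there is 1.0;
    # 'b' and 'c' each appear once at column 1, weight 0.5 each.
    ```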

    The regex can be avoided completely using ... for unpack '(a1)*', $_;, but that template still requires parsing, albeit in a simpler language. The big penalty here is the construction of a long list of single chars, each in its own scalar that each require over 100 bytes of memory. That is a costly process.
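    Spelled out, the unpack variant might look like this (again a sketch on a toy dataset, not the real input):

    ```perl
    use strict;
    use warnings;

    my @data = ( 'ab', 'ac' );   # toy stand-in for the real dataset
    my $inc  = 1 / @data;

    my %freq;
    for my $line ( @data ) {
        my $p = 0;
        # '(a1)*' carves the string into 1-byte fields without the regex
        # engine, but still materialises one scalar per character.
        $freq{ $_ }[ $p++ ] += $inc for unpack '(a1)*', $line;
    }
    ```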

    Contrast that with the substr version: $freq{ substr $data, $_, 1 }[ $_ ] += $inc for 0 .. length( $data )-1;. That numerical iterator costs only ~20 bytes regardless of the length of the string.
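    As a runnable loop (toy dataset again, with the string variable named consistently), the substr approach could be sketched as:

    ```perl
    use strict;
    use warnings;

    my @data = ( 'ab', 'ac' );   # toy stand-in for the real dataset
    my $inc  = 1 / @data;

    my %freq;
    for my $line ( @data ) {
        # index straight into the string; no per-character scalars are built
        $freq{ substr $line, $_, 1 }[ $_ ] += $inc for 0 .. length( $line ) - 1;
    }
    ```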

    My other version trades the (fairly expensive) call to substr and the iterator for a very cheap operation, chop, plus an increment. And a little obscurity.

    It can however be made to be the fastest of the lot with a couple of simple changes:

    for ( @data ) {                                ## use $_ to alias the lines
        my $p = -1;                                ## allow pre-increment, which is more efficient
        $freq{ chop() }[ ++$p ] += $inc while $_;  ## implicit arg for chop; avoid call to length
    }

    That might see you over the 6X mark. Or not.
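    The only way to know which variant wins on your own data is to measure. The core Benchmark module can compare them; this is a sketch, with a placeholder dataset and only two of the variants (the chop version is omitted here because it destroys its input, which would skew repeated benchmark runs unless the data is copied each time):

    ```perl
    use strict;
    use warnings;
    use Benchmark qw( cmpthese );

    my @data = ( 'abcdefgh' ) x 1000;   # placeholder; swap in the real lines
    my $inc  = 1 / @data;

    my $chart = cmpthese( 50, {
        regex => sub {
            my %freq;
            for ( @data ) { $freq{ $1 }[ $-[0] ] += $inc while /(.)/g }
        },
        substr => sub {
            my %freq;
            for my $line ( @data ) {
                $freq{ substr $line, $_, 1 }[ $_ ] += $inc
                    for 0 .. length( $line ) - 1;
            }
        },
    } );
    ```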


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice. Not understood.

      The chop version is wrong: it trims characters from the end of the string but counts positions from zero. This fixes that:

      for ( @data ) {                                 ## use $_ to alias the lines
          my $p = length();                           ## allow pre-decrement and count backwards
          $freq{ chop() }[ --$p ] += $inc while $p;   ## implicit arg for chop; avoid call to length
      }

Re^3: Weighted frequency of characters in an array of strings
by Laurent_R (Canon) on Jun 08, 2016 at 09:35 UTC
    In addition to substr being faster than a regex, your code is computing:
    1/@data
    200,000 times. Computing this value only once before looping over the data and storing the result in a variable is very likely to save you significant time (though benchmarking and/or profiling would be needed to say how much). Storing the result of a computation to avoid re-computing it many times is called caching or memoizing (see for example https://en.wikipedia.org/wiki/Memoization) and is a very common optimization technique.
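    The hoisting itself is a one-line change. A minimal sketch, with a toy array standing in for the real data:

    ```perl
    use strict;
    use warnings;

    my @data = ( 'aa', 'ab', 'ba' );   # toy stand-in for the 200,000 lines

    # before: 1/@data would be re-evaluated on every single addition
    # after:  one division, stored and reused
    my $inc = 1 / @data;               # 1/3, computed exactly once

    my $total = 0;
    $total += $inc for @data;          # same arithmetic, far fewer divisions
    ```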
Re^3: Weighted frequency of characters in an array of strings
by Anonymous Monk on Jun 08, 2016 at 09:02 UTC
