in reply to Re^2: Weighted frequency of characters in an array of strings
in thread Weighted frequency of characters in an array of strings

Your OP code is slow:

my %freq;
for my $i ( 1 .. @data ) {
    $freq{ $1 }[ $-[0] ] += 1/@data while $data[ $i-1 ] =~ /(.)/g;
}

For several reasons (in rough order of penalty):

  1. It starts the regex engine for every line.
    1. Just starting the regex engine, even with a simple pattern, is expensive.
    2. I'm not sure whether that simple pattern gets re-parsed each time or whether the parsing is cached; either way, it's still parsing.
    3. It uses capture brackets, which means that not only does $1 have to be populated, but also @+, @-, $&, $`, $' and possibly %+ ...
  2. It performs the same invariant calculation (1/@data) for every line.

    Moving that calculation out of the loop and adding the precomputed invariant is simple and effective.

  3. It does a completely unnecessary calculation ($i-1) for every line.

    Changing the loop parameters to for my $i ( 0 .. $#data ) avoids that repeated subtraction.

  4. It indexes into @data for every line.

    Why generate a list of integers, and then use those integers to index into the array, when you can access the array elements directly?

Most of those can be avoided like this:

my $inc = 1 / @data;    ## do the invariant once

for ( @data ) {         ## $_ aliases each line in turn; avoids both the $i-1 and indexing the array
    $freq{ $1 }[ $-[0] ] += $inc while /(.)/g;    ## the regex engine operates directly on the $_ alias
}

But that still incurs the penalty of starting the regex engine to get the chars one at a time. A naive alternative is ... for split //; but that still involves the regex engine, albeit in a faster mode.
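In loop form, that split-based variant might look like this (a sketch, not the original post's code; @data here is toy sample data assumed for illustration):

```perl
use strict;
use warnings;

my @data = ( 'abc', 'abd' );    ## toy sample data, assumed for illustration
my $inc  = 1 / @data;

my %freq;
for ( @data ) {                 ## $_ aliases each line
    my $p = 0;
    ## split // is evaluated against the line before the inner loop starts;
    ## within the loop, $_ is re-aliased to each individual character
    $freq{ $_ }[ $p++ ] += $inc for split //;
}
## e.g. $freq{a}[0] is now 1; $freq{c}[2] and $freq{d}[2] are each 0.5
```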

The regex can be avoided completely using ... for unpack '(a1)*', $_;, but that template still requires parsing, albeit a simpler language; but the big penalty here is the construction of a long list of single chars, each in its own scalar that each require over 100 bytes of memory. That is a costly process.
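The unpack variant might be sketched like this (again with assumed toy data); it produces the same counts, but builds the full list of single-char scalars up front, which is the memory cost described above:

```perl
use strict;
use warnings;

my @data = ( 'abc', 'abd' );    ## toy sample data, assumed for illustration
my $inc  = 1 / @data;

my %freq;
for ( @data ) {
    my $p = 0;
    ## unpack '(a1)*' constructs a long list of single-char scalars --
    ## no regex engine, but a costly list build
    $freq{ $_ }[ $p++ ] += $inc for unpack '(a1)*', $_;
}
```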

Contrast that with the substr version: $freq{ substr $line, $_, 1 }[ $_ ] += $inc for 0 .. length( $line ) - 1; (where $line is the current input line). That numerical iterator costs only ~20 bytes regardless of the length of the string.
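In full loop form, the substr version might be sketched as (toy data assumed, as before):

```perl
use strict;
use warnings;

my @data = ( 'abc', 'abd' );    ## toy sample data, assumed for illustration
my $inc  = 1 / @data;

my %freq;
for my $line ( @data ) {
    ## a cheap numeric iterator instead of a list of single-char scalars
    $freq{ substr $line, $_, 1 }[ $_ ] += $inc for 0 .. length( $line ) - 1;
}
```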

My other version trades the (fairly expensive) call to substr and the iterator, for a very cheap operation chop and an increment. And a little obscurity.

It can however be made to be the fastest of the lot with a couple of simple changes:

for ( @data ) {                                  ## use $_ to alias the lines
    my $p = -1;                                  ## allow pre-increment, which is more efficient
    $freq{ chop() }[ ++$p ] += $inc while $_;    ## implicit arg for chop; avoid call to length
}

That might see you over the 6X mark. Or not.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
In the absence of evidence, opinion is indistinguishable from prejudice. Not understood.

Re^4: Weighted frequency of characters in an array of strings (Bug fix.)
by BrowserUk (Patriarch) on Jun 08, 2016 at 12:16 UTC

    The chop version is wrong: it trims chars from the end of the string but counts their positions from zero. This fixes that:

    for ( @data ) {                                  ## use $_ to alias the lines
        my $p = length();                            ## allow pre-decrement and count backwards
        $freq{ chop() }[ --$p ] += $inc while $p;    ## implicit arg for chop; avoid call to length
    }
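    A quick way to see that the fix is right: run the corrected chop loop on a copy (chop is destructive, so it would otherwise eat the input) and compare against the substr version. A sketch, with assumed toy data:

```perl
use strict;
use warnings;

my @data = ( 'abc', 'abd' );    ## toy sample data, assumed for illustration
my $inc  = 1 / @data;

## reference counts via substr
my %want;
for my $line ( @data ) {
    $want{ substr $line, $_, 1 }[ $_ ] += $inc for 0 .. length( $line ) - 1;
}

## corrected chop version; chop eats the strings, so work on a copy
my %freq;
my @copy = @data;
for ( @copy ) {
    my $p = length();
    $freq{ chop() }[ --$p ] += $inc while $p;
}
## %freq now matches %want: chars counted at their original positions
```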
