### Weighted frequency of characters in an array of strings

on Jun 07, 2016 at 16:32 UTC

K_Edw has asked for the wisdom of the Perl Monks concerning the following question:

I have a script which extracts equal length DNA-sequences at user-specified positions from a genome and then calculates the frequency at which each letter (A/G/C/T) occurred at each position.

The user-specified positions can also come with a frequency: how many times that position was seen.

If this frequency is ignored (i.e. a position-only analysis), the script runs at a satisfactory speed. It is when the frequencies are taken into account that it becomes considerably slower.

My approach for this has been:

```perl
if ($mode eq "Position") {
    push @data, $string;
} elsif ($mode eq "Freq") {
    push @data, $string for 1 .. $F[2];
}

my %freq;
for my $i (1 .. @data) {
    $freq{$1}[ $-[0] ] += 1/@data while $data[$i-1] =~ /(.)/g;
}
```

This is evidently inefficient and quite literal: the sequence is pushed onto the array freq times.

Is there a better way to do weighted frequency calculations and if so, how?

EDIT: The sequences can range from 30 to 250 characters. The weights average 26, but range from 5 to 3565 for one particular sample. In any given run, around 200,000 unique positions exist within the input data.
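One way to avoid the duplication entirely is to scale the increment by the observed count instead of pushing the sequence `$F[2]` times. A minimal, runnable sketch, assuming the input is reduced to `[$string, $count]` pairs (the data and variable names here are illustrative, not from the original script):

```perl
use strict;
use warnings;

# Hypothetical input: each record is [ sequence, observed count ]
my @data = ( [ "ACGT", 3 ], [ "AAGT", 1 ] );

my $total = 0;
$total += $_->[1] for @data;    # total weight, here 4

my %freq;
for my $rec (@data) {
    my ( $seq, $count ) = @$rec;
    my $inc = $count / $total;  # one weighted increment per record
    $freq{ substr $seq, $_, 1 }[$_] += $inc for 0 .. length($seq) - 1;
}

# $freq{A}[0] is 1 (both sequences start with A);
# $freq{C}[1] is 0.75 and $freq{A}[1] is 0.25
```

Each record is walked once regardless of its weight, so the cost no longer grows with the frequency values.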

Replies are listed 'Best First'.
Re: Weighted frequency of characters in an array of strings
by BrowserUk (Patriarch) on Jun 07, 2016 at 19:53 UTC

Either of these should be 7 to 8 times faster than your original. They're much the same on my machine, with the latter winning on later perls:

```perl
my %freq;
my $inc = 1 / @data;
for my $d ( @data ) {
#    $freq{ substr $d, $_, 1 }[ $_ ] += $inc for 0 .. length( $d )-1;
    my $p = 0;
    $freq{ chop $d }[ $p++ ] += $inc while length $d;
}
```

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
In the absence of evidence, opinion is indistinguishable from prejudice. Not understood.

Thanks! This is significantly faster (and also easier to read, in my opinion) when run on a real dataset:

Execution Time (Original): 293 seconds

Execution Time (New): 53 seconds

5.52x speed improvement.

Would you be able to explain why this approach is much faster (or how the old one was so slow)?

```perl
my %freq;
for my $i ( 1 .. @data ) {
    $freq{ $1 }[ $-[0] ] += 1/@data while $data[ $i-1 ] =~ /(.)/g;
}
```

For several reasons (in roughly descending order of penalty):

1. It starts the regex engine for every line.
   1. Just starting the regex engine, even with a simple pattern, is expensive.
   2. I'm not sure whether that simple pattern gets re-parsed each time, or if the parsing is cached; but it's still parsing.
   3. It uses capture brackets, which means that not only does `$1` have to be populated, but also `@+`, `@-`, `$&`, `` $` ``, `$'` and possibly `%+` ...
2. It performs the same invariant calculation (`1/@data`) for every line.

   Moving that out of the loop and using the precomputed invariant is simple and effective.
3. It does a completely unnecessary calculation (`$i-1`) for every line.

   Changing the loop parameters to `for my $i ( 0 .. $#data )` avoids that repeated subtraction.
4. It indexes into `@data` for every line.

   Why generate a list of integers, and then use those integers to index into the array, when you can access the array elements directly?

Most of those can be avoided like this:

```perl
my $inc = 1 / @data;                           ## do the invariant once
for ( @data ) {                                ## $_ aliases each line in turn; avoids both the $i-1 and indexing the array
    $freq{ $1 }[ $-[0] ] += $inc while /(.)/g; ## the regex engine operates directly on the $_ alias
}
```

But that still incurs the penalty of the regex engine to get the chars 1 at a time. A naive alternative is `... for split //;`, but that still involves the regex engine, albeit in a faster mode.

The regex can be avoided completely using `... for unpack '(a1)*', $_;`, but that template still requires parsing, albeit of a simpler language. The big penalty here is the construction of a long list of single chars, each in its own scalar, each requiring over 100 bytes of memory. That is a costly process.

Contrast that with the substr version: `$freq{ substr $d, $_, 1 }[ $_ ] += $inc for 0 .. length( $d )-1;`. Its numerical iterator costs only ~20 bytes regardless of the length of the string.
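For concreteness, the substr variant in full, as a minimal runnable sketch with made-up data:

```perl
use strict;
use warnings;

my @data = ( "ACGT", "AAGT" );
my %freq;
my $inc = 1 / @data;    # 0.5, computed once

for my $d (@data) {
    $freq{ substr $d, $_, 1 }[$_] += $inc for 0 .. length($d) - 1;
}

# $freq{A}[0] == 1, $freq{C}[1] == 0.5, $freq{A}[1] == 0.5
```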

My other version trades the (fairly expensive) call to substr and the iterator for a very cheap operation (chop) and an increment. And a little obscurity.

It can however be made to be the fastest of the lot with a couple of simple changes:

```perl
for ( @data ) {                                ## use $_ to alias the lines
    my $p = -1;                                ## allow pre-increment, which is more efficient
    $freq{ chop() }[ ++$p ] += $inc while $_;  ## implicit arg for chop; avoids the call to length
}
```

That might see you over the 6X mark. Or not.

In addition to `substr` being faster than a regex, your code is computing `1/@data` 200,000 times. Computing this value only once before looping over the data, and storing the result in a variable, is very likely to save significant time (though benchmarking and/or profiling would be needed to say how much). Storing the result of a computation to avoid re-computing it many times is called caching or memoization (see for example https://en.wikipedia.org/wiki/Memoization) and is a very common optimization technique.
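As a minimal illustration of that hoisting (made-up data; the point is only that the division happens a single time, not once per character):

```perl
use strict;
use warnings;

my @data = ("ACGT") x 4;

my $inc = 1 / @data;    # loop-invariant: computed once, here 0.25

my $sum = 0;
for my $seq (@data) {
    $sum += $inc for 1 .. length $seq;   # reuse the cached value per character
}
# $sum == 4: 4 sequences x 4 characters x 0.25
```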
Re: Weighted frequency of characters in an array of strings
by Laurent_R (Canon) on Jun 07, 2016 at 17:03 UTC
Hm, not entirely sure what will make things faster, but a few ideas:

- You probably don't need to store your data in an array and then go through the array; compute your counters right away while reading the data.
- Use a hash of counters: `$freq{$_}++;`
- Don't do the division for each single letter: compute the counters and do the four needed divisions only at the end; or calculate `1/@data` just once into a variable, and use the variable.
- Using `tr///` is likely to be faster than a regex for counting the letters. For example:

```perl
$c = "ACCCTGATTGC";
$d = $c =~ tr/A/A/;
print $d; # prints 2
```
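Extending that idea to all four bases, a sketch with made-up data; note that `tr///` counts occurrences across the whole string, so this gives overall base frequencies rather than per-position ones:

```perl
use strict;
use warnings;

my @seqs = ( "ACCCTGATTGC", "AAGT" );

my ( %count, $total );
for my $s (@seqs) {
    $count{A} += ( $s =~ tr/A// );   # tr/X// counts without modifying
    $count{C} += ( $s =~ tr/C// );
    $count{G} += ( $s =~ tr/G// );
    $count{T} += ( $s =~ tr/T// );
    $total    += length $s;
}

# divide only once, at the end
my %freq = map { $_ => $count{$_} / $total } keys %count;
```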
Re: Weighted frequency of characters in an array of strings
by Anonymous Monk on Jun 07, 2016 at 23:26 UTC

You could compress the sequences (substrings) into vectors like so:

```perl
my $bitvec = pack "H*", ($string =~ tr/AGCT/1248/r);
```
The AGCT values will be interleaved in the resulting bitvector. Better yet, transform the full sequence.
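To see what the packing produces: each base becomes a one-hot nibble (A=0001, G=0010, C=0100, T=1000). Note the `/r` flag on `tr///` needs perl 5.14 or later.

```perl
use strict;
use warnings;

my $string = "ACGT";
my $bitvec = pack "H*", ( $string =~ tr/AGCT/1248/r );

# "ACGT" -> "1428" -> nibbles 0001 0100 0010 1000
print unpack( "B*", $bitvec ), "\n";   # prints 0001010000101000
```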

Convert the bitvectors into integer vectors and sum them up. Better yet, if memory allows, create one big vector from the full sequence (with 4*length elements) as the first step.

Then, using an appropriate module with overloading, you might simply (completely untested code):

```perl
my @arrr = split //, unpack "B*", $bitvec;
my $histogram = Math::GSL::Vector->new();
my $sequence  = Math::GSL::Vector->new( \@arrr );

while ( ($pos, $freq) = each %points ) {
    my $view = gsl_vector_subvector( $sequence, 4 * $pos, 4 * $slen );

    $histogram += $view * $freq;
    $sum_freq  += $freq;
}

# normalize and uninterleave
$histogram *= 1 / $sum_freq;
@parts = part { $i++ % 4 } $histogram->as_list;
```
But then, there probably exist specific modules for histogramming...

Re: Weighted frequency of characters in an array of strings
by anonymized user 468275 (Curate) on Jun 08, 2016 at 12:40 UTC
It sounds like your input is in FASTQ format? If so, the optimal algorithm should ideally pass through the data only once.

Chopping from the back of each sequence in a single pass seems like a quick way, as opposed to, for example, putting the sequence through tr or some other regex processor more than once.

I am not sure what your target performance is, but I picked up an example fastq file of just over 78MB and put it through this simple piece of code that uses one-pass chopping. The fourteen seconds (on my 3.5GHz PC) suggest that 1 gigabyte would take about three minutes; note that the array of hashes does NOT grow as more sequences are processed, so the prediction is scalable.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use Data::Dumper;
my ($file) = glob '*.fastq';

open my $fh, '<', $file or die "Cannot open $file: $!";

my $analysis = []; # each element is a sequence position
                   # at which we will put a hash of freqs
print localtime() . "\n";
while ( my $hdr = <$fh> ) {
    my $seq = <$fh>;
    chomp $seq;
    analyse( $seq, $analysis );
    $seq = <$fh> for 0 .. 1; # discard the '+' separator and quality lines
}
print localtime() . "\n";
print Dumper($analysis);

sub analyse {
    my ($seq, $analysis) = @_;
    my $pos = length $seq;
    while ($pos) {
        $analysis->[--$pos]{ chop $seq }++;
    }
}
```
which gave the following output:

One world, one people

Re: Weighted frequency of characters in an array of strings
by Anonymous Monk on Jun 07, 2016 at 19:57 UTC

How about some real input data, and the actual output you want from it?

Re: Weighted frequency of characters in an array of strings
by Anonymous Monk on Jun 07, 2016 at 19:06 UTC

How long are the sequences, and how many are extracted? Also, how high does the weighting go?

The more details we get of your problem, the better solutions we may be able to find.

Updated the OP with additional information.

How many sequences in a run?

Node Type: perlquestion [id://1165092]
Front-paged by Corion