### Weighted frequency of characters in an array of strings

on Jun 07, 2016 at 16:32 UTC

K_Edw has asked for the wisdom of the Perl Monks concerning the following question:

I have a script which extracts equal length DNA-sequences at user-specified positions from a genome and then calculates the frequency at which each letter (A/G/C/T) occurred at each position.

The user-specified positions can also come with a frequency: how many times that position was seen.

If this frequency is ignored (i.e. a position-only analysis), the script runs at a satisfactory speed. It is when the frequencies are taken into account that it becomes considerably slower.

My approach for this has been:

```perl
if ($mode eq "Position") {
    push @data, $string;
} elsif ($mode eq "Freq") {
    push @data, $string for 1 .. $F[2];
}

my %freq;
for my $i (1 .. @data) {
    $freq{$1}[ $-[0] ] += 1/@data while $data[$i-1] =~ /(.)/g;
}
```

This is evidently inefficient and quite literal: the sequence is pushed onto the array freq times.

Is there a better way to do weighted frequency calculations and if so, how?

EDIT: The sequences can range from 30 to 250 characters. The weights average 26, but range from 5 to 3565 for one particular sample. In any given run, around 200,000 unique positions exist within the input data.
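One way to avoid the duplication entirely is to scale the increment by the observed count instead of pushing the sequence `$F[2]` times. A minimal, runnable sketch, assuming the input is reduced to `[$string, $count]` pairs (the data and variable names here are illustrative, not from the original script):

```perl
use strict;
use warnings;

# Hypothetical input: each record is [ sequence, observed count ]
my @data = ( [ "ACGT", 3 ], [ "AAGT", 1 ] );

my $total = 0;
$total += $_->[1] for @data;    # total weight, here 4

my %freq;
for my $rec (@data) {
    my ( $seq, $count ) = @$rec;
    my $inc = $count / $total;  # one weighted increment per record
    $freq{ substr $seq, $_, 1 }[$_] += $inc for 0 .. length($seq) - 1;
}

# $freq{A}[0] is 1 (both sequences start with A);
# $freq{C}[1] is 0.75 and $freq{A}[1] is 0.25
```

Each record is walked once regardless of its weight, so the cost no longer grows with the frequency values.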

Replies are listed 'Best First'.
Re: Weighted frequency of characters in an array of strings
by BrowserUk (Patriarch) on Jun 07, 2016 at 19:53 UTC

Either of these should be 7 to 8 times faster than your original. They're much the same on my machine, with the latter winning on later perls:

```perl
my %freq;
my $inc = 1 / @data;
for my $d ( @data ) {
#    $freq{ substr $d, $_, 1 }[ $_ ] += $inc for 0 .. length( $d )-1;
    my $p = 0;
    $freq{ chop $d }[ $p++ ] += $inc while length $d;
}
```

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
In the absence of evidence, opinion is indistinguishable from prejudice. Not understood.

Thanks! This is significantly faster (and also easier to read, in my opinion) when run on a real dataset:

Execution Time (Original): 293 seconds

Execution Time (New): 53 seconds

5.52x speed improvement.

Would you be able to explain why this approach is much faster (or how the old one was so slow)?

```perl
my %freq;
for my $i ( 1 .. @data ) {
    $freq{ $1 }[ $-[0] ] += 1/@data while $data[ $i-1 ] =~ /(.)/g;
}
```

For several reasons (in roughly descending order of penalty):

1. It starts the regex engine for every line.
   1. Just starting the regex engine, even with a simple pattern, is expensive.
   2. I'm not sure whether that simple pattern gets re-parsed each time, or if the parsing is cached; but it's still parsing.
   3. It uses capture brackets, which means that not only does `$1` have to be populated, but also `@+`, `@-`, `$&`, `` $` ``, `$'` and possibly `%+` ...
2. It performs the same invariant calculation (`1/@data`) for every line.

   Moving that out of the loop and using the precomputed invariant is simple and effective.
3. It does a completely unnecessary calculation (`$i-1`) for every line.

   Changing the loop parameters to `for my $i ( 0 .. $#data )` avoids that repeated subtraction.
4. It indexes into `@data` for every line.

   Why generate a list of integers, and then use those integers to index into the array, when you can access the array elements directly?

Most of those can be avoided like this:

```perl
my $inc = 1 / @data;                           ## do the invariant once
for ( @data ) {                                ## $_ aliases each line in turn; avoids both the $i-1 and indexing the array
    $freq{ $1 }[ $-[0] ] += $inc while /(.)/g; ## the regex engine operates directly on the $_ alias
}
```

But that still incurs the penalty of the regex engine to get the chars 1 at a time. A naive alternative is `... for split //;`, but that still involves the regex engine, albeit in a faster mode.

The regex can be avoided completely using `... for unpack '(a1)*', $_;`, but that template still requires parsing, albeit of a simpler language. The big penalty here is the construction of a long list of single chars, each in its own scalar, each requiring over 100 bytes of memory. That is a costly process.

Contrast that with the substr version: `$freq{ substr $d, $_, 1 }[ $_ ] += $inc for 0 .. length( $d )-1;`. Its numerical iterator costs only ~20 bytes regardless of the length of the string.
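For concreteness, the substr variant in full, as a minimal runnable sketch with made-up data:

```perl
use strict;
use warnings;

my @data = ( "ACGT", "AAGT" );
my %freq;
my $inc = 1 / @data;    # 0.5, computed once

for my $d (@data) {
    $freq{ substr $d, $_, 1 }[$_] += $inc for 0 .. length($d) - 1;
}

# $freq{A}[0] == 1, $freq{C}[1] == 0.5, $freq{A}[1] == 0.5
```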

My other version trades the (fairly expensive) call to substr and the iterator for a very cheap operation (chop) and an increment. And a little obscurity.

It can however be made to be the fastest of the lot with a couple of simple changes:

```perl
for ( @data ) {                                ## use $_ to alias the lines
    my $p = -1;                                ## allow pre-increment, which is more efficient
    $freq{ chop() }[ ++$p ] += $inc while $_;  ## implicit arg for chop; avoids the call to length
}
```

That might see you over the 6X mark. Or not.

In addition to `substr` being faster than a regex, your code is computing `1/@data` 200,000 times. Computing this value only once before looping over the data, and storing the result in a variable, is very likely to save significant time (though benchmarking and/or profiling would be needed to say how much). Storing the result of a computation to avoid re-computing it many times is called caching or memoization (see for example https://en.wikipedia.org/wiki/Memoization) and is a very common optimization technique.
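As a minimal illustration of that hoisting (made-up data; the point is only that the division happens a single time, not once per character):

```perl
use strict;
use warnings;

my @data = ("ACGT") x 4;

my $inc = 1 / @data;    # loop-invariant: computed once, here 0.25

my $sum = 0;
for my $seq (@data) {
    $sum += $inc for 1 .. length $seq;   # reuse the cached value per character
}
# $sum == 4: 4 sequences x 4 characters x 0.25
```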
Re: Weighted frequency of characters in an array of strings
by Laurent_R (Canon) on Jun 07, 2016 at 17:03 UTC
Hm, not entirely sure what will make things faster, but a few ideas:

- You probably don't need to store your data in an array and then go through the array; compute your counters right away while reading the data.
- Use a hash of counters: `$freq{$_}++;`
- Don't do the division for each single letter: compute the counters and do the four needed divisions only at the end; or calculate `1/@data` just once into a variable, and use the variable.
- Using `tr///` is likely to be faster than a regex for counting the letters. For example:

```perl
$c = "ACCCTGATTGC";
$d = $c =~ tr/A/A/;
print $d; # prints 2
```
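Extending that idea to all four bases, a sketch with made-up data; note that `tr///` counts occurrences across the whole string, so this gives overall base frequencies rather than per-position ones:

```perl
use strict;
use warnings;

my @seqs = ( "ACCCTGATTGC", "AAGT" );

my ( %count, $total );
for my $s (@seqs) {
    $count{A} += ( $s =~ tr/A// );   # tr/X// counts without modifying
    $count{C} += ( $s =~ tr/C// );
    $count{G} += ( $s =~ tr/G// );
    $count{T} += ( $s =~ tr/T// );
    $total    += length $s;
}

# divide only once, at the end
my %freq = map { $_ => $count{$_} / $total } keys %count;
```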
Re: Weighted frequency of characters in an array of strings
by Anonymous Monk on Jun 07, 2016 at 23:26 UTC

You could compress the sequences (substrings) into vectors like so:

```perl
my $bitvec = pack "H*", ($string =~ tr/AGCT/1248/r);
```
The AGCT values will be interleaved in the resulting bitvector. Better yet, transform the full sequence.
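To see what the packing produces: each base becomes a one-hot nibble (A=0001, G=0010, C=0100, T=1000). Note the `/r` flag on `tr///` needs perl 5.14 or later.

```perl
use strict;
use warnings;

my $string = "ACGT";
my $bitvec = pack "H*", ( $string =~ tr/AGCT/1248/r );

# "ACGT" -> "1428" -> nibbles 0001 0100 0010 1000
print unpack( "B*", $bitvec ), "\n";   # prints 0001010000101000
```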

Convert the bitvectors into integer vectors and sum them up. Better yet, if memory allows, create one big vector from the full sequence (with 4*length elements) as the first step.

Then, using an appropriate module with overloading, you might simply (completely untested code):

```perl
my @arrr = split //, unpack "B*", $bitvec;
my $histogram = Math::GSL::Vector->new();
my $sequence  = Math::GSL::Vector->new( \@arrr );

while ( ($pos, $freq) = each %points ) {
    my $view = gsl_vector_subvector( $sequence, 4 * $pos, 4 * $slen );

    $histogram += $view * $freq;
    $sum_freq  += $freq;
}

# normalize and uninterleave
$histogram *= 1 / $sum_freq;
@parts = part { $i++ % 4 } $histogram->as_list;
```
But then, there probably exist specific modules for histogramming...

Re: Weighted frequency of characters in an array of strings
by anonymized user 468275 (Curate) on Jun 08, 2016 at 12:40 UTC
It sounds like your input is in FASTQ format? If so, the optimal algorithm should ideally pass through the data only once.

Chopping from the back of each sequence in a single pass seems like a quick way, as opposed to, for example, putting the sequence through tr or some other regex processor more than once.

I am not sure what your target performance is, but I picked up an example fastq file of just over 78MB and put it through this simple piece of code that uses one-pass chopping. The fourteen seconds (on my 3.5GHz PC) suggest that 1 gigabyte would take about three minutes; note that the array of hashes does NOT grow as more sequences are processed, so the prediction is scalable.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use Data::Dumper;
my ($file) = glob '*.fastq';

open my $fh, '<', $file or die "Cannot open $file: $!";

my $analysis = []; # each element is a sequence position
                   # at which we will put a hash of freqs
print localtime() . "\n";
while ( my $hdr = <$fh> ) {
    my $seq = <$fh>;
    chomp $seq;
    analyse( $seq, $analysis );
    $seq = <$fh> for 0 .. 1; # discard the '+' separator and quality lines
}
print localtime() . "\n";
print Dumper($analysis);

sub analyse {
    my ($seq, $analysis) = @_;
    my $pos = length $seq;
    while ($pos) {
        $analysis->[--$pos]{ chop $seq }++;
    }
}
```
which gave the following output:

One world, one people

Re: Weighted frequency of characters in an array of strings
by Anonymous Monk on Jun 07, 2016 at 19:57 UTC

How about some real input data, and the actual output you want from it?

Re: Weighted frequency of characters in an array of strings
by Anonymous Monk on Jun 07, 2016 at 19:06 UTC

How long are the sequences, and how many are extracted? Also, how high does the weighting go?

The more details we get of your problem, the better solutions we may be able to find.

Updated the OP with additional information.

How many sequences in a run?

Node Type: perlquestion [id://1165092]
Front-paged by Corion