Re: The sum of absolute differences in the counts of chars in two strings.
by Corion (Patriarch) on Nov 19, 2011 at 15:41 UTC
If the alphabet used is known, most of the letters are likely to appear, and the strings are not too long, I think it would be worthwhile to investigate tr/// over split+count, potentially by generating one huge eval string:
my $code = join "\n;\n", map { sprintf '$count{ %s } = tr/%s/%s/', $_, $_, $_ } @alphabet;
my %count;
local $_ = $string1;
eval $code;
The second string counting could be rolled into a second eval:
my $code2 = join "\n+\n", map { sprintf 'abs($count{ %s } - tr/%s/%s/)', $_, $_, $_ } @alphabet;
local $_ = $string2;
my $sum = eval $code2;
But as the tr approach reads each string multiple times, whether this wins will really depend on the length of the strings and the size of the alphabet.
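For reference, here is a minimal, self-contained sketch along these lines, with the generated code compiled into a sub once and the strings passed as arguments (the a/c/g/t alphabet, the sub name and the argument handling are my own assumptions, not part of the snippets above):
use strict;
use warnings;

# Using the OP's DNA alphabet here; any known, fixed alphabet would do.
my @alphabet = qw( a c g t );

# Build the per-letter counting and summing code once...
my $count_code = join ";\n",
    map { sprintf q{$count{'%s'} = ( $x =~ tr/%s/%s/ )}, ($_) x 3 } @alphabet;
my $diff_code = join " +\n",
    map { sprintf q{abs( $count{'%s'} - ( $y =~ tr/%s/%s/ ) )}, ($_) x 3 } @alphabet;

# ...then compile it into a sub a single time and reuse it.
my $deltasum = eval 'sub { my ( $x, $y ) = @_; my %count; '
             . "$count_code;\n$diff_code }"
    or die $@;

print $deltasum->( 'aaaaacaacaaagcc', 'acaggtgacaaaaaa' ), "\n";    # prints 6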
Re: The sum of absolute differences in the counts of chars in two strings.
by Perlbotics (Archbishop) on Nov 19, 2011 at 16:47 UTC
Here's my brute-force try. I don't know whether the call overhead outweighs the gain.
I assumed a one-byte-is-one-character relationship (no wchars yet). HTH
use Inline C;
use strict;
use warnings;
use Time::HiRes qw(time);
my $str1 = "abcdefabc";
my $str2 = "xyzabcabc@@@";
my $now = time;
my $res;
$res = deltasum( $str1, $str2 ) for (1..1_000_000);
print "res=$res time=", ( time() - $now ), "us\n";
#res=9 time=0.819153070449829us
__END__
__C__
#define DEBUG(x)
int counter_tab[256];
int deltasum(char* a, char* b) {
    int i;
    int sum = 0;
    DEBUG( printf("IN: (%d:%s) (%d:%s)\n", strlen(a), a, strlen(b), b); )
    bzero( counter_tab, sizeof( counter_tab ) );
    for ( ; *a ; ++a ) ++counter_tab[ *a ];
    for ( ; *b ; ++b ) --counter_tab[ *b ];
    for ( i = 0; i < 256; ++i ) { /* maybe reduced to significant range? */
        sum += abs( counter_tab[i] );
        DEBUG( if ( counter_tab[i] ) printf( "'%c' (%3d) x %5d\n", i, i, counter_tab[i]); )
    }
    DEBUG( printf("OUT: %d\n", sum); )
    return sum;
}
Update: Moved i and sum to head of function to cope with ancient compilers (thanks to davido). Simplified sizeof() expression.
Debug sample:
IN: (9:abcdefabc) (12:xyzabcabc@@@)
'@' ( 64) x -3
'd' (100) x 1
'e' (101) x 1
'f' (102) x 1
'x' (120) x -1
'y' (121) x -1
'z' (122) x -1
OUT: 9
Update: Approx. 60% speed-up when limiting to [ACGTacgt], still room for improvement...
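If it helps for comparison, here is one way the [ACGTacgt] restriction might look (this is my own guess at the variant behind the speed-up, not the actual updated code): only the eight relevant table slots are cleared and summed, and other bytes in the input are tallied but never read.
use strict;
use warnings;
use Inline C => <<'END_C';
static const char *LETTERS = "ACGTacgt";
static int counter_tab[256];

/* Only the ACGTacgt slots are cleared and summed; any other bytes
   in the input land in slots that are never read. */
int deltasum_acgt( char *a, char *b ) {
    int i, d, sum = 0;
    for ( i = 0; LETTERS[i]; ++i ) counter_tab[ (unsigned char)LETTERS[i] ] = 0;
    for ( ; *a ; ++a ) ++counter_tab[ (unsigned char)*a ];
    for ( ; *b ; ++b ) --counter_tab[ (unsigned char)*b ];
    for ( i = 0; LETTERS[i]; ++i ) {
        d = counter_tab[ (unsigned char)LETTERS[i] ];
        sum += d < 0 ? -d : d;
    }
    return sum;
}
END_C

print deltasum_acgt( "aaaaacaacaaagcc", "acaggtgacaaaaaa" ), "\n";    # prints 6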
Re: The sum of absolute differences in the counts of chars in two strings.
by roboticus (Chancellor) on Nov 19, 2011 at 17:31 UTC
$ cat 938978.pl
#!/usr/bin/perl
use strict;
use warnings;
while (<DATA>) {
    chomp;
    my ($x, $y) = split /\s+/, $_;
    last if ! defined $y;
    print robo_1($x, $y), " <$x> <$y>\n";
}
sub robo_1 {
    my %h;
    $h{$_}++ for split //, shift;
    $h{$_}-- for split //, shift;
    my $t = 0;
    $t += ($_ > 0 ? $_ : -$_) for values %h;
    return $t;
}
__DATA__
123456 abcdefghijkl
ARF_ARF_ARF BARKBARKBARK
doom gloom
heartbreak anguish
$ perl 938978.pl
18 <123456> <abcdefghijkl>
11 <ARF_ARF_ARF> <BARKBARKBARK>
3 <doom> <gloom>
13 <heartbreak> <anguish>
...roboticus
When your only tool is a hammer, all problems look like your thumb.
Re: The sum of absolute differences in the counts of chars in two strings.
by mbethke (Hermit) on Nov 19, 2011 at 18:03 UTC
I just saw that Perlbotics has already posted an Inline C solution, but anyway, this one is a bit simpler:
use Inline C;
print char_freq_diff_sum("123456", "abcdefghijkl"),"\n";
__DATA__
__C__
#define abs(x) (x<0?-x:x) /* no need for math.h */
int char_freq_diff_sum(char *s, char *t) {
    int i, sum=0, freqs[256]={0};
    for(i=0; i<strlen(s); ++i) freqs[s[i]]++;
    for(i=0; i<strlen(t); ++i) freqs[t[i]]--;
    for(i=0; i<256; ++i) sum += abs(freqs[i]);
    return sum;
}
Of course it makes the same dangerous assumptions, like an 8-bit character set. Unicode from UCS-2 should be possible using the same principle, though, a couple of orders of magnitude slower but still much faster than pure Perl.
Do note that in C, strlen isn't a constant-time operation - it walks the string looking for a NUL byte. So, for speed, one either caches the result or does away with the index altogether:
while (*s) {freqs[*s++]++;}
while (*t) {freqs[*t++]--;}
Your freqs[256]={0}; is unfamiliar syntax to me. Some gcc modernism to initialize the array?
int freqs[256]={0};
That syntax has actually been standard since C89 (C99 added designated initializers on top of it). If the initializer list in curly brackets is shorter than the number of elements in the entity being initialized, all elements not enumerated in the list are initialized to zero. For example, int freqs[256]={1}; would initialize freqs[0] = 1, freqs[1] = 0, freqs[2] = 0, etc. gcc offers some additional features beyond that, but the syntax used here should be portable to any conforming C implementation.
I'm not sure when this array initialization was standardized, but in the almost 20 years since I learned C I haven't come across a compiler that didn't understand it.
You've got a point about the strlen. I originally had it in a separate function that took a const char* string argument; in that case any optimizer worth its name should notice it's an invariant and do the caching transparently. I haven't looked at the compiled code, so a non-const string may disable this optimization, although the code is simple enough that I'd expect the optimizer to notice the string is never modified inside the loop and hoist it anyway.
Re: The sum of absolute differences in the counts of chars in two strings.
by Limbic~Region (Chancellor) on Nov 19, 2011 at 17:24 UTC
BrowserUk,
How long are the strings (min, max, avg)? Will they be the same length? Are the characters single byte (8 bits)? Assuming 1 byte characters, is the alphabet of possible characters restricted or can we assume all 256 possibilities?
Also, can you give a manual example? I am concerned about misunderstanding the requirements.
aaaaacaacaaagcc :: a=>10 c=>4 g=>1 t=>0
acaggtgacaaaaaa :: a=>9 c=>2 g=>3 t=>1
absolute diffs :: 1 2 2 1
sum of diffs :: 6
It would be wrong to assume an alphabet of 4 even for genomic data.
BrowserUk,
I don't understand the example. Shouldn't the difference for t (t => 0 vs. t => 1) be 1, not 2?
Update: I am not going to have a chance to play with any of my ideas, so I will just share them here in case they are of any help. I was hoping there would be a way to "cancel out terms" so that there was less work to do. The first idea was to perform bitwise operations on the strings to find out which characters were in common and only count the remaining ones. The second was to process the string in chunks rather than characters. If I were writing a generic solution, though, I would likely go with Inline::C (array vs. hash), incrementing values for the first string, decrementing values for the second string, and summing the 256 indices for the result at the end.
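For what it's worth, a rough pure-Perl reading of the "cancel out terms" idea might look like this (my interpretation only, and it assumes equal-length single-byte strings): positions where the two strings already agree contribute nothing, so only the differing positions need counting.
sub deltasum_cancel {
    my ( $s, $t ) = @_;          # assumed equal length, single-byte characters
    my $mask = $s ^ $t;          # "\0" wherever the two strings already agree
    my %count;
    while ( $mask =~ /[^\0]/g ) {
        my $i = pos( $mask ) - 1;              # index of a differing position
        $count{ substr $s, $i, 1 }++;
        $count{ substr $t, $i, 1 }--;
    }
    my $sum = 0;
    $sum += abs $_ for values %count;
    return $sum;
}

print deltasum_cancel( 'aaaaacaacaaagcc', 'acaggtgacaaaaaa' ), "\n";    # prints 6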
How about counting the others (not agct) as an exception?
my $aa = "AGCTAAABBBCCC";
my %k  = ( a => "a", g => "g", c => "c", t => "t", else => "[^agct]" );
my %all;
while ( my ($k, $pattern) = each %k ) {
    $all{$k}++ while ($aa =~ m/$pattern/gi);
}
Re: The sum of absolute differences in the counts of chars in two strings.
by aaron_baugher (Curate) on Nov 19, 2011 at 19:21 UTC
I don't suppose there's anything particularly creative about my attempts, and I'm sure they won't compare to the speed of C. But I've been looking for a reason to learn to use Benchmark, so I thought I'd give this a try.
I tried three methods: counting the characters with tr inside an eval, counting with s///g, and splitting the strings into arrays and using grep. Somewhat surprisingly (to me, anyway, as someone who very rarely uses eval), the eval/tr method was by far the slowest. I assume that's from the overhead of eval, but as far as I know, that's necessary, since the variable has to be interpolated. Perhaps it would be faster, as someone upthread suggested, to put the whole thing into a single eval, rather than 52 separate eval calls, but I'm too attached to keeping eval usage as minimal as possible to feel good about that. The grep method was better, but not great, even though I split the strings outside the test. The easy winner was using an s///g regex to count the replacements like tr, but without the need for eval.
abaugher@bannor:~/work/perl/monks$ cat 938978.pl
#!/usr/bin/perl
use Modern::Perl;
use Benchmark qw(:all);
my( $aa, $bb ) = ('','');
$aa .= chr(65 + rand(26)) for (1..50);
$bb .= chr(65 + rand(26)) for (1..50);
say $aa;
say $bb;
my @aa = split //, $aa;
my @bb = split //, $bb;
cmpthese(100000, {
    'grep' => \&usinggrep,
    'tr'   => \&usingtr,
    'sg'   => \&usingsg,
});
sub usinggrep {
    my $sum = 0;
    for my $l ('A'..'Z') {
        my $acount = grep /$l/, @aa;
        my $bcount = grep /$l/, @bb;
        $sum += abs($acount-$bcount);
    }
    return $sum;
}
sub usingtr {
    my $sum = 0;
    for my $l ('A'..'Z') {
        my $acount = eval "'$aa' =~ tr[$l][$l]";
        my $bcount = eval "'$bb' =~ tr[$l][$l]";
        $sum += abs($acount-$bcount);
    }
    return $sum;
}
sub usingsg {
    my $sum = 0;
    for my $l ('A'..'Z') {
        my $acount = $aa =~ s[$l][$l]g || 0;
        my $bcount = $bb =~ s[$l][$l]g || 0;
        $sum += abs($acount-$bcount);
    }
    return $sum;
}
abaugher@bannor:~/work/perl/monks$ perl 938978.pl
FHORAICOBMUCUMNYLRCUMVAMGXRRCADIZVTZTRENIEOBGNXSQT
JIPPTFERERTBOQOPNQSGWDTTOZOTHNXPEKJACSXEQBOAPIMSHI
        Rate    tr  grep    sg
tr     924/s    --  -45%  -88%
grep  1682/s   82%    --  -79%
sg    7849/s  749%  367%    --
to put the whole thing into a single eval
You should. Try this; on my machine, it beats your grep solution:
sub usingtr_1e {
    eval join '+',
        map qq{abs(('$aa' =~ y/$_//)-('$bb' =~ y/$_//))},
            'A' .. 'Z';
}
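Going one step further (my own tweak, not choroba's code): the generated string can be compiled into a sub once, so the eval cost is paid a single time and the benchmark strings are passed in as arguments.
my $code = join ' + ',
    map { "abs( (\$x =~ y/$_//) - (\$y =~ y/$_//) )" } 'A' .. 'Z';
my $tr_sum = eval "sub { my ( \$x, \$y ) = \@_; $code }" or die $@;

print $tr_sum->( $aa, $bb ), "\n";    # uses the $aa and $bb from the benchmark above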
You forgot about the match operator:
sub match {
    my $sum = 0;
    for my $l ('A'..'Z') {
        my $acount = 0;
        ++$acount while ($aa =~ /$l/g);
        my $bcount = 0;
        ++$bcount while ($bb =~ /$l/g);
        $sum += abs($acount-$bcount);
    }
    return $sum;
}
sub usingtr_d {
my $a = $aa;
my $b = $bb;
# Delete ACGT first, assuming these are the
# most common characters.
abs($a =~ y/A//d - $b =~ y/A//d) +
abs($a =~ y/C//d - $b =~ y/C//d) +
abs($a =~ y/G//d - $b =~ y/G//d) +
abs($a =~ y/T//d - $b =~ y/T//d) +
abs($a =~ y/B// - $b =~ y/B//) +
abs($a =~ y/D// - $b =~ y/D//) +
abs($a =~ y/E// - $b =~ y/E//) +
abs($a =~ y/F// - $b =~ y/F//) +
abs($a =~ y/H// - $b =~ y/H//) +
abs($a =~ y/I// - $b =~ y/I//) +
abs($a =~ y/J// - $b =~ y/J//) +
abs($a =~ y/K// - $b =~ y/K//) +
abs($a =~ y/L// - $b =~ y/L//) +
abs($a =~ y/M// - $b =~ y/M//) +
abs($a =~ y/N// - $b =~ y/N//) +
abs($a =~ y/O// - $b =~ y/O//) +
abs($a =~ y/P// - $b =~ y/P//) +
abs($a =~ y/Q// - $b =~ y/Q//) +
abs($a =~ y/R// - $b =~ y/R//) +
abs($a =~ y/S// - $b =~ y/S//) +
abs($a =~ y/U// - $b =~ y/U//) +
abs($a =~ y/V// - $b =~ y/V//) +
abs($a =~ y/W// - $b =~ y/W//) +
abs($a =~ y/X// - $b =~ y/X//) +
abs($a =~ y/Y// - $b =~ y/Y//) +
abs($a =~ y/Z// - $b =~ y/Z//);
}
Benchmarks, including roboticus' robo_1, trizen's match, and choroba's tr_1e:
            Rate      tr   grep  tr_1e     sg  match  robo_1  tr_d
tr         843/s      --   -18%   -56%   -75%   -80%    -92%  -99%
grep      1022/s     21%     --   -47%   -70%   -76%    -90%  -99%
tr_1e     1915/s    127%    87%     --   -44%   -55%    -81%  -98%
sg        3419/s    306%   235%    79%     --   -20%    -66%  -97%
match     4291/s    409%   320%   124%    25%     --    -57%  -96%
robo_1    9984/s   1085%   877%   421%   192%   133%      --  -90%
tr_d    100985/s  11881%  9784%  5175%  2853%  2254%    911%    --
Tested against these strings, which are ~80% ACGT:
my @alpha = ('A'..'Z', qw[ A C G T ] x 20);
my $aa = join '', map $alpha[rand @alpha], 1..100;
my $bb = join '', map $alpha[rand @alpha], 1..100;
Re: The sum of absolute differences in the counts of chars in two strings.
by bluescreen (Friar) on Nov 19, 2011 at 17:08 UTC
Are the strings likely to be equal (or almost)? If so, you could use moving pointers and only do the counting when the characters at the pointers' position differ, saving some cycles.
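A rough pure-Perl sketch of that suggestion (my interpretation, assuming equal-length single-byte strings) might look like this:
sub deltasum_skip_equal {
    my ( $s, $t ) = @_;
    my %count;
    for my $i ( 0 .. length( $s ) - 1 ) {
        my ( $x, $y ) = ( substr( $s, $i, 1 ), substr( $t, $i, 1 ) );
        next if $x eq $y;          # identical positions cancel out
        $count{ $x }++;
        $count{ $y }--;
    }
    my $sum = 0;
    $sum += abs $_ for values %count;
    return $sum;
}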
Re: The sum of absolute differences in the counts of chars in two strings.
by sundialsvc4 (Abbot) on Nov 19, 2011 at 21:56 UTC
It just seems to me, and please don’t take this glibly or wrongly, that the first order of business would be to try to guess ... what would actually slow it down? I can’t think of anything...
If the set of characters is fixed, then a fixed-size array will accumulate counts with no timing difference between one character and another, and there is zero possibility of page-fault activity here. Likewise, I can scarcely imagine how the time required to simply loop through the entire arrays, each and every time, to calculate the differences could ever be objectionable ... or, for that matter, vary.
I mean, sure, you could think of “optimizations,” but in a fairly trivial case like this, they could wind up costing more time. A completely dumb, completely brute-force solution is going to crunch through any set of strings presented to it with an unchanging (and extremely rapid) cycle-time per character no matter what the data turns out to be.
In short, my initial reaction is that the most-obvious way is the best way. I genuinely suspect that any deviation from this straight-path just might wind up being significantly slower.
I mean, sure, you could think of “optimizations,” but in a fairly trivial case like this, they could wind up costing more time.
Time and again here, one or more of the monks have spotted a solution to a problem that at first sight appears intractable to optimisation.
For example: in Comparing a large set of DNA sequences, the OP describes an apparently simple process of cross-comparing 100,000 x 20-char strings: "It takes on the orders of days to process".
Various approaches to this problem have been suggested down the years, including the use of any of several well-known fuzzy-matching and edit-distance algorithms: String::Approx, Text::Levenshtein, Text::Soundex, Text::WagnerFischer and others. But these brute-force, char-by-char O(M*N) comparisons are horribly expensive -- even when coded in XS -- and often not that useful for the desired results either, as edit distances are quite a different metric from the required one: 'are these two strings the same except in at most n places?'.
Another frequently offered alternative is the use of regexes, perhaps programmatically generated. Whilst these work fine and beat the edit-distance algorithms hands down, they generally require quite-to-very complex regexes that, by requirement, contain many backtrack points, with the result that their runtime performance is understandably very poor.
The root of the problem is the O(N²/2) nature of the round-robin comparison process. The quadratic growth in the number of comparisons required means that even for relatively low numbers of strings -- and in genomic work, 100,000 is low -- every extra microsecond taken performing each comparison adds an hour and a half to the processing time.
And by the time you get to a still relatively small 1 million strings, each extra microsecond will cost you an extra 138 hours. Almost an extra week of waiting, not to mention the ~15 kWh of electricity used in the process.
It would likely not be at all obvious to you that using string-wise XOR would form the basis of the most efficient (pure Perl) method of doing this, but its use here reduces days of processing to a few hours.
Even less obvious is the likelihood that anyone would offer a solution that reduces the number of comparisons required from 5 billion fuzzy matches to the order of 300,000 O(1) lookups, yet here it is.
Whilst my question has yet to garner anything other than more efficiently implemented brute-force algorithms -- themselves very valuable in the absence of a different solution -- there are still at least a couple of avenues yet unexplored.
So, whilst I couldn't think of any game-changing algorithm or approach to the problem myself, I'm long enough in the tooth to know that it does no harm to ask, and that asking can often lead to surprises.
So, you'll forgive me -- or not -- if I file your "I can’t think of anything..." non-response under the heading of: "Saw the opportunity to garner a few XP, but couldn't think of anything useful to say, so I'll say just that, and wrap it in some other plausible but meaningless drivel."
Omitting ... please!! ... that final paragraph from your post, which was entirely unnecessary, I found your comments up to that point extremely interesting, and I immediately followed and read both the link that you gave in the second-from-last paragraph and the entire thread.
Please enlighten me here ... I am quite, quite serious ... what is it about your original description of the problem that I have overlooked? If the “alphabet” is a normal one, say no more than 256 elements, then we have a 256-element array set to all-zero, and we tally each character with zero variance in time based on character value. Then we loop through exactly 256 accumulators, adding them up and setting them back to zero as we go. And then we have our answer.
As do many of us, I have plenty of experience with (differing kinds of) “massive problems” in which very slight gains, repeated hundreds of millions of times if not many times more, make all the difference in the world. Which, having said that, “is neither here nor there.” So, I fully appreciate what you are saying right down to and including “15 kWh of electricity,” but I simply do not yet see what makes this particular problem “hard.”
I know that it must be, or you would not have said it. (I would be an idiot to question your knowledge or experience.) But something in the description, I guess, eluded me. What did I overlook? I really do want to know.
Re: The sum of absolute differences in the counts of chars in two strings.(Results)
by BrowserUk (Patriarch) on Nov 21, 2011 at 17:28 UTC
Here's my benchmark of a couple of PP and several I::C versions based (loosely) on the ideas posted in the thread, along with my chosen I::C version. It gains a little performance by using chars for the counting, since 127 chars is a reasonable maximum string length, and a little more by restricting the alphabet to printable ASCII chars only, which is also reasonable for the envisaged use. The benchmark results are:
        Rate     PP   trg    mb    Pb    JF    me
PP   0.168/s     --  -92%  -96%  -97%  -97%  -97%
trg   2.14/s  1170%    --  -55%  -58%  -59%  -68%
mb    4.72/s  2703%  121%    --   -8%  -10%  -30%
Pb    5.13/s  2948%  140%    9%    --   -3%  -23%
JF    5.27/s  3031%  147%   12%    3%    --  -21%
me    6.70/s  3881%  213%   42%   31%   27%    --
And the code:
#! perl -slw
use strict;
use Inline C => Config => BUILD_NOISY => 1;
use Inline C => <<'END_C', NAME => 'deltasum', CLEAN_AFTER_BUILD => 0;
// Assumes equal length ascii strings and no embedded nulls.
int deltasum( char *a, char *b ) {
signed char counts[256] = { 0, };
int i, sum = 0;
while( *a && *b ) ++counts[ *a++ ], --counts[ *b++ ];
for( i = 32; i < 127; ++i ) sum += abs( counts[ i ] );
return sum;
}
int Perlbotics(char* a, char* b) {
int counter_tab[256] = { 0 };
int i;
int sum = 0;
for ( ; *a ; ++a ) ++counter_tab[ *a ];
for ( ; *b ; ++b ) --counter_tab[ *b ];
for ( i = 0; i < 256; ++i ) { /* maybe reduced to significant range? */
sum += abs( counter_tab[i] );
}
return sum;
}
int mbethke(char *s, char *t) {
int i, sum=0, freqs[256] = { 0 };
for(i=0; i<strlen(s); ++i) freqs[s[i]]++;
for(i=0; i<strlen(t); ++i) freqs[t[i]]--;
for(i=0; i<256; ++i) sum += abs(freqs[i]);
return sum;
}
int JavaFan(char *s, char *t) {
int i, sum=0, freqs[256] = { 0 };
while (*s) {freqs[*s++]++;}
while (*t) {freqs[*t++]--;}
for(i=0; i<256; ++i) sum += abs(freqs[i]);
return sum;
}
END_C
use Data::Dump qw[ pp ];
use Benchmark qw[ cmpthese ];
sub deltaSum {
my( $sum, @counts ) = 0;
++$counts[ $_ ] for unpack 'C*', $_[0];
--$counts[ $_ ] for unpack 'C*', $_[1];
$sum += abs( $counts[ $_ ] // 0 ) for 32 .. 127;
return $sum;
}
sub genTr {
my $code = join "+", map{
"abs( \$_[0] =~ tr[$_][$_] - \$_[1] =~ tr[$_][$_] )"
} @_;
eval "sub{ $code; }";
}
chomp( our @s = <DATA> );
our $I //= -1;
cmpthese $I, {
PP => q[
my $total = 0;
for my $i ( 0 .. $#s ) {
$total += deltaSum( @s[ $i, $_ ] ) for $i+1 .. $#s;
}
print "deltaSum : $total" if $I == 1;
],
trg => q[
my $total = 0;
my $sub = genTr( qw[a c g t ] );
for my $i ( 0 .. $#s ) {
$total += $sub->( @s[ $i, $_ ] ) for $i+1 .. $#s;
}
print "tr_g : $total" if $I == 1;
],
me => q[
my $total = 0;
for my $i ( 0 .. $#s ) {
$total += deltasum( @s[ $i, $_ ] ) for $i+1 .. $#s;
}
print "deltasum : $total" if $I == 1;
],
Pb => q[
my $total = 0;
for my $i ( 0 .. $#s ) {
$total += Perlbotics( @s[ $i, $_ ] ) for $i+1 .. $#s;
}
print "Perbotics : $total" if $I == 1;
],
mb => q[
my $total = 0;
for my $i ( 0 .. $#s ) {
$total += mbethke( @s[ $i, $_ ] ) for $i+1 .. $#s;
}
print "mbethke : $total" if $I == 1;
],
JF => q[
my $total = 0;
for my $i ( 0 .. $#s ) {
$total += JavaFan( @s[ $i, $_ ] ) for $i+1 .. $#s;
}
print "JavaFan : $total" if $I == 1;
],
};
__DATA__
aaaaaaaaaaacccg
aaaaaaaaaaccccg
aaaaaaacccccccc
aaaaaaaaacccccg
aaaaaaaaaaaccgg
aaaaaaaaaacccgg
aaaaaaacccccccg
aaaaaaaaaccccgg
aaaaaaaaaaccggg
aaaaaaaacccccgg
aaaaaaccccccccg
aaaaaaaaacccggg
aaaaacccccccccg
aaaaaaaccccccgg
aaaaccccccccccg
aaaaaaaaccccggg
aaaaaaaaaacgggg
aaaaaacccccccgg
aaaaaaaaaacccct
aaaaaaaaaccgggg
aaaaaccccccccgg
aaaaaaacccccggg
aaaaaaaaaccccct
aaaaaaaaaaaccgt
aaaacccccccccgg
aaaaaaaacccgggg
aaaaaaccccccggg
aaaaaaaacccccct
aaaaaaaaaaaaggt
aaaaaaaaaacccgt
aaaccccccccccgg
aaaaacccccccggg
aaaaaaaaacggggg
aaaaaaaccccgggg
aaaaaaaccccccct
aaaaaaaaaccccgt
aaaaaaaaaaacggt
aaaaaaaaccggggg
aaaaccccccccggg
aaaaaacccccgggg
aaaaaacccccccct
aaaaaaaacccccgt
aaaaaaaaaaccggt
aaaaaccccccgggg
aaacccccccccggg
aaaaaaacccggggg
aaaaaaaccccccgt
aaaaaccccccccct
aaaaaaaaacccggt
aaaaaaaaaaagggt
aaaaaaaacgggggg
aaaacccccccgggg
aaccccccccccggg
aaaaaaccccggggg
aaaaaaaaccccggt
aaaaaaaaaacgggt
aaaaaacccccccgt
aaaacccccccccct
aaaaacccccggggg
aaaaaaaccgggggg
aaaccccccccgggg
aaaaaccccccccgt
aaaaaaacccccggt
aaaaaaaaaccgggt
aaaaccccccggggg
aaaaaaaaggggggg
aacccccccccgggg
aaaaaacccgggggg
aaaaaaaacccgggt
aaaaaaaaaaggggt
aaaacccccccccgt
aaaaaaccccccggt
aacccccccccccct
aaacccccccggggg
accccccccccgggg
aaaaaccccgggggg
aaaaaaaaaaccctt
aaaaaaacggggggg
aaaaacccccccggt
aaaaaaaccccgggt
aaaccccccccccgt
aaaaaaaaacggggt
aaccccccccggggg
aaaaaaaaacccctt
aaaacccccgggggg
aaaaaaccggggggg
aaaaaaaaaaacgtt
aacccccccccccgt
aaaaaacccccgggt
aaaaaaaaccggggt
aaaaccccccccggt
aaaaaaaaccccctt
aaaaaaagggggggg
acccccccccggggg
aaaccccccgggggg
aaaaacccggggggg
aaaaaaaaaaccgtt
aaacccccccccggt
aaaaaaacccggggt
aaaaaccccccgggt
aaaaaaaaagggggt
aaaaaaacccccctt
aacccccccgggggg
aaaaaacgggggggg
aaaaaaaaacccgtt
aaaaaaaaaaaggtt
aaaaccccggggggg
aaccccccccccggt
aaaaaaccccggggt
aaaaaaaacgggggt
aaaacccccccgggt
aaaaaaaaccccgtt
aaaaaaaaaacggtt
accccccccgggggg
aaaaaccgggggggg
aaacccccggggggg
aaaaaaccccccctt
aaaccccccccgggt
acccccccccccggt
aaaaacccccggggt
aaaaaaaccgggggt
aaaacccgggggggg
aaaaaaacccccgtt
aaaaaaaaaccggtt
aaaaaaggggggggg
cccccccccgggggg
aaccccccggggggg
aaaaacccccccctt
aaaaaacccgggggt
aaaaaaaaggggggt
aaaaccccccggggt
aacccccccccgggt
acccccccggggggg
aaaccccgggggggg
aaaaaaaaaagggtt
aaaaaaaacccggtt
aaaaaaccccccgtt
aaaaacggggggggg
aaaaccccccccctt
aaaaaccccgggggt
aaacccccccggggt
accccccccccgggt
aaaaaaacggggggt
aaaaacccccccgtt
aaaaccggggggggg
aacccccgggggggg
aaacccccccccctt
aaaaaaaaacgggtt
ccccccccggggggg
aaaaaaaccccggtt
aaaaaaaaaaacttt
aaccccccccggggt
aaaaaaccggggggt
aaaacccccgggggt
aaacccggggggggg
accccccgggggggg
aaaaccccccccgtt
aaccccccccccctt
aaaaaaaaccgggtt
aaaaaacccccggtt
aaaaaaaaaaccttt
aaaaacccggggggt
aaaaaaagggggggt
aaaccccccgggggt
acccccccccggggt
aaccccggggggggg
aaaacgggggggggg
aaaaaccccccggtt
aaacccccccccgtt
aaaaaaacccgggtt
cccccccgggggggg
aaaaaaaaaggggtt
aaaaaaaaacccttt
aaaaccccggggggt
aaaaaacgggggggt
aacccccccgggggt
aaccccccccccgtt
aaaacccccccggtt
aaaaaaaacggggtt
aaaccgggggggggg
aaaaaaccccgggtt
acccccggggggggg
aaaaaaaaaacgttt
aaaaaaaaccccttt
aaaaaccgggggggt
aaacccccggggggt
accccccccgggggt
aaaaacccccgggtt
acccccccccccgtt
aaaaaaaccggggtt
aaaccccccccggtt
aacccgggggggggg
aaaaaaaaaccgttt
aaccccccggggggt
cccccccccgggggt
aaaaaaggggggggt
aaaaaaacccccttt
aaaacccgggggggt
accccgggggggggg
aacccccccccggtt
aaaaaacccggggtt
aaaaccccccgggtt
aaaaaaaagggggtt
aaaaaaccccccttt
aaaaaaaaaaggttt
aaaaacggggggggt
aaaccccgggggggt
aaaaaaaacccgttt
acccccccggggggt
aaaaaccccggggtt
aaaaaaacgggggtt
aaccggggggggggg
accccccccccggtt
aaacccccccgggtt
aaaaaaaccccgttt
ccccccccggggggt
aaaaccggggggggt
aacccccgggggggt
aaaaaaaaacggttt
aaaaacccccccttt
aaccccccccgggtt
cccccccccccggtt
aaaaaaccgggggtt
aaaacccccggggtt
aaacccggggggggt
aaaaagggggggggt
accccccgggggggt
aaaaaacccccgttt
aaaaaaaaccggttt
aaaaccccccccttt
aaaccccccggggtt
acccccccccgggtt
aaaaacccgggggtt
aaaaaaaggggggtt
cccccccgggggggt
aaaaaaacccggttt
aaaaaaaaagggttt
aaaaaccccccgttt
aaacccccccccttt
aaaacgggggggggt
aaccccggggggggt
aaaaccccgggggtt
aaaaaacggggggtt
ccccccccccgggtt
aacccccccggggtt
aaaccgggggggggt
aaccccccccccttt
aaaaaaaacgggttt
aaaacccccccgttt
acccccggggggggt
aaaaaaccccggttt
aaaaaaaaaactttt
aaaaaccggggggtt
accccccccggggtt
aaacccccgggggtt
aaaccccccccgttt
aaaaaaaccgggttt
aaaaggggggggggt
aaaaacccccggttt
aacccgggggggggt
ccccccggggggggt
cccccccccggggtt
aaaaaagggggggtt
aaccccccgggggtt
aaaaaaaaacctttt
aaaacccggggggtt
aaaaccccccggttt
aaaaaacccgggttt
aacccccccccgttt
aaacggggggggggt
aaaaaaaaggggttt
accccgggggggggt
aaaaaaaaaagtttt
aaaccccggggggtt
aaaaaaaaccctttt
acccccccgggggtt
aaaaacgggggggtt
aaaaaccccgggttt
aaccggggggggggt
accccccccccgttt
aaacccccccggttt
aaaaaaacggggttt
ccccccccgggggtt
aaaaaaacccctttt
aaaaccgggggggtt
aacccccggggggtt
aaaaaaaaacgtttt
aaaacccccgggttt
aaaaaaccggggttt
acccggggggggggt
aaccccccccggttt
aaaaaaccccctttt
aaaaaggggggggtt
aaaaaaaaccgtttt
accccccggggggtt
aaacccgggggggtt
acccccccccggttt
aaaaacccggggttt
aaaccccccgggttt
aaaaaaagggggttt
ccccggggggggggt
aacgggggggggggt
cccccccggggggtt
aaaaaaaaaggtttt
aaaacggggggggtt
aaaaacccccctttt
aaccccgggggggtt
aaaaaaacccgtttt
aacccccccgggttt
aaaaaacgggggttt
ccccccccccggttt
aaaaccccggggttt
aaaaaaaacggtttt
aaaccggggggggtt
aaaaccccccctttt
aaaaaaccccgtttt
acccccgggggggtt
aaaaaccgggggttt
aaacccccggggttt
accccccccgggttt
aaaaaaaccggtttt
aaacccccccctttt
aaaaacccccgtttt
ccccccgggggggtt
aaaagggggggggtt
aacccggggggggtt
acggggggggggggt
aaaaaaggggggttt
aaaacccgggggttt
cccccccccgggttt
aaccccccggggttt
accccggggggggtt
aaccccccccctttt
aaaaaaaagggtttt
aaaaccccccgtttt
aaaaaacccggtttt
aaacgggggggggtt
aaaccccgggggttt
acccccccggggttt
aaaaacggggggttt
aaaaaaacgggtttt
aaaaaccccggtttt
aaccgggggggggtt
cccccggggggggtt
aaacccccccgtttt
acccccccccctttt
aaaaccggggggttt
aacccccgggggttt
ccccccccggggttt
aaaaaaaaacttttt
aaccccccccgtttt
acccgggggggggtt
ccccccccccctttt
aaaaaaccgggtttt
aaaacccccggtttt
aaaggggggggggtt
aaaaaaaaccttttt
aaaaagggggggttt
aaacccggggggttt
accccccgggggttt
acccccccccgtttt
aacggggggggggtt
ccccgggggggggtt
aaaaacccgggtttt
aaaaaaaggggtttt
aaaccccccggtttt
aaccccggggggttt
cccccccgggggttt
aaaaaaaaagttttt
aaaacgggggggttt
aaaaaaacccttttt
ccccccccccgtttt
accggggggggggtt
aacccccccggtttt
aaaaaacggggtttt
aaaaccccgggtttt
aaaaaaaacgttttt
acccccggggggttt
aaaccgggggggttt
aaaaaaccccttttt
accccccccggtttt
aagggggggggggtt
aaacccccgggtttt
aaaaaccggggtttt
aacccgggggggttt
aaaaggggggggttt
aaaaacccccttttt
ccccccggggggttt
aaaaaaaccgttttt
aaaaaagggggtttt
acgggggggggggtt
aaccccccgggtttt
cccccccccggtttt
aaaacccggggtttt
aaaaccccccttttt
accccgggggggttt
aaaaaaaaggttttt
aaaaaacccgttttt
aaacggggggggttt
ccgggggggggggtt
acccccccgggtttt
aaaccccggggtttt
aaaaacgggggtttt
aaaaaccccgttttt
aaacccccccttttt
cccccgggggggttt
aaccggggggggttt
aaaaaaacggttttt
aacccccggggtttt
aaaaccgggggtttt
ccccccccgggtttt
acccggggggggttt
aaaacccccgttttt
aaccccccccttttt
aaaaaaccggttttt
aaagggggggggttt
aaacccgggggtttt
accccccggggtttt
aaaaaggggggtttt
ccccggggggggttt
aaaccccccgttttt
acccccccccttttt
aaaaaaagggttttt
aacgggggggggttt
aaaaacccggttttt
aaaacggggggtttt
aaccccgggggtttt
cccccccggggtttt
aaaaccccggttttt
accgggggggggttt
ccccccccccttttt
aaaaaacgggttttt
aacccccccgttttt
acccccgggggtttt
aaaaaaaactttttt
aaaccggggggtttt
aaacccccggttttt
cccgggggggggttt
aaggggggggggttt
accccccccgttttt
aaaaaccgggttttt
aaaagggggggtttt
aacccggggggtttt
aaaaaaacctttttt
ccccccgggggtttt
aaaaaaggggttttt
cccccccccgttttt
acggggggggggttt
aaccccccggttttt
aaaacccgggttttt
accccggggggtttt
aaacgggggggtttt
aaaaaaaagtttttt
aaaaaaccctttttt
acccccccggttttt
ccggggggggggttt
aaaaacggggttttt
aaaccccgggttttt
aaaaaaacgtttttt
aaaaacccctttttt
cccccggggggtttt
aaccgggggggtttt
aaaaccggggttttt
ccccccccggttttt
aacccccgggttttt
aaaggggggggtttt
aaaaccccctttttt
acccgggggggtttt
aaaaaaccgtttttt
aaaaagggggttttt
cgggggggggggttt
accccccgggttttt
aaacccggggttttt
aaacccccctttttt
aacggggggggtttt
aaaaacccgtttttt
ccccgggggggtttt
aaaaaaaggtttttt
cccccccgggttttt
aaccccggggttttt
aaaacgggggttttt
aaccccccctttttt
aaaaaacggtttttt
accggggggggtttt
aaaaccccgtttttt
aaaccgggggttttt
acccccggggttttt
acccccccctttttt
aagggggggggtttt
aaaaaccggtttttt
cccggggggggtttt
aaacccccgtttttt
aacccgggggttttt
aaaaggggggttttt
ccccccggggttttt
aaaaaagggtttttt
aaaacccggtttttt
acgggggggggtttt
aaccccccgtttttt
aaaaaaaattttttt
aaacggggggttttt
accccgggggttttt
aaaccccggtttttt
ccgggggggggtttt
acccccccgtttttt
aaaaacgggtttttt
aaccggggggttttt
aaaaaaacttttttt
cccccgggggttttt
aaaaccgggtttttt
ccccccccgtttttt
aggggggggggtttt
aacccccggtttttt
aaaaaaccttttttt
acccggggggttttt
aaagggggggttttt
cggggggggggtttt
aaaaaggggtttttt
aaacccgggtttttt
accccccggtttttt
aacgggggggttttt
ccccggggggttttt
aaaaacccttttttt
aaaaaaagttttttt
aaccccgggtttttt
cccccccggtttttt
aaaacggggtttttt
aaaaaacgttttttt
accgggggggttttt
aaaaccccttttttt
acccccgggtttttt
aaaccggggtttttt
cccgggggggttttt
aaaaaccgttttttt
aaggggggggttttt
aaacccccttttttt
ccccccgggtttttt
aacccggggtttttt
aaaagggggtttttt
aaaacccgttttttt
aaaaaaggttttttt
acggggggggttttt
aaccccccttttttt
accccggggtttttt
aaacgggggtttttt
aaaaacggttttttt
ccggggggggttttt
aaaccccgttttttt
acccccccttttttt
cccccggggtttttt
aaccgggggtttttt
aaaaccggttttttt
agggggggggttttt
ccccccccttttttt
aacccccgttttttt
aaaggggggtttttt
acccgggggtttttt
cgggggggggttttt
aaacccggttttttt
accccccgttttttt
aaaaagggttttttt
ccccgggggtttttt
aacggggggtttttt
aaaaaaatttttttt
cccccccgttttttt
aaccccggttttttt
aaaacgggttttttt
accggggggtttttt
aaaaaactttttttt
ggggggggggttttt
acccccggttttttt
aaaccgggttttttt
cccggggggtttttt
aaaaacctttttttt
aagggggggtttttt
aaaaggggttttttt
ccccccggttttttt
aacccgggttttttt
aaaaaagtttttttt
acgggggggtttttt
aaaaccctttttttt
aaacggggttttttt
accccgggttttttt
ccgggggggtttttt
aaaaacgtttttttt
aaacccctttttttt
aaccggggttttttt
cccccgggttttttt
aaaaccgtttttttt
aaccccctttttttt
aggggggggtttttt
acccggggttttttt
aaagggggttttttt
aaacccgtttttttt
cggggggggtttttt
acccccctttttttt
aaaaaggtttttttt
ccccggggttttttt
aacgggggttttttt
aaaacggtttttttt
aaccccgtttttttt
ccccccctttttttt
accgggggttttttt
aaaccggtttttttt
gggggggggtttttt
acccccgtttttttt
aaggggggttttttt
cccgggggttttttt
aaaagggtttttttt
aacccggtttttttt
ccccccgtttttttt
acggggggttttttt
accccggtttttttt
aaacgggtttttttt
aaaaacttttttttt
ccggggggttttttt
cccccggtttttttt
aaccgggtttttttt
aaaaccttttttttt
agggggggttttttt
acccgggtttttttt
aaaggggtttttttt
aaaaagttttttttt
aaacccttttttttt
cgggggggttttttt
ccccgggtttttttt
aacggggtttttttt
aaaacgttttttttt
aaccccttttttttt
accggggtttttttt
aaaccgttttttttt
acccccttttttttt
aagggggtttttttt
cccggggtttttttt
aacccgttttttttt
aaaaggttttttttt
acgggggtttttttt
aaacggttttttttt
accccgttttttttt
ccgggggtttttttt
cccccgttttttttt
aaccggttttttttt
aggggggtttttttt
aaagggttttttttt
acccggttttttttt
cggggggtttttttt
aacgggttttttttt
ccccggttttttttt
aaaactttttttttt
accgggttttttttt
aaacctttttttttt
gggggggtttttttt
aaggggttttttttt
cccgggttttttttt
aaaagtttttttttt
aaccctttttttttt
acggggttttttttt
aaacgtttttttttt
acccctttttttttt
ccggggttttttttt
aaccgtttttttttt
agggggttttttttt
acccgtttttttttt
aaaggtttttttttt
aacggtttttttttt
accggtttttttttt
ggggggttttttttt
aagggtttttttttt
cccggtttttttttt
acgggtttttttttt
aaacttttttttttt
ccgggtttttttttt
aaccttttttttttt
aggggtttttttttt
cggggtttttttttt
aacgttttttttttt
accgttttttttttt
gggggtttttttttt
aaggttttttttttt
ccggttttttttttt
cgggttttttttttt
acctttttttttttt
ggggttttttttttt
Re: The sum of absolute differences in the counts of chars in two strings.
by TJPride (Pilgrim) on Nov 24, 2011 at 16:47 UTC
Assuming the strings are fairly short and contain a limited number of unique characters, which is likely in this case:
print diffCount('aaaaacaacaaagcc', 'acaggtgacaaaaaa');
sub diffCount {
    my (%counts, $sum);
    $counts{substr($_[0], $_, 1)}++ for 0..(length($_[0]) - 1);
    $counts{substr($_[1], $_, 1)}-- for 0..(length($_[1]) - 1);
    $sum += abs $_ for values %counts;
    return $sum;
}
I tried other methods using split, regex, etc., but this benchmarked the fastest. Obviously, doing the same thing in C or C++ would be even faster.