BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

Given two strings:

  1. count the number of each character within each.
  2. then find the absolute difference of the counts for each character.
  3. Then sum those differences.

This is not a difficult process. Split the strings into characters, use a couple of hashes or arrays to accumulate the counts. Iterate those and perform the subtractions. Sum the differences.

But I need to do this many times, so doing it efficiently is critical. No code to avoid influencing the creativity. Suggestions?


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
  • Comment on The sum of absolute differences in the counts of chars in two strings.

Replies are listed 'Best First'.
Re: The sum of absolute differences in the counts of chars in two strings.
by Corion (Patriarch) on Nov 19, 2011 at 15:41 UTC

    If the used alphabet is known and most of the letters are likely to be used and the strings are not too long, I think that it would be worthwhile to investigate tr/// over split+count, potentially by generating one huge eval string:

    my $code = join "\n;\n", map {sprintf '$count{ $_ } = tr/%s/%s/', $_, +$_ } @alphabet; my %count; local $_ = $string1; eval $code;

    The second string counting could be rolled into a second eval:

    my $code2 = join "\n+\n", map {sprintf 'abs($count{ %s } - tr/%s/%s/)' +, $_, $_, $_ } @alphabet; local $_ = $string2; eval $code2;

    But as the tr-approach reads the strings multiple times, this will really depend on the length of the strings and the size of the alphabet.

Re: The sum of absolute differences in the counts of chars in two strings.
by Perlbotics (Archbishop) on Nov 19, 2011 at 16:47 UTC

    Here's my brute-force try. Don't know if the call-overhead is bigger than the gain? I assumed a one-byte-is-one-character relationship (no wchar's yet). HTH

    use Inline C; use strict; use warnings; use Time::HiRes qw(time); my $str1 = "abcdefabc"; my $str2 = "xyzabcabc@@@"; my $now = time; my $res; $res = deltasum( $str1, $str2 ) for (1..1_000_000); print "res=$res time=", ( time() - $now ), "us\n"; #res=9 time=0.819153070449829us __END__ __C__ #define DEBUG(x) int counter_tab[256]; int deltasum(char* a, char* b) { int i; int sum = 0; DEBUG( printf("IN: (%d:%s) (%d:%s)\n", strlen(a), a, strlen(b), b); +) bzero( counter_tab, sizeof( counter_tab ) ); for ( ; *a ; ++a ) ++counter_tab[ *a ]; for ( ; *b ; ++b ) --counter_tab[ *b ]; for ( i = 0; i < 256; ++i ) { /* maybe reduced to significant range? + */ sum += abs( counter_tab[i] ); DEBUG( if ( counter_tab[i] ) printf( "'%c' (%3d) x %5d\n", i, i, c +ounter_tab[i]); ) } DEBUG( printf("OUT: %d\n", sum); ) return sum; }
    Update: Moved i and sum to head of function to cope with ancient compilers (thanks to davido). Simplified sizeof() expression.

    Debug sample:

    IN: (9:abcdefabc) (12:xyzabcabc@@@) '@' ( 64) x -3 'd' (100) x 1 'e' (101) x 1 'f' (102) x 1 'x' (120) x -1 'y' (121) x -1 'z' (122) x -1 OUT: 9

    Update: Approx. 60% speed-up when limiting to [ACGTacgt], still room for improvement...

Re: The sum of absolute differences in the counts of chars in two strings.
by roboticus (Chancellor) on Nov 19, 2011 at 17:31 UTC

    BrowserUk:

    Something like this?

    $ cat 938978.pl #!/usr/bin/perl use strict; use warnings; while (<DATA>) { chomp; my ($x, $y) = split /\s+/,$_; last if ! defined $y; print robo_1($x,$y), " <$x> <$y>\n"; } sub robo_1 { my %h; $h{$_}++ for split//,shift; $h{$_}-- for split//,shift; my $t=0; $t += ($_>0 ? $_ : -$_) for values %h; return $t; } __DATA__ 123456 abcdefghijkl ARF_ARF_ARF BARKBARKBARK doom gloom heartbreak anguish $ perl 938978.pl 18 <123456> <abcdefghijkl> 11 <ARF_ARF_ARF> <BARKBARKBARK> 3 <doom> <gloom> 13 <heartbreak> <anguish>

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: The sum of absolute differences in the counts of chars in two strings.
by mbethke (Hermit) on Nov 19, 2011 at 18:03 UTC
    Now I just saw Perlbotics has posted an Inline C solution already but anyway, this is a bit simpler:
    use Inline C; print char_freq_diff_sum("123456", "abcdefghijkl"),"\n"; __DATA__ __C__ #define abs(x) (x<0?-x:x) /* no need for math.h */ int char_freq_diff_sum(char *s, char *t) { int i, sum=0, freqs[256]={0}; for(i=0; i<strlen(s); ++i) freqs[s[i]]++; for(i=0; i<strlen(t); ++i) freqs[t[i]]--; for(i=0; i<256; ++i) sum += abs(freqs[i]); return sum; }
    Of course it makes the same dangerous assumptions like 8-bit character set. Unicode from UCS2 should be possible using the same principle though, a couple orders of magnitude slower but still much faster than pure Perl.
      Do note that in C, strlen isn't a constant time operation - it walks the string looking for a NUL byte. So, for speed issues, one either caches the result, or does away with the index altogether:
      while (*s) {freq[*s++]++;} while (*t) {freq[*t++]--;}
      Your freqs[256]={0}; is unfamiliar syntax to me. Some gcc modernism to initialize the array?
        int freqs[256]={0};

        I think that syntax is part of C99, inherited back as a "good idea" from C++. If the initializer list in curly brackets is shorter than the number of elements in the entity being initialized, all elements not enumerated in the list will initialize to zero. Ex: int freqs[256]={1}; would initialize as freqs[0] = 1, freqs[1] = 0, freqs[2] = 0, etc.

        gcc offers some additional features beyond that, but the syntax used here should be portable to any C99 implementation.


        Dave

        I'm not sure when this array initialization was standardized but since I learned C almost 20 years ago I haven't been treated to any compiler that didn't understand it.

        You've got a point about the strlen. Actually I had it in a separate function first that got a const char* string argument, in that case any optimizer worth its name should notice it's an invariant and do the caching transparently. I haven't looked at the compiled code so it may be that a non-const string would disable this optimization, although the code is simple enough that I'd think the optimizer would notice the string is never modified inside the loop and do it anyway.

Re: The sum of absolute differences in the counts of chars in two strings.
by Limbic~Region (Chancellor) on Nov 19, 2011 at 17:24 UTC
    BrowserUk,
    How long are the strings (min, max, avg)? Will they be the same length? Are the characters single byte (8 bits)? Assuming 1 byte characters, is the alphabet of possible characters restricted or can we assume all 256 possibilities?

    Also, can you give a manual example? I am concerned about misunderstanding the requirements.

    Cheers - L~R

      The strings could contain anything and be of any length, but the example I'm working with is genomic data and less than 100 chars.

      A worked example:

      aaaaacaacaaagcc :: a=>10 c=>4 g=>1 t=>0 acaggtgacaaaaaa :: a=>9 c=>2 g=>3 t=>1 absolute diffs :: 1 2 2 1 sum of diffs :: 6

      It would be wrong to assume an alphabet of 4 even for genomic data.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        BrowserUk,
        I don't understand the example. Should that t => 0 and t => 1 be 1 not 2?

        Update: I am not going to have a chance to play with any of my ideas so I will just share them here in case they are of any help. I was hoping that there would be a way to "cancel out terms" such that there was less work to do. The two ideas I had for that would be performing bitwise operations on the strings to find out which characters were in common and only counting the remaining ones. The second idea I had would be to process the string in chunks rather than characters. If I were trying to do a generic solution though - I would like go with Inline::C (array vs hash) incrementing values for the first string, decrementing values for the second string and summing the 256 indices for the result in the end.

        Cheers - L~R

        How about count others (not agct) as an exception?

        my $aa="AGCTAAABBBCCC"; my %k=(a=>"a",g=>"g",c=>"c",t=>"t",else=>"[^agct]"); my %all; while( my($k,$pattern)=each %k ){ $all{$k}++ while ($aa =~ m/$pattern/gi); }
Re: The sum of absolute differences in the counts of chars in two strings.
by aaron_baugher (Curate) on Nov 19, 2011 at 19:21 UTC

    I don't suppose there's anything particularly creative about my attempts, and I'm sure they won't compare to the speed of C. But I've been looking for a reason to learn to use Benchmark, so I thought I'd give this a try.

    I tried three methods: counting the characters with tr inside an eval, counting with s///g, and splitting the strings into arrays and using grep. Somewhat surprisingly (to me, anyway, as someone who very rarely uses eval), the eval/tr method was by far the slowest. I assume that's from the overhead of eval, but as far as I know, that's necessary, since the variable has to be interpolated. Perhaps it would be faster, as someone upthread suggested, to put the whole thing into a single eval, rather than 52 separate eval calls, but I'm too attached to keeping eval usage as minimal as possible to feel good about that. The grep method was better, but not great, even though I split the strings outside the test. The easy winner was using an s///g regex to count the replacements like tr, but without the need for eval.

    abaugher@bannor:~/work/perl/monks$ cat 938978.pl #!/usr/bin/perl use Modern::Perl; use Benchmark qw(:all);; my( $aa, $bb ) = ('',''); $aa .= chr(65 + rand(26)) for (1..50); $bb .= chr(65 + rand(26)) for (1..50); say $aa; say $bb; my @aa = split //, $aa; my @bb = split //, $bb; cmpthese(100000, { 'grep' => \&usinggrep, 'tr' => \&usingtr, 'sg' => \&usingsg, }); sub usinggrep { my $sum = 0; for my $l ('A'..'Z') { my $acount = grep /$l/, @aa; my $bcount = grep /$l/, @bb; $sum += abs($acount-$bcount); } return $sum; } sub usingtr { my $sum = 0; for my $l ('A'..'Z') { my $acount = eval "'$aa' =~ tr[$l][$l]"; my $bcount = eval "'$bb' =~ tr[$l][$l]"; $sum += abs($acount-$bcount); } return $sum; } sub usingsg { my $sum = 0; for my $l ('A'..'Z') { my $acount = $aa =~ s[$l][$l]g || 0; my $bcount = $bb =~ s[$l][$l]g || 0; $sum += abs($acount-$bcount); } return $sum; } abaugher@bannor:~/work/perl/monks$ perl 938978.pl FHORAICOBMUCUMNYLRCUMVAMGXRRCADIZVTZTRENIEOBGNXSQT JIPPTFERERTBOQOPNQSGWDTTOZOTHNXPEKJACSXEQBOAPIMSHI Rate tr grep sg tr 924/s -- -45% -88% grep 1682/s 82% -- -79% sg 7849/s 749% 367% --

    Aaron B.
    My Woefully Neglected Blog, where I occasionally mention Perl.

      to put the whole thing into a single eval
      You should. Try this, on my machine, this beats your grep solution:
      sub usingtr_1e { eval join '+', map qq{abs(('$aa' =~ y/$_//)-('$bb' =~ y/$_//))}, 'A' .. 'Z'; }
      You forgot about the match operator:
      sub match { my $sum = 0; for my $l ('A'..'Z') { my $acount = 0; ++$acount while($aa =~ /$l/g); my $bcount = 0; ++$bcount while($bb =~ /$l/g); $sum += abs($acount-$bcount); } return $sum; }

      Skip eval and hard-code tr///:

      sub usingtr_d { my $a = $aa; my $b = $bb; # Delete ACGT first, assuming these are the # most common characters. abs($a =~ y/A//d - $b =~ y/A//d) + abs($a =~ y/C//d - $b =~ y/C//d) + abs($a =~ y/G//d - $b =~ y/G//d) + abs($a =~ y/T//d - $b =~ y/T//d) + abs($a =~ y/B// - $b =~ y/B//) + abs($a =~ y/D// - $b =~ y/D//) + abs($a =~ y/E// - $b =~ y/E//) + abs($a =~ y/F// - $b =~ y/F//) + abs($a =~ y/H// - $b =~ y/H//) + abs($a =~ y/I// - $b =~ y/I//) + abs($a =~ y/J// - $b =~ y/J//) + abs($a =~ y/K// - $b =~ y/K//) + abs($a =~ y/L// - $b =~ y/L//) + abs($a =~ y/M// - $b =~ y/M//) + abs($a =~ y/N// - $b =~ y/N//) + abs($a =~ y/O// - $b =~ y/O//) + abs($a =~ y/P// - $b =~ y/P//) + abs($a =~ y/Q// - $b =~ y/Q//) + abs($a =~ y/R// - $b =~ y/R//) + abs($a =~ y/S// - $b =~ y/S//) + abs($a =~ y/U// - $b =~ y/U//) + abs($a =~ y/V// - $b =~ y/V//) + abs($a =~ y/W// - $b =~ y/W//) + abs($a =~ y/X// - $b =~ y/X//) + abs($a =~ y/Y// - $b =~ y/Y//) + abs($a =~ y/Z// - $b =~ y/Z//); }

      Benchmarks, including roboticus' robo_1, trizen's match, and choroba's tr_1e:

      Rate tr grep tr_1e sg match robo_1 tr_d tr 843/s -- -18% -56% -75% -80% -92% -99% grep 1022/s 21% -- -47% -70% -76% -90% -99% tr_1e 1915/s 127% 87% -- -44% -55% -81% -98% sg 3419/s 306% 235% 79% -- -20% -66% -97% match 4291/s 409% 320% 124% 25% -- -57% -96% robo_1 9984/s 1085% 877% 421% 192% 133% -- -90% tr_d 100985/s 11881% 9784% 5175% 2853% 2254% 911% --

      Tested against these strings, which are ~80% ACGT:

      my @alpha = ('A'..'Z', qw[ A C G T ] x 20); my $aa = join '', map $alpha[rand @alpha], 1..100; my $bb = join '', map $alpha[rand @alpha], 1..100;
Re: The sum of absolute differences in the counts of chars in two strings.
by bluescreen (Friar) on Nov 19, 2011 at 17:08 UTC

    Are the strings likely to be equals ( or almost ) ? If so you could use a moving pointers and only do the counting if the characters at the pointer's position differ saving some cycles.

Re: The sum of absolute differences in the counts of chars in two strings.
by locked_user sundialsvc4 (Abbot) on Nov 19, 2011 at 21:56 UTC

    It just seems to me, and please don’t take this glibly or wrongly, that the first order of business would be to try to guess ... what would actually slow it down?   I can’t think of anything...

    If the set of characters is fixed, then a fixed-size array will accumulate counts with zero difference of timing of one character versus another, and there is zero possibility of page-fault activity here.   Likewise, I can scarce imagine how the time required to simply loop through the entire arrays, each and every time, to calculate the differences could ever be objectionable ... or for that matter, varying.  

    I mean, sure, you could think of “optimizations,” but in a fairly trivial case like this, they could wind up costing more time.   A completely dumb, completely brute-force solution is going to crunch through any set of strings presented to it with an unchanging (and extremely rapid) cycle-time per character no matter what the data turns out to be.

    In short, my initial reaction is that the most-obvious way is the best way.   I genuinely suspect that any deviation from this straight-path just might wind up being significantly slower.

      I mean, sure, you could think of “optimizations,” but in a fairly trivial case like this, they could wind up costing more time.

      Time and again here, one or more of the monks have spotted a solution to a problem that at first sight appear intractable to optimisation.

      For example: in Comparing a large set of DNA sequences, the OP describes an apparently simple process of cross comparing 100,000 x 20-char strings, as "It takes on the orders of days to process".

      Various approaches to this problem have been suggested down the years including the use of any of several well-known fuzzy matching and edit distance algorithms, like String::Approx, Text::Levenshtien, Text::Soundex Text::WagnerFischer and others. But these brute-force, char-by-char O(M*N) comparisons are horrible expensive -- even when coded in XS -- and often not so useful for the desired results either, as edit-distances are a quite different metric to the required: 'are these two strings the same except for in at most n places'.

      Another frequently offered alternative is the use of regexes, perhaps programmically generated. Whilst these work fine and beat the edit-distance algorithms hands down, they generally require quite-to-very complex regexes that, by requirement, contain many backtrack points, with the result that their runtime performance is understandably very poor.

      The root of the problem is the O(N2/2) nature of the round-robin comparison process. The exponential growth in the numbers of comparisons required, means that even for relatively low numbers of strings -- and in genomic work, 100,000 is low -- every extra microsecond taken performing each comparison adds an hour and a half to the processing time.

      And by the time you get to a still relatively small 1 million strings, each extra microsecond will cost you an extra 138 hours. Almost an extra week of waiting, not to mention the ~15kWhs of electricity used in the process.

      It would likely not be at all obvious to you that using string-wise XOR would form the basis of the most efficient (pure Perl) method of doing this, but its use here reduces days of processing to a few hours.

      Even less obvious is the likelihood that anyone would offer a solution that reduces the number of comparisons required from 5 billion fuzzy matches to the order of 300,000 O(1) lookups, yet here it is.

      Whilst my question has yet to garner anything other than more efficiently implemented brute-force algorithms -- themselves very valuable in the absence of a different solution -- there are still at least a couple of avenues yet unexplored.

      So, whilst I couldn't think of any game changing algorithm or approach to the problem, I'm long enough in the tooth to know that it does no harm to ask and can often lead to surprises.

      So, you'll forgive me -- or not -- if I file your "I can’t think of anything.. non-response, under the heading of: "Saw the opportunity to garner a few XP, but couldn't think of anything useful to say, so I'll say just that, and wrap it in some other plausible but meaningless drivel."


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        Omitting ... please!! ... that final paragraph from your post, which was entirely unnecessary, I found your comments up to that point extremely interesting, and I immediately followed and read both the link that you gave in the second-from-last paragraph and the entire thread.

        Please enlighten me here ... I am quite, quite serious ... what is it about your original description of the problem that I have overlooked?   If the “alphabet” were a normal one, say no more than 256 elements, then we will have a 256-element array set to all-zero and we will tally each character with zero variance in time based on character value.   Then we will loop through exactly 256 accumulators, adding them up and setting them back to zero as we go.   And then we will have our answer.

        As do many of us, I have plenty of experience with (differing kinds of) “massive problems” in which very slight gains, repeated hundreds of millions of times if not many times more, make all the difference in the world.   Which, having said that, “is neither here nor there.”   So, I fully appreciate what you are saying right down to and including “15 kWh of electricity,” but I simply do not yet see what makes this particular problem “hard.”

        I know that it must be, or you would not have said it.   (I would be an idiot to question your knowledge or experience.)   But something in the description, I guess, eluded me.   What did I overlook?   I really do want to know.

Re: The sum of absolute differences in the counts of chars in two strings.(Results)
by BrowserUk (Patriarch) on Nov 21, 2011 at 17:28 UTC

    Here's my benchmark of a couple of PP and several I::C versions based (loosely) on the ideas posted in the thread along with my chosen I::C version which gains a little performance by using chars for the counting as 127 chars is a reasonable maximum length. And gains a little more by restricting the alphabet to printable ascii chars only, which is also reasonable for the envisaged use. The benchmark results are:

    Rate PP trg mb Pb JF me PP 0.168/s -- -92% -96% -97% -97% -97% trg 2.14/s 1170% -- -55% -58% -59% -68% mb 4.72/s 2703% 121% -- -8% -10% -30% Pb 5.13/s 2948% 140% 9% -- -3% -23% JF 5.27/s 3031% 147% 12% 3% -- -21% me 6.70/s 3881% 213% 42% 31% 27% --

    And the code:

    #! perl -slw use strict; use Inline C => Config => BUILD_NOISY => 1; use Inline C => <<'END_C', NAME => 'deltasum', CLEAN_AFTER_BUILD => 0 +; // Assumes equal length ascii strings and no embedded nulls. int deltasum( char *a, char *b ) { signed char counts[256] = { 0, }; int i, sum = 0; while( *a && *b ) ++counts[ *a++ ], --counts[ *b++ ]; for( i = 32; i < 127; ++i ) sum += abs( counts[ i ] ); return sum; } int Perlbotics(char* a, char* b) { int counter_tab[256] = { 0 }; int i; int sum = 0; for ( ; *a ; ++a ) ++counter_tab[ *a ]; for ( ; *b ; ++b ) --counter_tab[ *b ]; for ( i = 0; i < 256; ++i ) { /* maybe reduced to significant range? + */ sum += abs( counter_tab[i] ); } return sum; } int mbethke(char *s, char *t) { int i, sum=0, freqs[256] = { 0 }; for(i=0; i<strlen(s); ++i) freqs[s[i]]++; for(i=0; i<strlen(t); ++i) freqs[t[i]]--; for(i=0; i<256; ++i) sum += abs(freqs[i]); return sum; } int JavaFan(char *s, char *t) { int i, sum=0, freqs[256] = { 0 }; while (*s) {freqs[*s++]++;} while (*t) {freqs[*t++]--;} for(i=0; i<256; ++i) sum += abs(freqs[i]); return sum; } END_C use Data::Dump qw[ pp ]; use Benchmark qw[ cmpthese ]; sub deltaSum { my( $sum, @counts ) = 0; ++$counts[ $_ ] for unpack 'C*', $_[0]; --$counts[ $_ ] for unpack 'C*', $_[1]; $sum += abs( $counts[ $_ ] // 0 ) for 32 .. 127; return $sum; } sub genTr { my $code = join "+", map{ "abs( \$_[0] =~ tr[$_][$_] - \$_[1] =~ tr[$_][$_] )" } @_; eval "sub{ $code; }"; } chomp( our @s = <DATA> ); our $I //= -1; cmpthese $I, { PP => q[ my $total = 0; for my $i ( 0 .. $#s ) { $total += deltaSum( @s[ $i, $_ ] ) for $i+1 .. $#s; } print "deltaSum : $total" if $I == 1; ], trg => q[ my $total = 0; my $sub = genTr( qw[a c g t ] ); for my $i ( 0 .. $#s ) { $total += $sub->( @s[ $i, $_ ] ) for $i+1 .. $#s; } print "tr_g : $total" if $I == 1; ], me => q[ my $total = 0; for my $i ( 0 .. $#s ) { $total += deltasum( @s[ $i, $_ ] ) for $i+1 .. $#s; } print "deltasum : $total" if $I == 1; ], Pb => q[ my $total = 0; for my $i ( 0 .. $#s ) { $total += Perlbotics( @s[ $i, $_ ] ) for $i+1 .. $#s; } print "Perbotics : $total" if $I == 1; ], mb => q[ my $total = 0; for my $i ( 0 .. $#s ) { $total += mbethke( @s[ $i, $_ ] ) for $i+1 .. $#s; } print "mbethke : $total" if $I == 1; ], JF => q[ my $total = 0; for my $i ( 0 .. $#s ) { $total += JavaFan( @s[ $i, $_ ] ) for $i+1 .. $#s; } print "JavaFan : $total" if $I == 1; ], }; __DATA__ aaaaaaaaaaacccg aaaaaaaaaaccccg aaaaaaacccccccc aaaaaaaaacccccg aaaaaaaaaaaccgg aaaaaaaaaacccgg aaaaaaacccccccg aaaaaaaaaccccgg aaaaaaaaaaccggg aaaaaaaacccccgg aaaaaaccccccccg aaaaaaaaacccggg aaaaacccccccccg aaaaaaaccccccgg aaaaccccccccccg aaaaaaaaccccggg aaaaaaaaaacgggg aaaaaacccccccgg aaaaaaaaaacccct aaaaaaaaaccgggg aaaaaccccccccgg aaaaaaacccccggg aaaaaaaaaccccct aaaaaaaaaaaccgt aaaacccccccccgg aaaaaaaacccgggg aaaaaaccccccggg aaaaaaaacccccct aaaaaaaaaaaaggt aaaaaaaaaacccgt aaaccccccccccgg aaaaacccccccggg aaaaaaaaacggggg aaaaaaaccccgggg aaaaaaaccccccct aaaaaaaaaccccgt aaaaaaaaaaacggt aaaaaaaaccggggg aaaaccccccccggg aaaaaacccccgggg aaaaaacccccccct aaaaaaaacccccgt aaaaaaaaaaccggt aaaaaccccccgggg aaacccccccccggg aaaaaaacccggggg aaaaaaaccccccgt aaaaaccccccccct aaaaaaaaacccggt aaaaaaaaaaagggt aaaaaaaacgggggg aaaacccccccgggg aaccccccccccggg aaaaaaccccggggg aaaaaaaaccccggt aaaaaaaaaacgggt aaaaaacccccccgt aaaacccccccccct aaaaacccccggggg aaaaaaaccgggggg aaaccccccccgggg aaaaaccccccccgt aaaaaaacccccggt aaaaaaaaaccgggt aaaaccccccggggg aaaaaaaaggggggg aacccccccccgggg aaaaaacccgggggg aaaaaaaacccgggt aaaaaaaaaaggggt aaaacccccccccgt aaaaaaccccccggt aacccccccccccct aaacccccccggggg accccccccccgggg aaaaaccccgggggg aaaaaaaaaaccctt aaaaaaacggggggg aaaaacccccccggt aaaaaaaccccgggt aaaccccccccccgt aaaaaaaaacggggt aaccccccccggggg aaaaaaaaacccctt aaaacccccgggggg aaaaaaccggggggg aaaaaaaaaaacgtt aacccccccccccgt aaaaaacccccgggt aaaaaaaaccggggt aaaaccccccccggt aaaaaaaaccccctt aaaaaaagggggggg acccccccccggggg aaaccccccgggggg aaaaacccggggggg aaaaaaaaaaccgtt aaacccccccccggt aaaaaaacccggggt aaaaaccccccgggt aaaaaaaaagggggt aaaaaaacccccctt aacccccccgggggg aaaaaacgggggggg aaaaaaaaacccgtt aaaaaaaaaaaggtt aaaaccccggggggg aaccccccccccggt aaaaaaccccggggt aaaaaaaacgggggt aaaacccccccgggt aaaaaaaaccccgtt aaaaaaaaaacggtt accccccccgggggg aaaaaccgggggggg aaacccccggggggg aaaaaaccccccctt aaaccccccccgggt acccccccccccggt aaaaacccccggggt aaaaaaaccgggggt aaaacccgggggggg aaaaaaacccccgtt aaaaaaaaaccggtt aaaaaaggggggggg cccccccccgggggg aaccccccggggggg aaaaacccccccctt aaaaaacccgggggt aaaaaaaaggggggt aaaaccccccggggt aacccccccccgggt acccccccggggggg aaaccccgggggggg aaaaaaaaaagggtt aaaaaaaacccggtt aaaaaaccccccgtt aaaaacggggggggg aaaaccccccccctt aaaaaccccgggggt aaacccccccggggt accccccccccgggt aaaaaaacggggggt aaaaacccccccgtt aaaaccggggggggg aacccccgggggggg aaacccccccccctt aaaaaaaaacgggtt ccccccccggggggg aaaaaaaccccggtt aaaaaaaaaaacttt aaccccccccggggt aaaaaaccggggggt aaaacccccgggggt aaacccggggggggg accccccgggggggg aaaaccccccccgtt aaccccccccccctt aaaaaaaaccgggtt aaaaaacccccggtt aaaaaaaaaaccttt aaaaacccggggggt aaaaaaagggggggt aaaccccccgggggt acccccccccggggt aaccccggggggggg aaaacgggggggggg aaaaaccccccggtt aaacccccccccgtt aaaaaaacccgggtt cccccccgggggggg aaaaaaaaaggggtt aaaaaaaaacccttt aaaaccccggggggt aaaaaacgggggggt aacccccccgggggt aaccccccccccgtt aaaacccccccggtt aaaaaaaacggggtt aaaccgggggggggg aaaaaaccccgggtt acccccggggggggg aaaaaaaaaacgttt aaaaaaaaccccttt aaaaaccgggggggt aaacccccggggggt accccccccgggggt aaaaacccccgggtt acccccccccccgtt aaaaaaaccggggtt aaaccccccccggtt aacccgggggggggg aaaaaaaaaccgttt aaccccccggggggt cccccccccgggggt aaaaaaggggggggt aaaaaaacccccttt aaaacccgggggggt accccgggggggggg aacccccccccggtt aaaaaacccggggtt aaaaccccccgggtt aaaaaaaagggggtt aaaaaaccccccttt aaaaaaaaaaggttt aaaaacggggggggt aaaccccgggggggt aaaaaaaacccgttt acccccccggggggt aaaaaccccggggtt aaaaaaacgggggtt aaccggggggggggg accccccccccggtt aaacccccccgggtt aaaaaaaccccgttt ccccccccggggggt aaaaccggggggggt aacccccgggggggt aaaaaaaaacggttt aaaaacccccccttt aaccccccccgggtt cccccccccccggtt aaaaaaccgggggtt aaaacccccggggtt aaacccggggggggt aaaaagggggggggt accccccgggggggt aaaaaacccccgttt aaaaaaaaccggttt aaaaccccccccttt aaaccccccggggtt acccccccccgggtt aaaaacccgggggtt aaaaaaaggggggtt cccccccgggggggt aaaaaaacccggttt aaaaaaaaagggttt aaaaaccccccgttt aaacccccccccttt aaaacgggggggggt aaccccggggggggt aaaaccccgggggtt aaaaaacggggggtt ccccccccccgggtt aacccccccggggtt aaaccgggggggggt aaccccccccccttt aaaaaaaacgggttt aaaacccccccgttt acccccggggggggt aaaaaaccccggttt aaaaaaaaaactttt aaaaaccggggggtt accccccccggggtt aaacccccgggggtt aaaccccccccgttt aaaaaaaccgggttt aaaaggggggggggt aaaaacccccggttt aacccgggggggggt ccccccggggggggt cccccccccggggtt aaaaaagggggggtt aaccccccgggggtt aaaaaaaaacctttt aaaacccggggggtt aaaaccccccggttt aaaaaacccgggttt aacccccccccgttt aaacggggggggggt aaaaaaaaggggttt accccgggggggggt aaaaaaaaaagtttt aaaccccggggggtt aaaaaaaaccctttt acccccccgggggtt aaaaacgggggggtt aaaaaccccgggttt aaccggggggggggt accccccccccgttt aaacccccccggttt aaaaaaacggggttt ccccccccgggggtt aaaaaaacccctttt aaaaccgggggggtt aacccccggggggtt aaaaaaaaacgtttt aaaacccccgggttt aaaaaaccggggttt acccggggggggggt aaccccccccggttt aaaaaaccccctttt aaaaaggggggggtt aaaaaaaaccgtttt accccccggggggtt aaacccgggggggtt acccccccccggttt aaaaacccggggttt aaaccccccgggttt aaaaaaagggggttt ccccggggggggggt aacgggggggggggt cccccccggggggtt aaaaaaaaaggtttt aaaacggggggggtt aaaaacccccctttt aaccccgggggggtt aaaaaaacccgtttt aacccccccgggttt aaaaaacgggggttt ccccccccccggttt aaaaccccggggttt aaaaaaaacggtttt aaaccggggggggtt aaaaccccccctttt aaaaaaccccgtttt acccccgggggggtt aaaaaccgggggttt aaacccccggggttt accccccccgggttt aaaaaaaccggtttt aaacccccccctttt aaaaacccccgtttt ccccccgggggggtt aaaagggggggggtt aacccggggggggtt acggggggggggggt aaaaaaggggggttt aaaacccgggggttt cccccccccgggttt aaccccccggggttt accccggggggggtt aaccccccccctttt aaaaaaaagggtttt aaaaccccccgtttt aaaaaacccggtttt aaacgggggggggtt aaaccccgggggttt acccccccggggttt aaaaacggggggttt aaaaaaacgggtttt aaaaaccccggtttt aaccgggggggggtt cccccggggggggtt aaacccccccgtttt acccccccccctttt aaaaccggggggttt aacccccgggggttt ccccccccggggttt aaaaaaaaacttttt aaccccccccgtttt acccgggggggggtt ccccccccccctttt aaaaaaccgggtttt aaaacccccggtttt aaaggggggggggtt aaaaaaaaccttttt aaaaagggggggttt aaacccggggggttt accccccgggggttt acccccccccgtttt aacggggggggggtt ccccgggggggggtt aaaaacccgggtttt aaaaaaaggggtttt aaaccccccggtttt aaccccggggggttt cccccccgggggttt aaaaaaaaagttttt aaaacgggggggttt aaaaaaacccttttt ccccccccccgtttt accggggggggggtt aacccccccggtttt aaaaaacggggtttt aaaaccccgggtttt aaaaaaaacgttttt acccccggggggttt aaaccgggggggttt aaaaaaccccttttt accccccccggtttt aagggggggggggtt aaacccccgggtttt aaaaaccggggtttt aacccgggggggttt aaaaggggggggttt aaaaacccccttttt ccccccggggggttt aaaaaaaccgttttt aaaaaagggggtttt acgggggggggggtt aaccccccgggtttt cccccccccggtttt aaaacccggggtttt aaaaccccccttttt accccgggggggttt aaaaaaaaggttttt aaaaaacccgttttt aaacggggggggttt ccgggggggggggtt acccccccgggtttt aaaccccggggtttt aaaaacgggggtttt aaaaaccccgttttt aaacccccccttttt cccccgggggggttt aaccggggggggttt aaaaaaacggttttt aacccccggggtttt aaaaccgggggtttt ccccccccgggtttt acccggggggggttt aaaacccccgttttt aaccccccccttttt aaaaaaccggttttt aaagggggggggttt aaacccgggggtttt accccccggggtttt aaaaaggggggtttt ccccggggggggttt aaaccccccgttttt acccccccccttttt aaaaaaagggttttt aacgggggggggttt aaaaacccggttttt aaaacggggggtttt aaccccgggggtttt cccccccggggtttt aaaaccccggttttt accgggggggggttt ccccccccccttttt aaaaaacgggttttt aacccccccgttttt acccccgggggtttt aaaaaaaactttttt aaaccggggggtttt aaacccccggttttt cccgggggggggttt aaggggggggggttt accccccccgttttt aaaaaccgggttttt aaaagggggggtttt aacccggggggtttt aaaaaaacctttttt ccccccgggggtttt aaaaaaggggttttt cccccccccgttttt acggggggggggttt aaccccccggttttt aaaacccgggttttt accccggggggtttt aaacgggggggtttt aaaaaaaagtttttt aaaaaaccctttttt acccccccggttttt ccggggggggggttt aaaaacggggttttt aaaccccgggttttt aaaaaaacgtttttt aaaaacccctttttt cccccggggggtttt aaccgggggggtttt aaaaccggggttttt ccccccccggttttt aacccccgggttttt aaaggggggggtttt aaaaccccctttttt acccgggggggtttt aaaaaaccgtttttt aaaaagggggttttt cgggggggggggttt accccccgggttttt aaacccggggttttt aaacccccctttttt aacggggggggtttt aaaaacccgtttttt ccccgggggggtttt aaaaaaaggtttttt cccccccgggttttt aaccccggggttttt aaaacgggggttttt aaccccccctttttt aaaaaacggtttttt accggggggggtttt aaaaccccgtttttt aaaccgggggttttt acccccggggttttt acccccccctttttt aagggggggggtttt aaaaaccggtttttt cccggggggggtttt aaacccccgtttttt aacccgggggttttt aaaaggggggttttt ccccccggggttttt aaaaaagggtttttt aaaacccggtttttt acgggggggggtttt aaccccccgtttttt aaaaaaaattttttt aaacggggggttttt accccgggggttttt aaaccccggtttttt ccgggggggggtttt acccccccgtttttt aaaaacgggtttttt aaccggggggttttt aaaaaaacttttttt cccccgggggttttt aaaaccgggtttttt ccccccccgtttttt aggggggggggtttt aacccccggtttttt aaaaaaccttttttt acccggggggttttt aaagggggggttttt cggggggggggtttt aaaaaggggtttttt aaacccgggtttttt accccccggtttttt aacgggggggttttt ccccggggggttttt aaaaacccttttttt aaaaaaagttttttt aaccccgggtttttt cccccccggtttttt aaaacggggtttttt aaaaaacgttttttt accgggggggttttt aaaaccccttttttt acccccgggtttttt aaaccggggtttttt cccgggggggttttt aaaaaccgttttttt aaggggggggttttt aaacccccttttttt ccccccgggtttttt aacccggggtttttt aaaagggggtttttt aaaacccgttttttt aaaaaaggttttttt acggggggggttttt aaccccccttttttt accccggggtttttt aaacgggggtttttt aaaaacggttttttt ccggggggggttttt aaaccccgttttttt acccccccttttttt cccccggggtttttt aaccgggggtttttt aaaaccggttttttt agggggggggttttt ccccccccttttttt aacccccgttttttt aaaggggggtttttt acccgggggtttttt cgggggggggttttt aaacccggttttttt accccccgttttttt aaaaagggttttttt ccccgggggtttttt aacggggggtttttt aaaaaaatttttttt cccccccgttttttt aaccccggttttttt aaaacgggttttttt accggggggtttttt aaaaaactttttttt ggggggggggttttt acccccggttttttt aaaccgggttttttt cccggggggtttttt aaaaacctttttttt aagggggggtttttt aaaaggggttttttt ccccccggttttttt aacccgggttttttt aaaaaagtttttttt acgggggggtttttt aaaaccctttttttt aaacggggttttttt accccgggttttttt ccgggggggtttttt aaaaacgtttttttt aaacccctttttttt aaccggggttttttt cccccgggttttttt aaaaccgtttttttt aaccccctttttttt aggggggggtttttt acccggggttttttt aaagggggttttttt aaacccgtttttttt cggggggggtttttt acccccctttttttt aaaaaggtttttttt ccccggggttttttt aacgggggttttttt aaaacggtttttttt aaccccgtttttttt ccccccctttttttt accgggggttttttt aaaccggtttttttt gggggggggtttttt acccccgtttttttt aaggggggttttttt cccgggggttttttt aaaagggtttttttt aacccggtttttttt ccccccgtttttttt acggggggttttttt accccggtttttttt aaacgggtttttttt aaaaacttttttttt ccggggggttttttt cccccggtttttttt aaccgggtttttttt aaaaccttttttttt agggggggttttttt acccgggtttttttt aaaggggtttttttt aaaaagttttttttt aaacccttttttttt cgggggggttttttt ccccgggtttttttt aacggggtttttttt aaaacgttttttttt aaccccttttttttt accggggtttttttt aaaccgttttttttt acccccttttttttt aagggggtttttttt cccggggtttttttt aacccgttttttttt aaaaggttttttttt acgggggtttttttt aaacggttttttttt accccgttttttttt ccgggggtttttttt cccccgttttttttt aaccggttttttttt aggggggtttttttt aaagggttttttttt acccggttttttttt cggggggtttttttt aacgggttttttttt ccccggttttttttt aaaactttttttttt accgggttttttttt aaacctttttttttt gggggggtttttttt aaggggttttttttt cccgggttttttttt aaaagtttttttttt aaccctttttttttt acggggttttttttt aaacgtttttttttt acccctttttttttt ccggggttttttttt aaccgtttttttttt agggggttttttttt acccgtttttttttt aaaggtttttttttt aacggtttttttttt accggtttttttttt ggggggttttttttt aagggtttttttttt cccggtttttttttt acgggtttttttttt aaacttttttttttt ccgggtttttttttt aaccttttttttttt aggggtttttttttt cggggtttttttttt aacgttttttttttt accgttttttttttt gggggtttttttttt aaggttttttttttt ccggttttttttttt cgggttttttttttt acctttttttttttt ggggttttttttttt

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: The sum of absolute differences in the counts of chars in two strings.
by TJPride (Pilgrim) on Nov 24, 2011 at 16:47 UTC
    Assuming the strings are fairly short and contain a limited number of unique characters, which is likely in this case:

    print diffCount('aaaaacaacaaagcc', 'acaggtgacaaaaaa'); sub diffCount { my (%counts, $sum, $i); $counts{substr($_[0], $_, 1)}++ for 0..(length($_[0]) - 1); $counts{substr($_[1], $_, 1)}-- for 0..(length($_[1]) - 1); $sum += abs $_ for values %counts; return $sum; }

    I tried other methods using split, regex, etc., but this benchmarked the fastest. Obviously, doing the same thing in C or C++ would be even faster.