BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

Given two strings:

  1. Count the number of each character within each.
  2. Then find the absolute difference of the counts for each character.
  3. Then sum those differences.

This is not a difficult process. Split the strings into characters, use a couple of hashes or arrays to accumulate the counts. Iterate those and perform the subtractions. Sum the differences.

But I need to do this many times, so doing it efficiently is critical. No code from me, to avoid influencing the creativity. Suggestions?


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re: The sum of absolute differences in the counts of chars in two strings.
by Corion (Patriarch) on Nov 19, 2011 at 15:41 UTC

    If the alphabet used is known, most of its letters are likely to occur, and the strings are not too long, I think it would be worthwhile to investigate tr/// over split+count, potentially by generating one huge eval string:

    my $code = join "\n;\n",
        map { sprintf '$count{ %s } = tr/%s/%s/', $_, $_, $_ }
        @alphabet;
    my %count;
    local $_ = $string1;
    eval $code;

    The counting of the second string could be rolled into a second eval:

    my $code2 = join "\n+\n",
        map { sprintf 'abs($count{ %s } - tr/%s/%s/)', $_, $_, $_ }
        @alphabet;
    local $_ = $string2;
    my $sum = eval $code2;

    But as the tr approach reads the strings multiple times, this will really depend on the length of the strings and the size of the alphabet.
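    Putting the two together, a minimal self-contained sketch (my assembly of the snippets above, assuming a lowercase a..z alphabet and capturing the second eval's value as the result) might be:

    use strict;
    use warnings;

    my @alphabet = ('a' .. 'z');    # assumed alphabet
    my ($string1, $string2) = ('aaaaacaacaaagcc', 'acaggtgacaaaaaa');

    # Generate and eval the counting code for the first string.
    my %count;
    my $code = join "\n;\n",
        map { sprintf '$count{ %s } = tr/%s/%s/', $_, $_, $_ } @alphabet;
    {
        local $_ = $string1;
        eval $code;
    }

    # Generate and eval the count-and-sum code for the second string.
    my $code2 = join "\n+\n",
        map { sprintf 'abs($count{ %s } - tr/%s/%s/)', $_, $_, $_ } @alphabet;
    my $sum;
    {
        local $_ = $string2;
        $sum = eval $code2;
    }
    print "$sum\n";    # prints 6 for these strings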

Re: The sum of absolute differences in the counts of chars in two strings.
by Perlbotics (Archbishop) on Nov 19, 2011 at 16:47 UTC

    Here's my brute-force try. I don't know whether the call overhead outweighs the gain. I assumed a one-byte-is-one-character relationship (no wchars yet). HTH

    use Inline C;
    use strict;
    use warnings;
    use Time::HiRes qw(time);

    my $str1 = "abcdefabc";
    my $str2 = "xyzabcabc@@@";

    my $now = time;
    my $res;
    $res = deltasum( $str1, $str2 ) for (1..1_000_000);
    print "res=$res time=", ( time() - $now ), "us\n";
    # res=9 time=0.819153070449829us

    __END__
    __C__
    #define DEBUG(x)

    int counter_tab[256];

    int deltasum(char* a, char* b) {
      int i;
      int sum = 0;
      DEBUG( printf("IN: (%d:%s) (%d:%s)\n", strlen(a), a, strlen(b), b); )
      bzero( counter_tab, sizeof( counter_tab ) );
      for ( ; *a ; ++a ) ++counter_tab[ *a ];
      for ( ; *b ; ++b ) --counter_tab[ *b ];
      for ( i = 0; i < 256; ++i ) {  /* maybe reduce to the significant range? */
        sum += abs( counter_tab[i] );
        DEBUG( if ( counter_tab[i] ) printf( "'%c' (%3d) x %5d\n", i, i, counter_tab[i]); )
      }
      DEBUG( printf("OUT: %d\n", sum); )
      return sum;
    }
    Update: Moved i and sum to the head of the function to cope with ancient compilers (thanks to davido). Simplified the sizeof() expression.

    Debug sample:

    IN: (9:abcdefabc) (12:xyzabcabc@@@)
    '@' ( 64) x    -3
    'd' (100) x     1
    'e' (101) x     1
    'f' (102) x     1
    'x' (120) x    -1
    'y' (121) x    -1
    'z' (122) x    -1
    OUT: 9

    Update: Approx. 60% speed-up when limiting to [ACGTacgt], still room for improvement...

Re: The sum of absolute differences in the counts of chars in two strings.
by roboticus (Chancellor) on Nov 19, 2011 at 17:31 UTC

    BrowserUk:

    Something like this?

    $ cat 938978.pl
    #!/usr/bin/perl

    use strict;
    use warnings;

    while (<DATA>) {
        chomp;
        my ($x, $y) = split /\s+/, $_;
        last if ! defined $y;
        print robo_1($x, $y), " <$x> <$y>\n";
    }

    sub robo_1 {
        my %h;
        $h{$_}++ for split //, shift;
        $h{$_}-- for split //, shift;
        my $t = 0;
        $t += ($_ > 0 ? $_ : -$_) for values %h;
        return $t;
    }

    __DATA__
    123456 abcdefghijkl
    ARF_ARF_ARF BARKBARKBARK
    doom gloom
    heartbreak anguish

    $ perl 938978.pl
    18 <123456> <abcdefghijkl>
    11 <ARF_ARF_ARF> <BARKBARKBARK>
    3 <doom> <gloom>
    13 <heartbreak> <anguish>

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: The sum of absolute differences in the counts of chars in two strings.
by mbethke (Hermit) on Nov 19, 2011 at 18:03 UTC
    Now I just saw that Perlbotics has posted an Inline C solution already, but anyway, this is a bit simpler:
    use Inline C;

    print char_freq_diff_sum("123456", "abcdefghijkl"), "\n";

    __DATA__
    __C__
    #define abs(x) ((x)<0?-(x):(x))  /* no need for math.h */

    int char_freq_diff_sum(char *s, char *t) {
        int i, sum=0, freqs[256]={0};
        for(i=0; i<strlen(s); ++i) freqs[s[i]]++;
        for(i=0; i<strlen(t); ++i) freqs[t[i]]--;
        for(i=0; i<256; ++i)
            sum += abs(freqs[i]);
        return sum;
    }
    Of course it makes the same dangerous assumptions, such as an 8-bit character set. Unicode from UCS-2 should be possible using the same principle, though: a couple of orders of magnitude slower, but still much faster than pure Perl.
      Do note that in C, strlen isn't a constant-time operation - it walks the string looking for a NUL byte. So, for speed, one either caches the result or does away with the index altogether:
      while (*s) { freqs[*s++]++; }
      while (*t) { freqs[*t++]--; }
      Your freqs[256]={0}; is unfamiliar syntax to me. Some gcc modernism to initialize the array?
        int freqs[256]={0};

        I think that syntax is part of C99, inherited back as a "good idea" from C++. If the initializer list in curly brackets is shorter than the number of elements in the entity being initialized, all elements not enumerated in the list are initialized to zero. For example, int freqs[256]={1}; would initialize freqs[0] = 1, freqs[1] = 0, freqs[2] = 0, and so on.

        gcc offers some additional features beyond that, but the syntax used here should be portable to any C99 implementation.


        Dave

        I'm not sure when this array initialization was standardized, but in the almost 20 years since I learned C I haven't been treated to a compiler that didn't understand it.

        You've got a point about the strlen. Actually, I had it in a separate function at first, one that took a const char* argument; in that case, any optimizer worth its name should notice the length is loop-invariant and do the caching transparently. I haven't looked at the compiled code, so a non-const string may disable this optimization, although the code is simple enough that I'd expect the optimizer to notice the string is never modified inside the loop and cache the length anyway.

Re: The sum of absolute differences in the counts of chars in two strings.
by Limbic~Region (Chancellor) on Nov 19, 2011 at 17:24 UTC
    BrowserUk,
    How long are the strings (min, max, avg)? Will they be the same length? Are the characters single byte (8 bits)? Assuming 1 byte characters, is the alphabet of possible characters restricted or can we assume all 256 possibilities?

    Also, can you give a manual example? I am concerned about misunderstanding the requirements.

    Cheers - L~R

      The strings could contain anything and be of any length, but the example I'm working with is genomic data and less than 100 chars.

      A worked example:

      aaaaacaacaaagcc :: a => 10  c => 4  g => 1  t => 0
      acaggtgacaaaaaa :: a =>  9  c => 2  g => 3  t => 1
      absolute diffs  ::       1       2       2       1
      sum of diffs    :: 6

      It would be wrong to assume an alphabet of 4 even for genomic data.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        BrowserUk,
        I don't understand the example. Shouldn't the diff for t => 0 and t => 1 be 1, not 2?

        Update: I am not going to have a chance to play with any of my ideas, so I will just share them here in case they are of any help. I was hoping that there would be a way to "cancel out terms" so that there was less work to do. The first idea I had was to perform bitwise operations on the strings to find out which characters were in common, and only count the remaining ones; a rough sketch follows below. The second was to process the strings in chunks rather than character by character. If I were trying to do a generic solution, though, I would likely go with Inline::C (array vs hash): increment values for the first string, decrement values for the second string, and sum the 256 indices for the result at the end.
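        One possible reading of the first idea (my sketch, not L~R's code; it assumes equal-length strings): xor the strings, which yields NUL bytes exactly where they agree, then count characters only at the non-NUL positions, since matching positions cancel out of both counts anyway:

        use strict;
        use warnings;

        my ($x, $y) = ('aaaaacaacaaagcc', 'acaggtgacaaaaaa');
        my $mask = $x ^ $y;            # NUL bytes where the strings agree
        my %h;
        while ( $mask =~ /[^\0]/g ) {
            my $i = pos($mask) - 1;    # a position where the strings differ
            $h{ substr $x, $i, 1 }++;
            $h{ substr $y, $i, 1 }--;
        }
        my $sum = 0;
        $sum += abs $_ for values %h;
        print "$sum\n";                # 6, matching the worked example above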

        Cheers - L~R

        How about counting the others (anything not acgt) as an exception class?

        my $aa="AGCTAAABBBCCC"; my %k=(a=>"a",g=>"g",c=>"c",t=>"t",else=>"[^agct]"); my %all; while( my($k,$pattern)=each %k ){ $all{$k}++ while ($aa =~ m/$pattern/gi); }
Re: The sum of absolute differences in the counts of chars in two strings.
by aaron_baugher (Curate) on Nov 19, 2011 at 19:21 UTC

    I don't suppose there's anything particularly creative about my attempts, and I'm sure they won't compare to the speed of C. But I've been looking for a reason to learn to use Benchmark, so I thought I'd give this a try.

    I tried three methods: counting the characters with tr inside an eval, counting with s///g, and splitting the strings into arrays and using grep. Somewhat surprisingly (to me, anyway, as someone who very rarely uses eval), the eval/tr method was by far the slowest. I assume that's from the overhead of eval, but as far as I know, that's necessary, since the variable has to be interpolated. Perhaps it would be faster, as someone upthread suggested, to put the whole thing into a single eval, rather than 52 separate eval calls, but I'm too attached to keeping eval usage as minimal as possible to feel good about that. The grep method was better, but not great, even though I split the strings outside the test. The easy winner was using an s///g regex to count the replacements like tr, but without the need for eval.

    abaugher@bannor:~/work/perl/monks$ cat 938978.pl
    #!/usr/bin/perl
    use Modern::Perl;
    use Benchmark qw(:all);

    my( $aa, $bb ) = ('','');
    $aa .= chr(65 + rand(26)) for (1..50);
    $bb .= chr(65 + rand(26)) for (1..50);
    say $aa;
    say $bb;

    my @aa = split //, $aa;
    my @bb = split //, $bb;

    cmpthese(100000, {
        'grep' => \&usinggrep,
        'tr'   => \&usingtr,
        'sg'   => \&usingsg,
    });

    sub usinggrep {
        my $sum = 0;
        for my $l ('A'..'Z') {
            my $acount = grep /$l/, @aa;
            my $bcount = grep /$l/, @bb;
            $sum += abs($acount - $bcount);
        }
        return $sum;
    }

    sub usingtr {
        my $sum = 0;
        for my $l ('A'..'Z') {
            my $acount = eval "'$aa' =~ tr[$l][$l]";
            my $bcount = eval "'$bb' =~ tr[$l][$l]";
            $sum += abs($acount - $bcount);
        }
        return $sum;
    }

    sub usingsg {
        my $sum = 0;
        for my $l ('A'..'Z') {
            my $acount = $aa =~ s[$l][$l]g || 0;
            my $bcount = $bb =~ s[$l][$l]g || 0;
            $sum += abs($acount - $bcount);
        }
        return $sum;
    }

    abaugher@bannor:~/work/perl/monks$ perl 938978.pl
    FHORAICOBMUCUMNYLRCUMVAMGXRRCADIZVTZTRENIEOBGNXSQT
    JIPPTFERERTBOQOPNQSGWDTTOZOTHNXPEKJACSXEQBOAPIMSHI
           Rate    tr  grep    sg
    tr    924/s    --  -45%  -88%
    grep 1682/s   82%    --  -79%
    sg   7849/s  749%  367%    --

    Aaron B.
    My Woefully Neglected Blog, where I occasionally mention Perl.

      to put the whole thing into a single eval
      You should. Try this; on my machine, it beats your grep solution:
      sub usingtr_1e {
          eval join '+',
               map qq{abs(('$aa' =~ y/$_//)-('$bb' =~ y/$_//))},
               'A' .. 'Z';
      }
      You forgot about the match operator:
      sub match {
          my $sum = 0;
          for my $l ('A'..'Z') {
              my $acount = 0;
              ++$acount while $aa =~ /$l/g;
              my $bcount = 0;
              ++$bcount while $bb =~ /$l/g;
              $sum += abs($acount - $bcount);
          }
          return $sum;
      }

      Skip eval and hard-code tr///:

      sub usingtr_d {
          my $a = $aa;
          my $b = $bb;
          # Delete ACGT first, assuming these are the
          # most common characters.
            abs($a =~ y/A//d - $b =~ y/A//d)
          + abs($a =~ y/C//d - $b =~ y/C//d)
          + abs($a =~ y/G//d - $b =~ y/G//d)
          + abs($a =~ y/T//d - $b =~ y/T//d)
          + abs($a =~ y/B// - $b =~ y/B//)
          + abs($a =~ y/D// - $b =~ y/D//)
          + abs($a =~ y/E// - $b =~ y/E//)
          + abs($a =~ y/F// - $b =~ y/F//)
          + abs($a =~ y/H// - $b =~ y/H//)
          + abs($a =~ y/I// - $b =~ y/I//)
          + abs($a =~ y/J// - $b =~ y/J//)
          + abs($a =~ y/K// - $b =~ y/K//)
          + abs($a =~ y/L// - $b =~ y/L//)
          + abs($a =~ y/M// - $b =~ y/M//)
          + abs($a =~ y/N// - $b =~ y/N//)
          + abs($a =~ y/O// - $b =~ y/O//)
          + abs($a =~ y/P// - $b =~ y/P//)
          + abs($a =~ y/Q// - $b =~ y/Q//)
          + abs($a =~ y/R// - $b =~ y/R//)
          + abs($a =~ y/S// - $b =~ y/S//)
          + abs($a =~ y/U// - $b =~ y/U//)
          + abs($a =~ y/V// - $b =~ y/V//)
          + abs($a =~ y/W// - $b =~ y/W//)
          + abs($a =~ y/X// - $b =~ y/X//)
          + abs($a =~ y/Y// - $b =~ y/Y//)
          + abs($a =~ y/Z// - $b =~ y/Z//);
      }

      Benchmarks, including roboticus' robo_1, trizen's match, and choroba's tr_1e:

                Rate     tr   grep  tr_1e     sg  match robo_1  tr_d
      tr       843/s     --   -18%   -56%   -75%   -80%   -92%  -99%
      grep    1022/s    21%     --   -47%   -70%   -76%   -90%  -99%
      tr_1e   1915/s   127%    87%     --   -44%   -55%   -81%  -98%
      sg      3419/s   306%   235%    79%     --   -20%   -66%  -97%
      match   4291/s   409%   320%   124%    25%     --   -57%  -96%
      robo_1  9984/s  1085%   877%   421%   192%   133%     --  -90%
      tr_d  100985/s 11881%  9784%  5175%  2853%  2254%   911%    --

      Tested against these strings, which are ~80% ACGT:

      my @alpha = ('A'..'Z', qw[ A C G T ] x 20);
      my $aa = join '', map $alpha[rand @alpha], 1..100;
      my $bb = join '', map $alpha[rand @alpha], 1..100;
Re: The sum of absolute differences in the counts of chars in two strings.
by bluescreen (Friar) on Nov 19, 2011 at 17:08 UTC

    Are the strings likely to be equal (or almost)? If so, you could use moving pointers and only do the counting where the characters at the pointers' positions differ, saving some cycles.
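    A minimal sketch of that idea (my interpretation, with the hypothetical name diff_positions_only; it assumes equal-length strings, where characters at matching positions contribute equally to both counts and can therefore be skipped):

    use strict;
    use warnings;

    sub diff_positions_only {
        my ($x, $y) = @_;    # assumed to be the same length
        my %h;
        for my $i ( 0 .. length($x) - 1 ) {
            my $cx = substr $x, $i, 1;
            my $cy = substr $y, $i, 1;
            next if $cx eq $cy;    # identical positions cancel; skip them
            $h{$cx}++;
            $h{$cy}--;
        }
        my $sum = 0;
        $sum += abs $_ for values %h;
        return $sum;
    }

    print diff_positions_only( 'aaaaacaacaaagcc', 'acaggtgacaaaaaa' ), "\n";    # 6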

Re: The sum of absolute differences in the counts of chars in two strings.
by sundialsvc4 (Abbot) on Nov 19, 2011 at 21:56 UTC

    It just seems to me, and please don’t take this glibly or wrongly, that the first order of business would be to try to guess ... what would actually slow it down? I can’t think of anything ...

    If the set of characters is fixed, then a fixed-size array will accumulate counts with zero difference in timing between one character and another, and there is zero possibility of page-fault activity here. Likewise, I can scarcely imagine how the time required to simply loop through the entire arrays, each and every time, to calculate the differences could ever be objectionable ... or, for that matter, varying.

    I mean, sure, you could think of “optimizations,” but in a fairly trivial case like this, they could wind up costing more time. A completely dumb, completely brute-force solution is going to crunch through any set of strings presented to it with an unchanging (and extremely rapid) cycle-time per character, no matter what the data turns out to be.

    In short, my initial reaction is that the most obvious way is the best way. I genuinely suspect that any deviation from this straight path just might wind up being significantly slower.

      I mean, sure, you could think of “optimizations,” but in a fairly trivial case like this, they could wind up costing more time.

      Time and again here, one or more of the monks has spotted a solution to a problem that at first sight appeared intractable to optimisation.

      For example: in Comparing a large set of DNA sequences, the OP describes an apparently simple process of cross-comparing 100,000 x 20-char strings as "It takes on the orders of days to process".

      Various approaches to this problem have been suggested down the years, including the use of any of several well-known fuzzy-matching and edit-distance algorithms: String::Approx, Text::Levenshtein, Text::Soundex, Text::WagnerFischer and others. But these brute-force, char-by-char O(M*N) comparisons are horribly expensive -- even when coded in XS -- and often not so useful for the desired results either, as edit distance is a quite different metric from the required 'are these two strings the same except in at most n places'.

      Another frequently offered alternative is the use of regexes, perhaps programmatically generated. Whilst these work fine and beat the edit-distance algorithms hands down, they generally require quite-to-very complex regexes that, by requirement, contain many backtrack points, with the result that their runtime performance is understandably poor.

      The root of the problem is the O(N²/2) nature of the round-robin comparison process. The quadratic growth in the number of comparisons required means that even for relatively low numbers of strings -- and in genomic work, 100,000 is low -- every extra microsecond taken performing each comparison adds an hour and a half to the processing time.

      And by the time you get to a still relatively small 1 million strings, each extra microsecond will cost you an extra 138 hours. Almost an extra week of waiting, not to mention the ~15 kWh of electricity used in the process.
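      To spell out the arithmetic behind those figures:

      N = 100,000   strings -> N²/2 = 5 x 10^9 comparisons;
                               5 x 10^9  x 1 µs = 5,000 s   ≈ 1.4 hours extra
      N = 1,000,000 strings -> N²/2 = 5 x 10^11 comparisons;
                               5 x 10^11 x 1 µs = 500,000 s ≈ 139 hours extra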

      It would likely not be at all obvious to you that using string-wise XOR would form the basis of the most efficient (pure Perl) method of doing this, but its use here reduces days of processing to a few hours.

      Even less obvious is the likelihood that anyone would offer a solution that reduces the number of comparisons required from 5 billion fuzzy matches to the order of 300,000 O(1) lookups, yet here it is.

      Whilst my question has yet to garner anything other than more efficiently implemented brute-force algorithms -- themselves very valuable in the absence of a different solution -- there are still at least a couple of avenues yet unexplored.

      So, whilst I couldn't think of any game-changing algorithm or approach to the problem myself, I'm long enough in the tooth to know that it does no harm to ask, and that asking can often lead to surprises.

      So, you'll forgive me -- or not -- if I file your "I can’t think of anything ..." non-response under the heading of: "Saw the opportunity to garner a few XP, but couldn't think of anything useful to say, so I'll say just that, and wrap it in some other plausible but meaningless drivel."


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        Omitting ... please!! ... that final paragraph from your post, which was entirely unnecessary, I found your comments up to that point extremely interesting, and I immediately followed and read both the link that you gave in the second-from-last paragraph and the entire thread.

        Please enlighten me here ... I am quite, quite serious ... what is it about your original description of the problem that I have overlooked? If the “alphabet” were a normal one, say no more than 256 elements, then we will have a 256-element array set to all-zero, and we will tally each character with zero variance in time based on character value. Then we will loop through exactly 256 accumulators, adding them up and setting them back to zero as we go. And then we will have our answer.

        As do many of us, I have plenty of experience with (differing kinds of) “massive problems” in which very slight gains, repeated hundreds of millions of times if not many times more, make all the difference in the world. Which, having said that, “is neither here nor there.” So, I fully appreciate what you are saying, right down to and including “15 kWh of electricity,” but I simply do not yet see what makes this particular problem “hard.”

        I know that it must be, or you would not have said it. (I would be an idiot to question your knowledge or experience.) But something in the description, I guess, eluded me. What did I overlook? I really do want to know.

Re: The sum of absolute differences in the counts of chars in two strings.(Results)
by BrowserUk (Patriarch) on Nov 21, 2011 at 17:28 UTC

    Here's my benchmark of a couple of PP and several I::C versions based (loosely) on the ideas posted in the thread, along with my chosen I::C version. It gains a little performance by using chars for the counters, since 127 chars is a reasonable maximum string length, and a little more by restricting the alphabet to printable ASCII chars only, which is also reasonable for the envisaged use. The benchmark results are:

            Rate     PP    trg     mb     Pb     JF     me
    PP   0.168/s     --   -92%   -96%   -97%   -97%   -97%
    trg   2.14/s  1170%     --   -55%   -58%   -59%   -68%
    mb    4.72/s  2703%   121%     --    -8%   -10%   -30%
    Pb    5.13/s  2948%   140%     9%     --    -3%   -23%
    JF    5.27/s  3031%   147%    12%     3%     --   -21%
    me    6.70/s  3881%   213%    42%    31%    27%     --

    And the code:
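    A rough sketch of the approach just described -- char-sized counters, printable-ASCII alphabet only, and a hypothetical name deltasum_ascii -- might look something like the following; this is an illustration under those assumptions, not the code that produced the numbers above:

    use strict;
    use warnings;
    use Inline C;

    print deltasum_ascii( 'aaaaacaacaaagcc', 'acaggtgacaaaaaa' ), "\n";  # 6

    __END__
    __C__
    #include <stdlib.h>

    int deltasum_ascii( char *a, char *b ) {
        /* char counters are safe while the strings stay under 128 chars */
        char counts[95] = {0};  /* one slot per printable ASCII char, 0x20..0x7E */
        int  i, sum = 0;
        for( ; *a; ++a ) ++counts[ *a - 32 ];
        for( ; *b; ++b ) --counts[ *b - 32 ];
        for( i = 0; i < 95; ++i ) sum += abs( counts[i] );
        return sum;
    }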


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: The sum of absolute differences in the counts of chars in two strings.
by TJPride (Pilgrim) on Nov 24, 2011 at 16:47 UTC
    Assuming the strings are fairly short and contain a limited number of unique characters, which is likely in this case:

    print diffCount('aaaaacaacaaagcc', 'acaggtgacaaaaaa');

    sub diffCount {
        my %counts;
        my $sum = 0;
        $counts{ substr($_[0], $_, 1) }++ for 0 .. length($_[0]) - 1;
        $counts{ substr($_[1], $_, 1) }-- for 0 .. length($_[1]) - 1;
        $sum += abs $_ for values %counts;
        return $sum;
    }

    I tried other methods using split, regex, etc., but this benchmarked the fastest. Obviously, doing the same thing in C or C++ would be even faster.