Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi how do i do a direct comparison of an array of strings. I have multiple DNA sequences stored as an array of strings and i need to compare every base at the same position in each string with the corresponding base in the other strings, identifing bases that do not match (as opposed to matches) and i need to know the position of each mismatch. Any tips would be very helpful thank you
  • Comment on How do i do a direct comparison of an array of strings?

Replies are listed 'Best First'.
Re: How do i do a direct comparison of an array of strings?
by AnomalousMonk (Archbishop) on Feb 19, 2009 at 19:38 UTC
    If the strings/strands being compared are all the same length, bitwise-xoring the strings effectively compares them. The tr operator can then be used to flag and count positions of equality. IIRC, GrandFather had a note on this a few months back.
    >perl -wMstrict -le "my ($x, $y) = qw{abXcdYef pqXrsYtu}; my $diff = $x ^ $y; $diff =~ tr{\x00-\xff}{=!}; print $diff; print 'number of differences: ', $diff =~ tr{!}{!}; my @i_diffs = do { my $i = 0; map $_->[1], grep $_->[0] eq '!', map [ $_, $i++ ], split '', $diff; }; print qq{differing positions: @i_diffs}; " !!=!!=!! number of differences: 6 differing positions: 0 1 3 4 6 7
    Updates:
    1. Found GrandFather's reply: Re: match and mismatch.
    2. Changed example to actually count and report positions of differing characters per OPer's request.
      Actually, you don't even have to bother with the  tr stuff if you only want the differing positions:
      >perl -wMstrict -le "my ($x, $y) = qw{abXcdYef pqXrsYtu}; my @i_diffs = differing_positions($x, $y); print 'number of differences: ', scalar @i_diffs; print qq{differing positions: @i_diffs}; sub differing_positions { my $i = 0; map $_->[1], grep $_->[0] ne qq{\x00}, map [ $_, $i++ ], split '', $_[0] ^ $_[1] ; } " number of differences: 6 differing positions: 0 1 3 4 6 7

        Or even:

        sub differing_positions { my @c = unpack('C*', $_[0] ^ $_[1]) ; return grep $c[$_], (0..$#c) ; }
      A regex solution is significantly faster than a ST one.

      Some benchmarks:

      use warnings; use strict; use Benchmark qw(cmpthese); my $multiplier = 10_000; my $s__ = "abcdefghij" x $multiplier; # base string my $s10 = "Abcdefghij" x $multiplier; # 10% differ my $s50 = "ABCDEfghij" x $multiplier; # 50% differ my $s90 = "ABCDEFGHIj" x $multiplier; # 90% differ sub differ_at_st { my $i = 0; return [ map $_->[1], grep $_->[0] ne qq{\x00}, map [ $_, $i++ ], split '', $_[0] ^ $_[1] ]; } sub differ_at_rx { my $diff = $_[0] ^ $_[1]; my @diffs; push @diffs, $-[1] while $diff =~ m{ ([^\000]) }xmsg; return \@diffs; } print "multiplier == $multiplier \n"; cmpthese(-10, { st_10 => sub { differ_at_st($s__, $s10) }, st_50 => sub { differ_at_st($s__, $s50) }, st_90 => sub { differ_at_st($s__, $s90) }, rx_10 => sub { differ_at_rx($s__, $s10) }, rx_50 => sub { differ_at_rx($s__, $s50) }, rx_90 => sub { differ_at_rx($s__, $s90) }, });
      Output:
      C:\@Work\Perl\monks\745091>perl 745091_1.pl multiplier == 1000 Rate st_90 st_50 st_10 rx_90 rx_50 rx_10 st_90 10.7/s -- -0% -12% -49% -73% -94% st_50 10.7/s 0% -- -11% -49% -73% -94% st_10 12.1/s 13% 13% -- -42% -70% -93% rx_90 20.9/s 95% 95% 73% -- -48% -88% rx_50 40.0/s 274% 273% 230% 91% -- -77% rx_10 177/s 1562% 1555% 1367% 750% 344% -- C:\@Work\Perl\monks\745091>perl 745091_1.pl multiplier == 10000 Rate st_90 st_50 st_10 rx_90 rx_50 rx_10 st_90 1.03/s -- -2% -11% -50% -73% -94% st_50 1.05/s 2% -- -9% -49% -72% -94% st_10 1.15/s 13% 10% -- -44% -69% -93% rx_90 2.05/s 100% 96% 78% -- -45% -88% rx_50 3.76/s 267% 260% 226% 83% -- -77% rx_10 16.6/s 1513% 1483% 1334% 707% 340% --
Re: How do i do a direct comparison of an array of strings?
by kennethk (Abbot) on Feb 19, 2009 at 15:33 UTC
    You could use nested For Loops wrapped around a comparison of two substrings. What have you tried? Do you have any code?
Re: How do i do a direct comparison of an array of strings?
by dwm042 (Priest) on Feb 19, 2009 at 16:52 UTC
    Let me point out that the result of a single comparison is going to be a list of mismatches. If your array of DNA samples is of size N, then the matrix of comparisons ends up being N x N in size (with N*(N-1) unique sets of data, as you can ignore the self comparisons).

    How you handle storing that is really up to you.

    A function for handling a single comparison and returning the results as a reference to an array is:

    sub dna_mismatches { my $dna_standard = shift; my $dna_to_compare = shift; my @mismatches = ( 0 ); # # assume same length strands # my $last_std = length($dna_standard); for ( 0 .. $last_std - 1 ) { if ( substr( $dna_standard, $_, 1 ) ne substr ( $dna_to_compare, $_, 1 )) { $mismatches[0]++; push @mismatches, $_ + 1; } } return \@mismatches; }