Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks, I'm feeling a bit stuck. I basically am writing a sub-routine that compares two strings of dna, and prints out any mismatches that occur ( where good base-pairs are defined in the sub). My problem is that if the mismatch occurs on the first or last base of the string it messes up. e.g. for a sequence:
Mismatches are present on first and last positions e.g. A/A and G/G. I want the program to also print the bases either side of the mismatches e.g. AT/AA and CG/GG. But if the mismatch is on the first position it prints GA/GA and if it is on the last position it prints G/G. How can I get it to not print [$i -1] if the mismatch has occured on the first position, and to not print [$i+1] if the mismatch is on the last position. Thanks for your help. ;-)
sub get_mismatches { my ($segment, $comp_segment) = @_; my @seg_mismatches; my @seg_mis; my $i; my %good_pairs = ( 'A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A', ); foreach (my $i =0; $i < @$segment; $i++) { if ($segment->[$i] ne $good_pairs {$comp_segment->[$i]}) { push @seg_mismatches,"$segment->[$i-1]$segment->[$i]/$com +p_segment->[$i-1]$comp_segment->[$i]\n"; push @seg_mismatches, "$segment->[$i]$segment->[$i+1]/$co +mp_segment->[$i]$comp_segment->[$i+1]\n"; push @seg_mis, "$segment->[$i-1]$segment->[$i] $segment-> +[$i]$segment->[$i+1]\n"; } } return (\@seg_mismatches, \@seg_mis); }

update (broquaint): escaped square brackets + title change (was strings)

Replies are listed 'Best First'.
Re: Finding DNA string mismatches
by broquaint (Abbot) on Apr 22, 2003 at 11:03 UTC
(jeffa) Re: Finding DNA string mismatches
by jeffa (Bishop) on Apr 22, 2003 at 14:31 UTC
    Hi. This is not an answer. I just wanted to comment that there is a CPAN module (or rather a suite of them) called bioperl (there is even a website). I am always curious why i see lots of questions like this, but no recommendations to use bioperl. I can say for the record that i have tinkered with bioperl just a bit, and it is very complex (as this pdf file can attest). However, if i were doing DNA comparisons and such on a regular basis, i would most certainly put aside some time to learn how to use bioperl instead of rolling my own solutions. At least until i had a good feel for it.


    (the triplet paradiddle with high-hat)

      Given the regularity with which this problem comes uparound here, I concluded that this problem is used as a sample question for those learning to use perl in the Bioinformatics field.

      I have this mental picture of a lecturer somewhere (or several somewheres:), giving the question out as homework, running through the myriad variations of regex and substr solutions that result and then writing 3 lines of perl (including use Bioperl;) to demonstrate the "right way" to do it:).

      As problems go, it is a nice, clean, self-contained, real-world problem with a million possible solutions just ideal for teaching purposes.

      Examine what is said, not who speaks.
      1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
      2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
      3) Any sufficiently advanced technology is indistinguishable from magic.
      Arthur C. Clarke.
Re: Finding DNA string mismatches
by Ovid (Cardinal) on Apr 22, 2003 at 15:21 UTC

    I really wanted to solve this problem as it was rather interesting, but I couldn't figure out what you were looking for in @seg_mis, so you might want to read the node that broquaint suggested. In the meantime, here's what I whipped up. It's a bit closer, but not quite what you need. If you can show sample output for both arrays, that would help.

    #!/usr/bin/perl -w use strict; use Data::Dumper; my @segments = map { [split //] } qw( ATCG AAGG ); print Dumper get_mismatches(@segments); sub get_mismatches { my ($segment, $comp_segment) = make_base_pairs(@_); my @seg_mismatches; my @seg_mis; foreach my $i ( 0 .. $#$segment ) { my $p1 = $segment->[$i]; my $p2 = $comp_segment->[$i]; if ( bad_pair($p1->[0],$p2->[0]) or bad_pair($p1->[1],$p2->[1]) ) { push @seg_mismatches => sprintf "%s%s/%s%s", @$p1, @$p2; } } return (\@seg_mismatches, \@seg_mis); } sub make_base_pairs { my @segment = @_; # assumes length of 4. Don't know if this is correct my @results; foreach my $dna (@segment) { push @results => [ [@{$dna}[0..1]], [@{$dna}[2..3]] ]; } return @results; } sub bad_pair { my ($a,$b) = @_; my %good_pairs = ( 'A'=> 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A' ); return 1 if exists $good_pairs{$a} and $good_pairs{$a} eq $b; }


    New address of my CGI Course.
    Silence is Evil (feel free to copy and distribute widely - note copyright text)