Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

hi monks, i have a string matching question. I want to be able to match complementary reversed sequences within an array. Certain letters are complements of each other e.g. A + T , C + G (in dna). I want to iterate through an array and match those strings that are complementary in the reverse orientation. e.g. GTTC + GAAC would be a match (GTTC complement is CAAG, which reversed is GAAC).

I am getting confused, do I need some sort of look-up table which says A -> T , C->G , G->C , T->A ?? If so, how can I do this?

Any help i'll be very grateful ;-> # below is my snippet of code, it is wrongly looking for exact matches, but I need to look for reverse complementary matches.

for (my $i=0; $i < @array1; $i++) { $dna = reverse $array1[$i]; $dna =~ tr/actg/tgac/; if ($dna eq $array1[$i]) { # problem is here, I am testing for equality, but need to test for com +plementarity! print "\nFOUND $array1[$i] \t $dna"; } }

Replies are listed 'Best First'.
Re: string complement question
by Limbic~Region (Chancellor) on Nov 21, 2003 at 13:45 UTC
    Anonymous Monk,
    Here are the assumptions I made since your original post is unclear:

  • You want to compare each element of an array against every other element in the array
  • You only want to reverse one of the two elements being matched
  • Elements will all be uppercase, contain only the letters T A C G, and will be the same length
  • You do not want to repeat matches

    #!/usr/bin/perl -w use strict; my @dna = qw(GTTC TTTT GGGG CCCC GAAC); my %seen; for my $index_1 ( 0 .. $#dna ) { next if $seen{$index_1}; $seen{$index_1} = 1; my $dna1 = reverse $dna[ $index_1 ]; for my $index_2 ( 0 .. $#dna ) { next if $seen{$index_2}; my $dna2 = $dna[ $index_2 ]; if ( match($dna1 , $dna2) ) { $seen{$index_2} = 1; print join " : " , $dna1, $index_1, $dna2, $index_2; print $/; } } } sub match { my @dna = map { [ split // ] } @_; my %val = ( A => -1, T => 1, G => -2, C => 2 ); for ( 0 .. $#{$dna[0]} ) { return 0 if $val{ $dna[0]->[$_] } + $val{ $dna[1]->[$_] }; } return 1; }

    Cheers - L~R

Re: string complement question
by Abigail-II (Bishop) on Nov 21, 2003 at 12:21 UTC
    Well, $dna is the reversed complement of $array1[$i]. But you are comparing it to itself. Now there will be strings that will match this way (AATT for instance), but you say you want GTTC to match GAAC.

    That's two things you want to compare. But you are comparing one thing with itself. I'm not sure what you where you want to compare your dna fragments with.

    Abigail

Re: string complement question
by Art_XIV (Hermit) on Nov 21, 2003 at 13:49 UTC

    Will this help?

    use strict; my %sequences; while (<DATA>) { chomp; $sequences{$_} = $.; } for my $seq (keys %sequences) { my $rev_compl = get_rev_compl($seq); print "$seq matches $rev_compl on line $sequences{$rev_compl}\n" if exists $sequences{$rev_compl}; } sub get_rev_compl { my ($seq) = @_; $seq =~ tr/GCAT/CGTA/; return reverse $seq; } 1; __DATA__ GATTACA TTAGCAA CATCATC TTGCTAA TGTAATC GATGATG

    I put the sequences into a hash instead of an array to (hopefully) increase the lookup speed. This won't work if the sequences aren't unique, though.

    Hanlon's Razor - "Never attribute to malice that which can be adequately explained by stupidity"
      Easily fixed by remembering all indices, e.g. like this:
      #!/usr/bin/perl -w use strict; chomp(my @sequences = <DATA>); my (%sequences, $i); push @{$sequences{$_}}, $i++ for @sequences; for (keys %sequences) { my $complement = reverse; $complement =~ tr/GCAT/CGTA/; print "$_ complements sequence nr @{$sequences{$complement}}\n" if $sequences{$complement}; } __DATA__ GATTACA GATTACA TTAGCAA CATCATC TTGCTAA TGTAATC GATGATG
Re: string complement question
by b10m (Vicar) on Nov 21, 2003 at 12:28 UTC
    This might be a horribly slow sollution, but I'll give it anyway, and have the real monks take a shot at ;)
    my $i; my $j; for ($i=0; $i < @array1; $i++) { $dna = reverse $array1[$i]; $dna =~ tr/actg/tgac/; for ($j=0; $j < @array1; $j++) { if ($dna eq $array1[$j]) { print "FOUND $dna\t$array[$j]\n"; } } }
    Hope this is what you mean?

    --
    B10m

      thanks B10m, but this is still just getting the exact match! oh well...
        Mmmm, my bad. I wasn't paying much attention (apparantly). I was still checking and printing the same stuff, which didn't really make sense. This should do the trick (still slow, and I'm looking forward to faster ways :)
        my $i; my $j; my $dna; my $temp_dna; for ($i=0; $i < @array1; $i++) { $dna = $array1[$i]; $temp_dna = $dna; $dna = reverse $dna; $dna =~ tr/actg/tgac/; for ($j=0; $j < @array1; $j++) { if ($dna eq $array1[$j]) { print "FOUND $temp_dna\t$array[$j]\n"; } } }
        Please note that I made a lot of steps to get to this. This could be shortened, but this is more readable (for me at least ;)
        --
        B10m
Re: string complement question
by duff (Parson) on Nov 21, 2003 at 16:17 UTC

    From what you've described, your code does what you want. I.e., the string "gttcgaac" matches because if you compliment it and reverse it you get "gttcgaac" which means that the original string is its own reversed compliment.

            gttcgaac
            caagcttg
    
    Now, if you want to find other occurences of the reversed compliment within the array, you'll have to cache all of the reversed compliments somewhere and then do equality comparisons against each element of the array.

    Or did you mean something different?