AWallBuilder has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I have an alignment file and I want to analyze it and output all of the positions that are not conserved. I was thinking of reading in the file as an array of arrays. Below is an example input file. And below that is my current code. Which currently produces the wrong output, and I feel is probably not the most correct way to go about this. Any help is appreciated.

5 15 1 ATCG--ATCG-ATCG 2 ATGC--ATCG-ATCG 3 ATGC-A-TCG-ATCG 4 ATGC--ATCG-ATCG 5 ATCG--ATCG-AACG
use strict; use warnings; my $align_file=$ARGV[0]; #.py (phylip file right now) open (F, $align_file) || die "Could not open $align_file: $!"; my @align_matrix; my %HoTaxids; my @columns; while (my $line = <F>) { chomp($line); <F>; # skip firts line of phylip next if $line =~ /^\s*$/; # skip blank lines my ($taxid,$align_line)=split(/\s/,$line); @columns = split(undef, $align_line);#splitting on undef split +s each character push (@align_matrix, \@columns); $HoTaxids{$taxid}=1; } close(F); my $num_rows=scalar(keys %HoTaxids); my $align_len=scalar(@columns); for (my $i=0;$i<$num_rows;$i++) { for (my $j=0;$j<$align_len;$j++){ if ($align_matrix[$i][$j] ne $align_matrix[$i][$j+1]) { print "position\t$j\t is not conserved\n"; } } }

Replies are listed 'Best First'.
Re: manipulating alignment data
by moritz (Cardinal) on Mar 28, 2012 at 11:43 UTC
    I guess by "not conserved" you mean you're looking for columns that aren't the same in each line. If that's true, this might get you started:
    use strict; use warnings; my $first = <DATA>; chomp $first; my $template = ''; while (<DATA>) { chomp; $template |= $first ^ $_; } for (split //, $template) { print ord == 0 ? '.' : 'X'; } print "\n"; __DATA__ ATCG--ATCG-ATCG ATGC--ATCG-ATCG ATGC-A-TCG-ATCG ATGC--ATCG-ATCG ATCG--ATCG-AACG

    The output produces a dot for each column where all rows have the same character, and an X for those where there are at least two different characters:

    ..XX.XX.....X..

    If that's not what you want, please make your description more verbose, and include the expected output for your example input.

      for (split //, $template) { print ord == 0 ? '.' : 'X'; } print "\n";

      Another way to do that:

      $template =~ tr/\0/X/c; $template =~ tr/\0/./; print "$template\n";

      But this looks like it produces the output the OP wants:

      print "position $-[0] is not conserved\n" while $template =~ /[^\0]/g;

      thank you.Yes, by conserved I mean columns where there is more than one letter. Could you please explain the following lines more verbosely. Also, I need the position, so I guess in your case instead of printing I could feed it into and array or hash and then print out the positions with an X

      $template |= $first ^ $_;
      print ord == 0 ? '.' : 'X';

        infix ^ is the bitwise XOR operator, so the resulting string from $first ^ $_ has a zero-byte at every position where $first and $_ are equal.

        Likewise | is the bitwise OR operator, so $template accumulates non-zero bytes at any character position where any row differed from $first.

        print ord == 0 ? '.' : 'X';

        ord == 0 simply checks if $_ is the zero-byte. If yes, a dot is returned, otherwise an X.

        For more information look into perlop, search for "bitwise" and "ternary".

Re: manipulating alignment data
by jwkrahn (Abbot) on Mar 28, 2012 at 11:39 UTC
    my current code. Which currently produces the wrong output

    What does the correct output look like?



    while (my $line = <F>) { chomp($line); <F>; # skip firts line of phylip

    That last line will skip every second line of the file (lines 2, 4, 6, 8, etc.)    If you only want to skip the first line then:

    <F>; # skip first line of phylip while (my $line = <F>) { chomp($line);

      oops, yes should put the <F> before the while loop, thanks.

      the correct output,should be:

      position 2 is not conserved position 3 is not conserved position 5 is not conserved position 6 is not conserved position 8 is not conserved

      i am getting that all positions EXCEPT position 4 are conserved.

Re: manipulating alignment data
by AnomalousMonk (Archbishop) on Mar 28, 2012 at 16:18 UTC
    >perl -wMstrict -le "use List::Util qw(reduce); use vars qw($a $b); ;; my @seqs = qw( ATCG--ATCG-ATCG ATGC--ATCG-ATCG ATGC-A-TCG-ATCG ATGC--ATCG-ATCG ATCG--ATCG-AACG ); ;; my $cons = reduce { $a ^ $b } @seqs; defined($cons) or die 'no sequences given'; ;; my $mask = @seqs % 2 ? $seqs[0] : (qq{\0} x length $seqs[0]); $cons ^= $mask; $cons =~ tr{\0-\xff}{.X}; print qq{'$cons'}; " '..XX.XX.....X..'

    The code assumes all sequences have the same length. (I'm not sure just what happens if they don't.) The  use vars qw($a $b); statement suppresses some pesky "Name ... used only once: possible typo at..." messages.

    BTW: Both moritz and I get 2, 3, 5, 6 and 12 (not 8) as non-conserved positions. Is this correct?

    Important Update: Bug! Sorry, this solution is sensitive to data sequence. To see the problem exemplified, try the sequence set

    my @seqs = qw( CTCG--ATCG-ATCG CTGC--ATCG-ATCG ATGC-A-TCG-ATCG ATGC--ATCG-ATCG );