in reply to [BioPerl] add_seq gives warning: why?

There were a few issues with your posted code, but in simple script form this gave the same result:

use strict;
use warnings;
++$|; ## buffering off

my %align;
my $count = 0;
my %c2name;

use Bio::SimpleAlign;
my $self = Bio::SimpleAlign->new();

print "---> Reading data\n";
while( <DATA> ) {
    /^([^\#]\S+)\s+([A-Za-z\.\-]+)\s*/ && do {
        my $name = $1;
        my $seq = $2;
        if( ! defined $align{$name} ) {
            $count++;
            $c2name{$count} = $name;
        }
        $align{$name} .= $seq;
        print "Count >$count< - Adding Name >$name<\n\tSeq >$seq<\n";
    };
}

print "---> Forming alignment\n";
$count = 0;
foreach my $no ( sort { $a <=> $b } keys %c2name ) {
    my $name = $c2name{$no};
    my( $seqname, $start, $end, $strand );
    if( $name =~ /(\S+)\/(\d+)-(\d+)$/ ) {
        $seqname = $1;
        $start = $2;
        $end = $3;
    } elsif ( $name =~ /(\S+)\/(\d+)-(\d+):(\d+)-(\d+)/ ) {
        $seqname = $1;
        my $ns = $2;
        my $s = $3;
        my $e = $4;
        my $ne = $5;
        $start = "$ns-$s";
        $end = "$e-$ne"; # surprise: this is legal
        $strand = 1;
    }
    ## make sure id is unique
    #$seqname .= 'x' while ( exists $align{id}{$seqname} );
    #++$align{id}{$seqname};
    print "Name >$name<\n\tID >$seqname<\n";
    my $seq = new Bio::LocatableSeq(
        '-seq'=>$align{$name},
        '-id'=>$seqname,
        '-start'=>$start,
        '-end'=>$end,
        '-strand'=>$strand,
        '-type'=>'aligned'
    );
    $self -> add_seq($seq);
    $count++;
}
print "Count : $count\n";

__DATA__
hit1_EF374296.1_1-432/1-432              uauGGAAACWUACU
hit1_AM161438.1_1-497/20-516             gAGAAACCCUGGAA
hit1_AM161438.1_1-497/1-1:497-993        gGAAAAUCCGUCGA
hit1_EF374296.1_1-432/1-1:432-863        UGAAAAUCCGUCGA
hit1_EF374296.1_509-949/509-509:949-1389 GGAAAAUCCGUCGA
hit1_EF374296.1_509-949/938-1382         AUAGUAAGAGGAAA
hit1_EF374297.1_30-470/30-30:470-910     GGAAAAUCCGUCGA

which gave:

---> Reading data
Count >1< - Adding Name >hit1_EF374296.1_1-432/1-432<
	Seq >uauGGAAACWUACU<
Count >2< - Adding Name >hit1_AM161438.1_1-497/20-516<
	Seq >gAGAAACCCUGGAA<
Count >3< - Adding Name >hit1_AM161438.1_1-497/1-1:497-993<
	Seq >gGAAAAUCCGUCGA<
Count >4< - Adding Name >hit1_EF374296.1_1-432/1-1:432-863<
	Seq >UGAAAAUCCGUCGA<
Count >5< - Adding Name >hit1_EF374296.1_509-949/509-509:949-1389<
	Seq >GGAAAAUCCGUCGA<
Count >6< - Adding Name >hit1_EF374296.1_509-949/938-1382<
	Seq >AUAGUAAGAGGAAA<
Count >7< - Adding Name >hit1_EF374297.1_30-470/30-30:470-910<
	Seq >GGAAAAUCCGUCGA<
---> Forming alignment
Name >hit1_EF374296.1_1-432/1-432<
	ID >hit1_EF374296.1_1-432<
Name >hit1_AM161438.1_1-497/20-516<
	ID >hit1_AM161438.1_1-497<
Name >hit1_AM161438.1_1-497/1-1:497-993<
	ID >hit1_AM161438.1_1-497<
Name >hit1_EF374296.1_1-432/1-1:432-863<
	ID >hit1_EF374296.1_1-432<
Name >hit1_EF374296.1_509-949/509-509:949-1389<
	ID >hit1_EF374296.1_509-949<

-------------------- WARNING ---------------------
MSG: Replacing one sequence [hit1_EF374296.1_1-432/1-432]
---------------------------------------------------
Name >hit1_EF374296.1_509-949/938-1382<
	ID >hit1_EF374296.1_509-949<
Name >hit1_EF374297.1_30-470/30-30:470-910<
	ID >hit1_EF374297.1_30-470<
Count : 7

The problem, I think, was using $seqname as your object id rather than the full (unique) name. Maybe Bio::SimpleAlign needs unique ids?
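To see how often the shortened ids collide, here is a rough sketch in plain Perl (no BioPerl needed). The helper name count_base_ids is made up for illustration; the names are copied from the __DATA__ section of the script, and the "base id" is assumed to be everything before the final '/'.

```perl
use strict;
use warnings;

# Names copied from the __DATA__ section of the script.
my @names = qw(
    hit1_EF374296.1_1-432/1-432
    hit1_AM161438.1_1-497/20-516
    hit1_AM161438.1_1-497/1-1:497-993
    hit1_EF374296.1_1-432/1-1:432-863
    hit1_EF374296.1_509-949/509-509:949-1389
    hit1_EF374296.1_509-949/938-1382
    hit1_EF374297.1_30-470/30-30:470-910
);

# Hypothetical helper: count how often each base id (the part before
# the final '/') occurs in the list of names.
sub count_base_ids {
    my %seen;
    for my $name (@_) {
        ( my $base = $name ) =~ s{/[^/]*$}{};   # strip the position info
        $seen{$base}++;
    }
    return %seen;
}

# Report only the base ids that are shared by more than one name.
my %seen = count_base_ids(@names);
for my $id ( sort keys %seen ) {
    print "$id occurs $seen{$id} times\n" if $seen{$id} > 1;
}
```

With this data, three of the four base ids are shared by two names each, so $seqname on its own cannot tell those sequences apart.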

Anyway, I added the bit that makes sure the ids are unique (currently commented out in the code above) while hopefully still making sense to you. With it enabled, the output is the same as above, but the warning is gone.

Maybe this is a little bit like just turning off warnings... but the problem does stem from your ids, not the code, so I think this is a reasonable workaround, and it doesn't rely on users always having to provide unique ids...
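For anyone who wants the uniquifying step on its own, here is a minimal sketch of the same idea pulled out of the script. The helper make_uniquifier is hypothetical, and it keeps its own hash of used ids rather than reusing %align as the commented-out lines do.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a closure that appends 'x' to an id
# until the result has not been handed out before.
sub make_uniquifier {
    my %used;
    return sub {
        my ($id) = @_;
        $id .= 'x' while exists $used{$id};   # same trick as in the script
        $used{$id} = 1;
        return $id;
    };
}

my $unique = make_uniquifier();
print $unique->($_), "\n" for qw(seqA seqB seqA seqA);
# prints seqA, seqB, seqAx, seqAxx (one per line)
```

In the script itself this would amount to running $seqname through the closure just before building the Bio::LocatableSeq object.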

Hope this helps?

Just a something something...

Re^2: [BioPerl] add_seq gives warning: why?
by BioNick (Initiate) on Jan 12, 2010 at 13:14 UTC
    Your code makes sense: you add an x to a name if it already exists, making it unique. In theory this may lead to names with endless rows of x's (in the real world there will not be many more than 2 or 3), but they are still unique because of them. It's a very simple workaround and I like it! In the end, I don't want the x's to turn up in my alignment files, but I can easily remove them later.

    Thanks!

      Probably a better workaround might be to use the full name (including the position info) as the id, rather than what is currently happening. Presumably there can't be sequences with the same base name and position? This way you don't have to worry about taking x's off etc...
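A quick way to check that assumption on your own data, sketched in plain Perl: the helper all_unique is made up for illustration, and the names are the ones from the __DATA__ section. I have not tested how Bio::SimpleAlign itself treats the full name when it is passed as '-id', so take this only as a check on the names.

```perl
use strict;
use warnings;

# Hypothetical helper: true if every value in the list is distinct.
sub all_unique {
    my %seen;
    $seen{$_}++ for @_;
    return !grep { $_ > 1 } values %seen;
}

# Names copied from the __DATA__ section of the script.
my @names = qw(
    hit1_EF374296.1_1-432/1-432
    hit1_AM161438.1_1-497/20-516
    hit1_AM161438.1_1-497/1-1:497-993
    hit1_EF374296.1_1-432/1-1:432-863
    hit1_EF374296.1_509-949/509-509:949-1389
    hit1_EF374296.1_509-949/938-1382
    hit1_EF374297.1_30-470/30-30:470-910
);

# Base ids: everything before the '/' that introduces the position info.
my @base = map { ( split m{/}, $_ )[0] } @names;

print "full names unique: ", ( all_unique(@names) ? "yes" : "no" ), "\n";
print "base ids unique:   ", ( all_unique(@base)  ? "yes" : "no" ), "\n";
```

If the full names come out unique and the base ids do not, passing $name instead of $seqname as the id would be the thing to try.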

      Just a something something...