Hi all, I understand this should need just a minor tweak, but I couldn't come up with a sensible idea.
What I really want is to find the position of each mismatch (and the actual mismatch) between the Query and the Sbjct in the input file.

I have attached my code and the input file that it needs. My code works fine if the line it works on has just fifteen characters (including spaces) to be removed at the beginning of the alignment (i.e., Sbjct\s+\d{5}). But in some cases there are sixteen characters, and then it fails and gives a wrong mismatch position!

For example,
Case 1, with 15 characters/spaces (counting from Sbjct to the end of the number before the alignment starts):

Query  10550  CTTGGTTAGTACTGAATCCCATATATACTATGTTTTTCCTATACATATGTACTTATGATA  10609
              ||||| ||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  74391  CTTGGATAGTACTGAATCCCATATATACTATGTTTTTCCTATACATATGTACTTATGATA  74332

Since I have the substr set to cut at 15, it breaks when there are 16 characters/spaces.
Example:

Query  16319   CCCACTCGGGCCCGGCTCCAGCTCCTGCACCGCCTGGGCCAGCCTCCGCATGTTAAGGGC  16378
               ||||||||||||| |||||||||||||||||||||||||||||| |||||||||||||||
Sbjct  140831  CCCACTCGGGCCCCGCTCCAGCTCCTGCACCGCCTGGGCCAGCCACCGCATGTTAAGGGC  14077

Here the Sbjct has a 6-digit number, and hence the alignment is shifted by one space. It starts at the 16th position rather than the 15th as in the previous case. This results in wrong positions.
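One possible way around the fixed width, as a minimal sketch only: measure the label-plus-number prefix on each line instead of hard-coding it. The helper name `alignment_offset` is made up for illustration and is not part of my script. It assumes the line starts with `Query` or `Sbjct`, followed by a number and spaces, as in the two examples above.

```perl
use strict;
use warnings;

# Hypothetical helper: return the zero-based column where the aligned
# sequence starts, measured from the line itself rather than assumed.
sub alignment_offset {
    my ($line) = @_;
    return $line =~ /^((?:Query|Sbjct)\s+\d+\s+)/ ? length $1 : undef;
}

# Shortened versions of the two Sbjct lines shown above.
my $short = 'Sbjct  74391  CTTGGATAGT';
my $long  = 'Sbjct  140831  CCCACTCGGG';

printf "%d %d\n", alignment_offset($short), alignment_offset($long);
# prints "14 15"
```

The measured prefixes are 14 and 15 characters, so the sequence starts in the 15th and 16th column respectively. If the lines are cut at this measured offset, the zero-based index of a space in the match line could be added to (Plus) or subtracted from (Minus) the Sbjct start directly, without an extra 1.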
My code


use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
        -host => 'ensembldb.ensembl.org',
        -user => 'anonymous'
);

my $home = $ENV{'HOME'};


my($ID, $query, $off, $idi, $subject, $ref, $st);


print "ID\tposition\tvariation\tRef Genome coordinates\n";
unless(open DATA, "Input_files/Contig_Alignment_Selected_3.txt"){die "Cannot open the file $! \n";}
while(<DATA>) {
 chomp;

  if(m[^>]) { #Checks the start of the alignements 
  
   ($ID) = (split '\|',$_)[1];#splits the first line with '|' 
   ($ref) = $ID =~ /(\d+)\s+ref$/;
   
  }
 if(/^\s+Identities/){ #gets the percentage of identity
   
   my($identity, undef) = split/,/ ;
   # \d+ instead of \d{3}, so that counts such as 9014/9037 also match
   ($idi) = $identity =~ /\sIdentities\s\=\s\d+\/\d+\s\((\d{2,3}\%)\)$/;
 } 
 
 if(/^\s+Strand/){ #check strands Plus/Minus
  
  ($st) = $_ =~/^\s\w+\=\w{4}\/(\w{4,5})$/;
  
 }

  

  if(m/^Query/) {
  ($query) = m[^Query\s+(\d+)];
  
   my $top = substr $_, 15;#substring the first 15 char
  
   my $pipes = substr <DATA>,15; #same, if the Sbjct is more than 5 numbers then this doesn't work
   my $subject = <DATA>;
   my($value) = $subject  =~ /^Sbjct\s+(\d+)/;
 
   my $bot = substr $subject, 15;#if the Sbjct is more than 5 numbers then this doesn't work
   my $p = 0 ;
   while ($p = 1+index $pipes,' ', $p) {
       
    my $pos1 = $value-$p;
 
    my $pos2 = $value+$p;
    my $var1 = substr( $top, $p-1, 1 );
    my $var2 = substr( $bot, $p-1, 1 );
   # my $genomref1 = 4900000 + $pos1;
    my $genomref2 = 4899999 + $pos2;
     if($st eq "Minus") {
      
     
      print join"\t", $ref,$pos1, $var1."/".$var2,$genomref2 ;
      
      snpdetails($genomref2);
      
      
      
     }else{
     print join "\t", $ref,$pos2, $var1."/".$var2,$genomref2;
     snpdetails($genomref2); 
     }
         
   }
  }
  
 } # end of while(<DATA>)
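For comparison, here is a rough sketch of how one Query / match / Sbjct block could be walked with a measured offset instead of `substr ..., 15`. This is an illustration under assumptions, not a tested replacement for the code above: the sub name `block_mismatches` and the hash keys are made up, the example lines are shortened from Case 1, and the Ensembl lookup (`snpdetails`) is left out. It treats every column whose match character is not `|` as a difference, and it does not advance a coordinate on a `-` gap character.

```perl
use strict;
use warnings;

# Hypothetical helper: list the differences in one three-line alignment block.
sub block_mismatches {
    my ($query_line, $match_line, $sbjct_line, $strand) = @_;

    # Measure the prefix width; Query and Sbjct are padded to the same column.
    my ($prefix, $q_start) = $query_line =~ /^(Query\s+(\d+)\s+)/ or return;
    my ($s_start)          = $sbjct_line =~ /^Sbjct\s+(\d+)/      or return;
    my $offset = length $prefix;

    my ($q_seq) = substr($query_line, $offset) =~ /^(\S+)/;
    my ($s_seq) = substr($sbjct_line, $offset) =~ /^(\S+)/;
    my $marks   = length($match_line) >= $offset
                ? substr($match_line, $offset, length $q_seq)
                : '';
    # Pad in case trailing spaces were stripped from the match line.
    $marks .= ' ' x (length($q_seq) - length $marks);

    my $step = $strand eq 'Minus' ? -1 : 1;   # Sbjct counts down on Minus
    my ($q_pos, $s_pos) = ($q_start, $s_start);
    my @found;
    for my $i (0 .. length($q_seq) - 1) {
        my $q = substr $q_seq, $i, 1;
        my $s = substr $s_seq, $i, 1;
        push @found, { query_pos => $q_pos, sbjct_pos => $s_pos, change => "$q/$s" }
            if substr($marks, $i, 1) ne '|';
        $q_pos++        if $q ne '-';   # gaps do not consume a coordinate
        $s_pos += $step if $s ne '-';
    }
    return @found;
}

# Shortened Case 1 block: prefix of 14 characters, mismatch in the 6th column.
my @hits = block_mismatches(
    'Query  10550  CTTGGTTAGT  10559',
    (' ' x 14) . '|||||' . ' ' . '||||',
    'Sbjct  74391  CTTGGATAGT  74382',
    'Minus',
);
print join("\t", @{$_}{qw(query_pos sbjct_pos change)}), "\n" for @hits;
# prints "10555	74386	T/A"
```

The same call on a block with a 6-digit Sbjct number uses an offset of 15 automatically, so both cases go through one code path. For a gap column the reported position is that of the next base on the gapped sequence, which may or may not be what is wanted.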

Input file

BLASTN 2.2.24+
Reference: Zheng Zhang, Scott Schwartz, Lukas Wagner, and
Webb Miller (2000), "A greedy algorithm for aligning DNA
sequences", J Comput Biol 2000; 7(1-2):203-14.


RID: 5ZHMGK7311R

Query=  NODE_16_length_35408_cov_15.061031
Length=35478


                                                                   Score        E
Sequences producing significant alignments:                       (Bits)     Value

lcl|14079  ref|NC_000009.11|:4900000-5300000 Homo sapiens chro...  1.655e+04  0.0  

ALIGNMENTS
>lcl|14079 ref|NC_000009.11|:4900000-5300000 Homo sapiens chromosome 9, 
GRCh37 primary reference assembly
Length=400001

 Score = 1.655e+04 bits (8960),  Expect = 0.0
 Identities = 9014/9037 (99%), Gaps = 15/9037 (0%)
 Strand=Plus/Minus

Query  10190  TGGAGTGCAGTGGCGCAATCTCGGCTCACTGCAAGCATCGCCTCCTGGGTTCACGCCATT  10249
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  74751  TGGAGTGCAGTGGCGCAATCTCGGCTCACTGCAAGCATCGCCTCCTGGGTTCACGCCATT  74692

Query  10250  CTCCTGCCTCAGCCTCCCGAGTAGCTGGGACTACAGGCATCTGCCACCATGCCCCACTAA  10309
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  74691  CTCCTGCCTCAGCCTCCCGAGTAGCTGGGACTACAGGCATCTGCCACCATGCCCCACTAA  74632

Query  10310  ttttttctattttttAGTAGAGACGGGGTTTCACCATGTTAGCCAGGATGGTCTCGATCT  10369
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  74631  TTTTTTCTATTTTTTAGTAGAGACGGGGTTTCACCATGTTAGCCAGGATGGTCTCGATCT  74572

Query  10370  CCTGACCTCGTGATCCGCCCACCTCAGCCTCCCAAAGTGCTGGGATTACAGGCGTGAGCC  10429
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  74571  CCTGACCTCGTGATCCGCCCACCTCAGCCTCCCAAAGTGCTGGGATTACAGGCGTGAGCC  74512


Query  36624   aTGTTTTGAGCATATAGGGAAAATTTATAAAAATTGGCCATGATGaaacataagctcaaa  36683
               ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  100670  ATGTTTTGAGCATATAGGGAAAATTTATAAAAATTGGCCATGATGAAACATAAGCTCAAA  100611

Query  36684   aagtttaaaaagaaaactcctaaaagttggcataacaaagcctaaaaaTCATTTCAAACT  36743
               ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  100610  AAGTTTAAAAAGAAAACTCCTAAAAGTTGGCATAACAAAGCCTAAAAATCATTTCAAACT  100551

Query  36744   TGGTATAACTGTTACTAGAAAACCATCTACACAATGACTATATATATGCCTTTATTTCAT  36803
               ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  100550  TGGTATAACTGTTACTAGAAAACCATCTACACAATGACTATATATATGCCTTTATTTCAT  100491

Query  36804   TTTTATGTTACGCTTCTCTTTATATTTGAATCATTCCTTTAAACTACATAAACATTTTCA  36863
               ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  100490  TTTTATGTTACGCTTCTCTTTATATTTGAATCATTCCTTTAAACTACATAAACATTTTCA  100431

Query  36864   AGTGTTTGTAAATACCCTTTTAAAAATTACTGCTGTTAGCTGTTCTTCATGATTTTCTTA  36923
               ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  100430  AGTGTTTGTAAATACCCTTTTAAAAATTACTGCTGTTAGCTGTTCTTCATGATTTTCTTA  100371

Query  36924   CTGGTCTCCTTACACATTCGAAATTGGACATTTCCGACTATTTCCTTGGTATGTTTTATA  36983
               ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  100370  CTGGTCTCCTTACACATTCGAAATTGGACATTTCCGACTATTTCCTTGGTATGTTTTATA  100311
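The header lines of each alignment in this file (Identities, Strand) could be read with patterns that do not depend on the number of digits. The following is only a sketch under that assumption; the sub name `parse_header_line` and the hash keys are invented for illustration, and the two sample lines are copied from the input above.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the identity counts and the strands out of
# the per-alignment header lines and store them in a hash reference.
sub parse_header_line {
    my ($line, $info) = @_;
    if ($line =~ m{^\s*Identities\s*=\s*(\d+)/(\d+)\s*\((\d+)%\)}) {
        @{$info}{qw(matches aligned identity_pct)} = ($1, $2, $3);
    }
    elsif ($line =~ m{^\s*Strand\s*=\s*(\w+)/(\w+)}) {
        @{$info}{qw(query_strand sbjct_strand)} = ($1, $2);
    }
    return $info;
}

my %info;
parse_header_line($_, \%info) for
    ' Identities = 9014/9037 (99%), Gaps = 15/9037 (0%)',
    ' Strand=Plus/Minus';

print join("\t", @info{qw(identity_pct sbjct_strand)}), "\n";
# prints "99	Minus"
```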

All I need is to get the right position in either case, 15 or 16 (Sbjct 74571 / Sbjct 100370).
I appreciate all your help and suggestions. Thanks in advance for your time.
Regards

In reply to finding the position by Anonymous Monk
