in reply to How to find any of many motifs?

While you wait for BioPerl to download and install, and then work out how to use it, this might do the job :)

#! perl -slw use strict; open MOTIFS, '<', '845226.motif' or die $!; my @motifs = <MOTIFS>; close MOTIFS; chomp @motifs; open SEQS, '<', '845226.fasta' or die $!; while( my $seqID = <SEQS> ) { chomp $seqID; $seqID =~s[^>][]; my $seq = ''; { local $/ = ">"; chomp( $seq .= <SEQS> ); $seq =~ tr[\n][]d; } for my $motif ( @motifs ) { if( $seq =~ m[$motif([^A-Z]+)[A-Z]] ) { print "$seqID contains $motif; distance ", length( $1 ); } } } close SEQS; __END__ C:\test>845226 uc002yje.1 chr21:13973492-13976330 contains gccccac; distance 88 uc002yje.1 chr21:13973492-13976330 contains gccccac; distance 92 uc002yje.1 chr21:13973492-13976330 contains gggggaaaaaacc; distance 2 +0 uc002yje.1 chr21:13973492-13976330 contains gccccac; distance 90 uc002yje.1 chr21:13973492-13976330 contains agagggccc; distance 4

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
RIP an inspiration; A true Folk's Guy

Replies are listed 'Best First'.
Re^2: How to find any of many motifs?
by Nikulina (Novice) on Jun 18, 2010 at 07:59 UTC
    Thanks a million for your help. I hope I'll finish soon BioPerl installation. Your script is great and it does the job I tried to get from mine. But I have one problem: in the case when the gene doesn't contain any of motifs, the output looks like following:
    NM_002232 range=chr1:111015833-111020178 contains ; distance 1014
    So, as far as I understand it prints out the length of the whole region situated before the exon start point (all the small letters ). Actually this kind of information is not required in the task. How this can be skipped in the outpt? Thank you once more for your great contribution into my biologilcal invstigations %)!!! P.S: this morning I have modified my script so, it almost does the job too, but there are still some problems to solve. But your support has helped me a lot!

      Never mind. I think I see the problem.

      Your motifs file contains some blank lines. A blank line will become undef, and undef will match anything.

      A simple modification to my code to deal with this is:

      if( defined $motif and $seq =~ m[$motif([^A-Z]+)[A-Z]] ) {

      It would perhaps be better to deal with this at source--ie. remove blank lines from the motifs file--but data is rarely perfect in the real world, so the above fix is as good a way of dealing with it as any.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
      But I have one problem: in the case when the gene doesn't contain any of motifs, the output looks like following:
      NM_002232 range=chr1:111015833-111020178 contains ; distance 1014

      So, as far as I understand it prints out the length of the whole region situated before the exon start point (all the small letters ). Actually this kind of information is not required in the task. How this can be skipped in the outpt?

      Hm. That shouldn't happen. It should only print something if a match is found.

      Could you please post the relevant motif & sequence that correspond to the above output so that I can use it to debug my code?


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.