Re: How to find any of many motifs?
by BrowserUk (Patriarch) on Jun 17, 2010 at 18:07 UTC
|
While you wait for BioPerl to download and install, and then work out how to use it, this might do the job :)
#! perl -slw
use strict;
open MOTIFS, '<', '845226.motif' or die $!;
my @motifs = <MOTIFS>;
close MOTIFS;
chomp @motifs;
open SEQS, '<', '845226.fasta' or die $!;
while( my $seqID = <SEQS> ) {
chomp $seqID;
$seqID =~s[^>][];
my $seq = '';
{
local $/ = ">";
chomp( $seq .= <SEQS> );
$seq =~ tr[\n][]d;
}
for my $motif ( @motifs ) {
if( $seq =~ m[$motif([^A-Z]+)[A-Z]] ) {
print "$seqID contains $motif; distance ", length( $1 );
}
}
}
close SEQS;
__END__
C:\test>845226
uc002yje.1 chr21:13973492-13976330 contains gccccac; distance 88
uc002yje.1 chr21:13973492-13976330 contains gccccac; distance 92
uc002yje.1 chr21:13973492-13976330 contains gggggaaaaaacc; distance 2
+0
uc002yje.1 chr21:13973492-13976330 contains gccccac; distance 90
uc002yje.1 chr21:13973492-13976330 contains agagggccc; distance 4
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] |
|
|
Thanks a million for your help. I hope I'll finish soon BioPerl installation.
Your script is great and it does the job I tried to get from mine. But I have one problem:
in the case when the gene doesn't contain any of motifs, the output looks like following:
NM_002232 range=chr1:111015833-111020178 contains ; distance 1014
So, as far as I understand it prints out the length of the whole region situated before the exon start point (all the small letters ). Actually this kind of information is not required in the task. How this can be skipped in the outpt?
Thank you once more for your great contribution into my biologilcal invstigations %)!!!
P.S:
this morning I have modified my script so, it almost does the job too, but there are still some problems to solve. But your support has helped me a lot! | [reply] [d/l] |
|
|
Never mind. I think I see the problem.
Your motifs file contains some blank lines. A blank line will become undef, and undef will match anything.
A simple modification to my code to deal with this is:
if( defined $motif and $seq =~ m[$motif([^A-Z]+)[A-Z]] ) {
It would perhaps be better to deal with this at source--ie. remove blank lines from the motifs file--but data is rarely perfect in the real world, so the above fix is as good a way of dealing with it as any.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] |
|
|
But I have one problem: in the case when the gene doesn't contain any of motifs, the output looks like following:
NM_002232 range=chr1:111015833-111020178 contains ; distance 1014
So, as far as I understand it prints out the length of the whole region situated before the exon start point (all the small letters ). Actually this kind of information is not required in the task. How this can be skipped in the outpt?
Hm. That shouldn't happen. It should only print something if a match is found.
Could you please post the relevant motif & sequence that correspond to the above output so that I can use it to debug my code?
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] |
Re: How to find any of many motifs?
by toolic (Bishop) on Jun 17, 2010 at 16:01 UTC
|
Regarding BioPerl, I have not used it, but you may be able to get more information by reading the Tutorial -> Perl and Bioinformatics
If you still want to pursue developing your own script, you need to provide more information:
- Post small, relevant samples of your input files, preferably less than 10 lines each.
- Post the actual output you get, including any errors or warnings.
- Post the output you want.
- use strict and warnings
The goal is to post something self-contained that anyone of us can download and run. | [reply] |
|
|
Thank you for your quick resonse
Here are some samples:
genes saved in sequence file look like this:
>hg18_refGene_NM_006511 range=chr1:15857951-15860804
cctagtcacctgctaattaatctttttcctttcccctgtgttccatatagagaaaagtggtaaagaatct
+acctcactgggaATGTCATCATTACCAACTTCAGATGGGTTTAACCATCC
>hg18_refGene_NR_027268 range=chr1:28294033-28296656
tttttttttttcttcttcttttttgtggagttgctttttttttttttcttcttttttgtgtcattgtttt
+tttaccatgtcctcagagtctggaattttacaaagtagttgctgagtaaacaggTAGATACAAATCAAG
+ATGAACTTTTCCGCAGGAGCTCTTATCATGAACAT
and the binding motifs:
>mot1
tttgtagatttt
>mot2
ggatcctttaag
>mot3
gtttaccccaa
| [reply] [d/l] |
|
|
Here are the actual sequnces I analyse:
for GENES:
>uc002yje.1 chr21:13973492-13976330
cccctgccccaccgcaccctggattactgcacgccaagaccctcacctga
acgcgccctacactctggcatgggggaacccggccccgcagagccctgga
CTCTGACATTGGAGGACTCCTCGGCTACGTCCTGGACTCCTGCACAAGAG
>uc002yje.1 chr21:13973492-13976330
cccctgccccaccgcaccctggattactgcacgccaagaccctcacctga
acgcgccctacactctggcatgggggaaaaaacccggccccgcagagccctgga
CTCTGACATTGGAGGACTCCTCGGCTACGTCCTGGACTCCTGCACAAGAG
>uc002yje.1 chr21:13973492-13976330
cccctgccccaccgcaccctggattactgcacgccaagaccctcacctga
acgcgccctacactctggcatgggggaacccggccccgcagagggccctgga
CTCTGACATTGGAGGACTCCTCGGCTACGTCCTGGACTCCTGCACAAGAG
for motifs:
>ucmotif_1
gccccac
>ucmotif_2
gggggaaaaaacc
>ucmotif_3
agagggccc
here is the output:
the distance is the following: 88
the distance is the following: 20
the distance is the following: 4
So, now the program searches for the first element in the first gene and for the second in the second and so on. I wish that it could find any of motifs for each gene if they present in it and count the length between the motif an exon start point (starts with capital letters).
| [reply] [d/l] |
|
|
Re: How to find any of many motifs?
by biohisham (Priest) on Jun 17, 2010 at 16:55 UTC
|
Hello Nikulina, welcome to Perl Monks, take time trying to understand the requirements for BioPerl installation. I haven't tried CPAN manual installation for BioPerl since I find the PPM quite a stable one. Now, since you have added all the repositories mentioned in the table that is quite it!. Searching for All the packages from PPM would not yield BioPerl, since it is a Perl non-core module, so you need to search for it in the repositories you have connected to, AND then INSTALL it for it to be added to the modules you currently have. This in itself is not a big task if you are spending patient time understanding it...
Check installation instructions, it got some links that are useful..
In brief, after you have gotten familiar with what installation requirements are needed for your environment, invoke the PPM manager in shell mode by typing ppm-shell at your command prompt, then type search BioPerl, check if the search returned "Bundle-BioPerl" after that just type install Bundle-BioPerl, if that is not the case, then make sure that you are connected to all the repositories you have added to your package manager, by typing repo list and see how many enabled repositories are there... and enabling every disabled one by typing repo on followed by the repository id number..
I find the PPM shell mode more robust than the GUI one..
if ( $seq =~ /[A-Z]/) is better written as if ( $seq =~ /[AGCT]/) ...
Excellence is an Endeavor of Persistence.
A Year-Old Monk :D .
| [reply] [d/l] [select] |
Re: How to find any of many motifs?
by biohisham (Priest) on Jun 17, 2010 at 19:14 UTC
|
Using a BioPerl module to parse sequences files lifts away the burden of having to manually detect what constitutes a sequence start and end for many of the available formats, also, it takes care of handling the sequence objects and IO tasks efficiently and succinctly.
While learning Perl, make sure you are learning it the right way by building on the good habits that would at the end lead you to becoming an established hacker, great emphasis is there to always
use strict;
use warnings;
At the top of your programs, these stricture would enable you to save time debugging by giving you warnings about potential pitfalls and enforcing rules like "variable localization and scoping", that would really save you from a lot of errors. Also, make sure that your variables are named in a self-descriptive fashion and that your code is indented in a readable way.
Here is how you can perform the same job by using the library Bio::SeqIO from the BioPerl suite.
use strict;
use warnings;
use Bio::SeqIO;
my $seq_file = 'D:/Motif Search/sequence.txt';
my $mot_file = 'D:/Motif Search/motif.txt';
my $in = Bio::SeqIO->new( #sequences object
-format => 'fasta',
-file => $seq_file,
);
my $mot = Bio::SeqIO->new( #motifs object
-format => 'fasta',
-file => $mot_file
);
my @motifs;
while ( my $motif = $mot->next_seq ) { #read in motifs
my $motif_string = $motif->seq; #stringify motifs
push @motifs, $motif_string;
}
while ( my $seq = $in->next_seq ) {
my $string = $seq->seq;
foreach my $motif (@motifs) {
if ( $string =~ /$motif([agct]+)[AGCT]/ ) {
print "'$motif' ", length $1, " bases away at ", $seq->id,
"\n";
}
}
}
Have a great Perl journey ...
Excellence is an Endeavor of Persistence.
A Year-Old Monk :D .
| [reply] [d/l] [select] |
Re: How to find any of many motifs?
by llancet (Friar) on Jun 18, 2010 at 13:56 UTC
|
Bio::SeqIO;
Bio::PrimarySeqI;
index($str,$substr,$start_position);
| [reply] [d/l] |
Re: How to find any of many motifs?
by llancet (Friar) on Jun 22, 2010 at 10:52 UTC
|
Ok, a more detailed hint:
1: read the motif file using Bio::SeqIO, and store those motif in a hash.
2: traverse through the target sequence file usingBio::SeqIO. On each target seq, use index on each motif to find whether the seq has the motif.
use strict;
use Bio::SeqIO;
my $file_seq = shift;
my $file_motif = shift;
# read all motif
# store in hash: ID => motif_seq
my %all_motif;
my $MOTIF_FH = Bio::SeqIO->new(-file=>$file_motif);
while (my $obj = $MOTIF_FH->next_seq) {
$all_motif{$obj->display_id} = $obj->seq;
}
$MOTIF_FH->close;
# traverse through all sequences
my $SEQ_FH = Bio::SeqIO->new(-file=>$file_seq);
while (my $obj = $SEQ_FH->next_seq) {
my $id = $obj->display_id;
my $seq = $obj->seq;
print "Sequence: $id\n";
# traverse through all motifs
foreach my $motif_id (sort keys %all_motif) {
my @curr_result;
# find all positions of current motif
my $last_i = 0;
while (1) {
my $i = index($seq,$all_motif{$motif_id},$last_i);
last if $i==-1;
push @curr_result,$i+1;
$last_i = $i;
}
# print if has result
if (@curr_result>0) {
print "\tmatch with $motif_id at: @curr_result\n";
}
}
}
$SEQ_FH->close;
| [reply] [d/l] |