Nikulina has asked for the wisdom of the Perl Monks concerning the following question:

Good day! I am just starting my adventures in perl and stongly need your help. At the moment I am busy with mapping of certain motifs in human genes. There are some genes saved in the file called sequence.txt (in FASTA format) and some amount of binding sites in motif.txt. The thing I want to do is to count the length of the fragment for every gene if it has any of the binding sites from motif.txt. This current script does not work in the way I would like it. How should it be changed?
$string_filename = 'sequence.txt'; open(FILE, $string_filename) || die("Couldn't read file $string_filena +me\n"); $motif_filename = 'motif.txt'; open(MOTIF, $motif_filename) || die("Couldn't read file $motif_filenam +e\n"); local $/ = "\n>"; while (my $seq = <FILE>) { chomp $seq; $seq =~ s/^>*.+\n//; $seq =~ s/\n//g; $R = length $seq; $motif = <MOTIF>; chomp $motif; $motif =~ s/^>*.+\n//; $motif =~ s/\n//g; if ( $seq =~ /$motif/ ) { ## insert actual binding site $M = $'; $W = length $M; if ( $seq =~ /[A-Z]/) { ## exon start $K = $`; $Z = length $K; $x = $W + $Z - $R; print "\n\ the distance is the following: $x\n\n"; } else { print "\n\ I couldn't find the start codon.\n\n"; } } else { print "\n\ I couldn't find the binding site.\n\n"; } } close MOTIF; close FILE; exit;
Struggling this problem I recognised that it can be solved better using BioPerl module. I read carefully the installation instructions for Windows here http://www.bioperl.org/wiki/Installing_Bioperl_on_Windows#Installation_using_CPAN_or_manual_installation. It seems to be rather simple bur I failed plenty of times trying to do it. I used PPM in ActivePerl 5.12. I have added all the repository listed in the table. And when I select View >> All Packages there is no bioperl available. So this is a problem and I can't recognise what I am doing wrong. Will you be so kind to suggest we the way to modify my perl script or help with BioPerl installation? Thanks a lot!

Replies are listed 'Best First'.
Re: How to find any of many motifs?
by BrowserUk (Patriarch) on Jun 17, 2010 at 18:07 UTC

    While you wait for BioPerl to download and install, and then work out how to use it, this might do the job :)

    #! perl -slw use strict; open MOTIFS, '<', '845226.motif' or die $!; my @motifs = <MOTIFS>; close MOTIFS; chomp @motifs; open SEQS, '<', '845226.fasta' or die $!; while( my $seqID = <SEQS> ) { chomp $seqID; $seqID =~s[^>][]; my $seq = ''; { local $/ = ">"; chomp( $seq .= <SEQS> ); $seq =~ tr[\n][]d; } for my $motif ( @motifs ) { if( $seq =~ m[$motif([^A-Z]+)[A-Z]] ) { print "$seqID contains $motif; distance ", length( $1 ); } } } close SEQS; __END__ C:\test>845226 uc002yje.1 chr21:13973492-13976330 contains gccccac; distance 88 uc002yje.1 chr21:13973492-13976330 contains gccccac; distance 92 uc002yje.1 chr21:13973492-13976330 contains gggggaaaaaacc; distance 2 +0 uc002yje.1 chr21:13973492-13976330 contains gccccac; distance 90 uc002yje.1 chr21:13973492-13976330 contains agagggccc; distance 4

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Thanks a million for your help. I hope I'll finish soon BioPerl installation. Your script is great and it does the job I tried to get from mine. But I have one problem: in the case when the gene doesn't contain any of motifs, the output looks like following:
      NM_002232 range=chr1:111015833-111020178 contains ; distance 1014
      So, as far as I understand it prints out the length of the whole region situated before the exon start point (all the small letters ). Actually this kind of information is not required in the task. How this can be skipped in the outpt? Thank you once more for your great contribution into my biologilcal invstigations %)!!! P.S: this morning I have modified my script so, it almost does the job too, but there are still some problems to solve. But your support has helped me a lot!

        Never mind. I think I see the problem.

        Your motifs file contains some blank lines. A blank line will become undef, and undef will match anything.

        A simple modification to my code to deal with this is:

        if( defined $motif and $seq =~ m[$motif([^A-Z]+)[A-Z]] ) {

        It would perhaps be better to deal with this at source--ie. remove blank lines from the motifs file--but data is rarely perfect in the real world, so the above fix is as good a way of dealing with it as any.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
        But I have one problem: in the case when the gene doesn't contain any of motifs, the output looks like following:
        NM_002232 range=chr1:111015833-111020178 contains ; distance 1014

        So, as far as I understand it prints out the length of the whole region situated before the exon start point (all the small letters ). Actually this kind of information is not required in the task. How this can be skipped in the outpt?

        Hm. That shouldn't happen. It should only print something if a match is found.

        Could you please post the relevant motif & sequence that correspond to the above output so that I can use it to debug my code?


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: How to find any of many motifs?
by toolic (Bishop) on Jun 17, 2010 at 16:01 UTC
    Regarding BioPerl, I have not used it, but you may be able to get more information by reading the Tutorial -> Perl and Bioinformatics

    If you still want to pursue developing your own script, you need to provide more information:

    • Post small, relevant samples of your input files, preferably less than 10 lines each.
    • Post the actual output you get, including any errors or warnings.
    • Post the output you want.
    • use strict and warnings

    The goal is to post something self-contained that anyone of us can download and run.

      Thank you for your quick resonse Here are some samples:
      genes saved in sequence file look like this: >hg18_refGene_NM_006511 range=chr1:15857951-15860804 cctagtcacctgctaattaatctttttcctttcccctgtgttccatatagagaaaagtggtaaagaatct +acctcactgggaATGTCATCATTACCAACTTCAGATGGGTTTAACCATCC >hg18_refGene_NR_027268 range=chr1:28294033-28296656 tttttttttttcttcttcttttttgtggagttgctttttttttttttcttcttttttgtgtcattgtttt +tttaccatgtcctcagagtctggaattttacaaagtagttgctgagtaaacaggTAGATACAAATCAAG +ATGAACTTTTCCGCAGGAGCTCTTATCATGAACAT and the binding motifs: >mot1 tttgtagatttt >mot2 ggatcctttaag >mot3 gtttaccccaa
        Here are the actual sequnces I analyse: for GENES:
        >uc002yje.1 chr21:13973492-13976330 cccctgccccaccgcaccctggattactgcacgccaagaccctcacctga acgcgccctacactctggcatgggggaacccggccccgcagagccctgga CTCTGACATTGGAGGACTCCTCGGCTACGTCCTGGACTCCTGCACAAGAG >uc002yje.1 chr21:13973492-13976330 cccctgccccaccgcaccctggattactgcacgccaagaccctcacctga acgcgccctacactctggcatgggggaaaaaacccggccccgcagagccctgga CTCTGACATTGGAGGACTCCTCGGCTACGTCCTGGACTCCTGCACAAGAG >uc002yje.1 chr21:13973492-13976330 cccctgccccaccgcaccctggattactgcacgccaagaccctcacctga acgcgccctacactctggcatgggggaacccggccccgcagagggccctgga CTCTGACATTGGAGGACTCCTCGGCTACGTCCTGGACTCCTGCACAAGAG for motifs: >ucmotif_1 gccccac >ucmotif_2 gggggaaaaaacc >ucmotif_3 agagggccc here is the output: the distance is the following: 88 the distance is the following: 20 the distance is the following: 4
        So, now the program searches for the first element in the first gene and for the second in the second and so on. I wish that it could find any of motifs for each gene if they present in it and count the length between the motif an exon start point (starts with capital letters).
Re: How to find any of many motifs?
by biohisham (Priest) on Jun 17, 2010 at 16:55 UTC
    Hello Nikulina, welcome to Perl Monks, take time trying to understand the requirements for BioPerl installation. I haven't tried CPAN manual installation for BioPerl since I find the PPM quite a stable one. Now, since you have added all the repositories mentioned in the table that is quite it!. Searching for All the packages from PPM would not yield BioPerl, since it is a Perl non-core module, so you need to search for it in the repositories you have connected to, AND then INSTALL it for it to be added to the modules you currently have. This in itself is not a big task if you are spending patient time understanding it...

    Check installation instructions, it got some links that are useful..

    In brief, after you have gotten familiar with what installation requirements are needed for your environment, invoke the PPM manager in shell mode by typing ppm-shell at your command prompt, then type search BioPerl, check if the search returned "Bundle-BioPerl" after that just type install Bundle-BioPerl, if that is not the case, then make sure that you are connected to all the repositories you have added to your package manager, by typing repo list and see how many enabled repositories are there... and enabling every disabled one by typing repo on followed by the repository id number..

    I find the PPM shell mode more robust than the GUI one..

    if ( $seq =~ /[A-Z]/) is better written as if ( $seq =~ /[AGCT]/) ...


    Excellence is an Endeavor of Persistence. A Year-Old Monk :D .
Re: How to find any of many motifs?
by biohisham (Priest) on Jun 17, 2010 at 19:14 UTC
    Using a BioPerl module to parse sequences files lifts away the burden of having to manually detect what constitutes a sequence start and end for many of the available formats, also, it takes care of handling the sequence objects and IO tasks efficiently and succinctly.

    While learning Perl, make sure you are learning it the right way by building on the good habits that would at the end lead you to becoming an established hacker, great emphasis is there to always

    use strict; use warnings;
    At the top of your programs, these stricture would enable you to save time debugging by giving you warnings about potential pitfalls and enforcing rules like "variable localization and scoping", that would really save you from a lot of errors. Also, make sure that your variables are named in a self-descriptive fashion and that your code is indented in a readable way.

    Here is how you can perform the same job by using the library Bio::SeqIO from the BioPerl suite.

    use strict; use warnings; use Bio::SeqIO; my $seq_file = 'D:/Motif Search/sequence.txt'; my $mot_file = 'D:/Motif Search/motif.txt'; my $in = Bio::SeqIO->new( #sequences object -format => 'fasta', -file => $seq_file, ); my $mot = Bio::SeqIO->new( #motifs object -format => 'fasta', -file => $mot_file ); my @motifs; while ( my $motif = $mot->next_seq ) { #read in motifs my $motif_string = $motif->seq; #stringify motifs push @motifs, $motif_string; } while ( my $seq = $in->next_seq ) { my $string = $seq->seq; foreach my $motif (@motifs) { if ( $string =~ /$motif([agct]+)[AGCT]/ ) { print "'$motif' ", length $1, " bases away at ", $seq->id, "\n"; } } }

    Have a great Perl journey ...

    Excellence is an Endeavor of Persistence. A Year-Old Monk :D .
Re: How to find any of many motifs?
by llancet (Friar) on Jun 18, 2010 at 13:56 UTC
    relevant resources are:
    Bio::SeqIO; Bio::PrimarySeqI; index($str,$substr,$start_position);
Re: How to find any of many motifs?
by llancet (Friar) on Jun 22, 2010 at 10:52 UTC
    Ok, a more detailed hint:
    1: read the motif file using Bio::SeqIO, and store those motif in a hash.
    2: traverse through the target sequence file usingBio::SeqIO. On each target seq, use index on each motif to find whether the seq has the motif.
    use strict; use Bio::SeqIO; my $file_seq = shift; my $file_motif = shift; # read all motif # store in hash: ID => motif_seq my %all_motif; my $MOTIF_FH = Bio::SeqIO->new(-file=>$file_motif); while (my $obj = $MOTIF_FH->next_seq) { $all_motif{$obj->display_id} = $obj->seq; } $MOTIF_FH->close; # traverse through all sequences my $SEQ_FH = Bio::SeqIO->new(-file=>$file_seq); while (my $obj = $SEQ_FH->next_seq) { my $id = $obj->display_id; my $seq = $obj->seq; print "Sequence: $id\n"; # traverse through all motifs foreach my $motif_id (sort keys %all_motif) { my @curr_result; # find all positions of current motif my $last_i = 0; while (1) { my $i = index($seq,$all_motif{$motif_id},$last_i); last if $i==-1; push @curr_result,$i+1; $last_i = $i; } # print if has result if (@curr_result>0) { print "\tmatch with $motif_id at: @curr_result\n"; } } } $SEQ_FH->close;