MaroonBalloon has asked for the wisdom of the Perl Monks concerning the following question:

Dear computer warriors, EDIT: Sorry, I am new to perl so I don't speak the language so fluently, but what I want to do is this.

I am using an organism called Klebsiella, Kleb for short.
There are many variants. Some are dangerous, and some are not.
I am looking at protein sequences for a protein we can just call protein A
from the dangerous ones (I have about 20 sequences of protein A for this set), as well as about 10 sequences for protein A in the non-dangerous ones.(I have about 30 sequences of protein A for this set).

From the command line I type: ./kleb.pl ATATATA(An amino acid sequence) THe output tells me something like:" There are 7/20 instances found in non-dangerous Kleb. There are 24/30 instances found in non-dangerous Kleb.
"
I will do some statistics on this as well.

What I am looking to do is automate this process so that PERL can merely give me a list of the most commonly found sequences in one or the other set for sequences greater than 4 characters long. It would also be nice to have the capability to do ATA??AA so that the matches don't have to be identical.

to answer bioHishHam, I will check out those modules, thanks!

To answer The Biology Man,
It doesn't matter if they are aligned, I just want to find these characters among them. I have very few sequences, and this data is somewhat confounded by the lineage versus function question, so I have to be able to look at these most common sequences and see if their presence is actually important or they are just a leftover that evoluation has not bothered to alter. In the abstract, a slider window is pretty much what I imagine vetting my sequences, but how to code for it is the problem. My problem is a real lack of perl experience.

The last comment would be helpful but the sequences are about 300 characters; I am looking to find small strings, 4,5,6,7 bp...

Replies are listed 'Best First'.
Re: FASTA riddle.
by biohisham (Priest) on Dec 11, 2009 at 01:32 UTC
    Thinking about it, an answer can be a really multi-approach one for Perl's text manipulation capabilities are many, though it seems you don't want scores or substitution matrices evaluation it'd still be better to not reinvent the wheel -unless you're practicing a concept- and try to look at available Sequence Alignment modules that can potentially provide this sort of task.

    Providing samples of sequences, expected output, and your code-efforts would greatly enhance the response you'd receive from the beloved Monks...


    Excellence is an Endeavor of Persistence. Chance Favors a Prepared Mind.
Re: FASTA riddle.
by GrandFather (Saint) on Dec 11, 2009 at 01:07 UTC

    Not all the monks are biologists so you may get more millage from your question if you give a tiny amount of sample data and show the results you expect to generate. It would help too if you showed us what you've tried.


    True laziness is hard work
Re: FASTA riddle.
by korpenkraxar (Sexton) on Dec 11, 2009 at 14:34 UTC
    I do quite a bit of bioinformatics programming in Perl and would consider myself fairly experienced on these kind of problems. Still, even I do not understand what you want to accomplish.

    Are your paired sequences already aligned with other software? Are they amino acid or nucleotide sequences?

    Are you aiming to compare amino acid usage/signatures among sequences using sliding windows of various lengths? Do you plan to have the slider partially overlap previous windows as it slides along the sequences or not?

    Could you please elaborate a bit on your problem and workflow and preferably provide some code for us to have a go at?

Re: FASTA riddle.
by korpenkraxar (Sexton) on Dec 12, 2009 at 16:38 UTC

    Thanks for the more thorough description of your problem :-)

    From your example, it seems like you are doing this a little bit backwards compared to what you seem to aim for: you provide arbitrary motifs on the command line and have the program search for and count them, whereas we suggest you scan the sequences and count statistics for all motifs in one go by looking at sub-segments along the sequences.

    Have a look at Perls substr function to extract only a certain number of characters (a motif/window) from the whole sequence. With substr you can also use offsets to walk along the sequence, with or without overlap between the windows.

    As the Anonymous Monk said, you can then use that short string as a hash key. The first time you find a particular morif you simply put that as the key into the hash and assign the value 1 to it. You then increment the value by 1 with each subsequent hit.

    Things get *much* more difficult if you want to accomplish fuzzy matching (ATA??AA), because then you need to cluster motifs based on similarity, specify cutoffs and use substitution models. That is beyond the scope of your original question. If you aim to do that, and already have a fairly good idea of which relaxed motifs you are searching for, then your original approach is easier because you can use regular expressions with wildcards to find matching motifs. As educated_foo said, you might be better off trying to find some program that does that already.

    I can wholeheartedly recommend BioPerl over at http://www.bioperl.org/ for bioinformatics programming of this kind as it abstracts away some of the repetitive low-level tasks (such as parsing a FASTA file) and speeds up development. You need to know or be willing to learn object-oriented programming to benefit from it.

    Best of luck! Keep us posted about your progress and do not hesitate to ask further questions.

      Thanks for all the help!
      Here is a simple coding question I suspect you may be able to help me with.
      Here is a slice from my first subroutine.
      In it, I am going to try to basically put the first line of each fasta sequence into a key, and the coding lines of the fastA sequences into the hash.

      I am trying to use the /$search/gi code near the bottom in order to find how many times my search sequences are found amongst ALL the sequences.
      AKA, if I search for the amino acid "R", it should come up 350 times. However, $hits is instead equal to 55.
      Basically it stops searching once a line is classified as containing "R", and gives me the number of lines which satisfy the condition, rather than the amount of times the search sequence is found.


      while (my $line =<INPUT>) { if ($line =~m/^>/) { if ($seq) { $o{$header} = "$seq"; } $count++; print "\n\n"; print "$line"; $header = "$line"; $seq=(); } else { chomp $line; print $line; $seq .= "$line"; if ($line =~/$search/gi) { $hits++; } } }
        Alternatively, Is it possible to look at the values of a hash and find out how many times my search string is present?
        #input = $search foreach my $key (keys %hash) { $searchline = $hash{$key}; if ($searchline =~ /$search/gi) { $contains ++; } }
        Because this seems also to stop counting once it finds the FIRST $search, making the resulting "$contains" merely equal to the number of values that happen to contain at least one of the $search strings....
Re: FASTA riddle.
by educated_foo (Vicar) on Dec 12, 2009 at 14:08 UTC
    What you're doing is "motif finding," a well-studied and difficult problem. I suggest finding a software package or published algorithm to do it for you.
Re: FASTA riddle.
by Anonymous Monk on Dec 11, 2009 at 15:41 UTC
    Why not just sort the hash by the value of each key (the sequence) to find out how many there are?