in reply to FASTA riddle.

Thanks for the more thorough description of your problem :-)

From your example, it seems like you are doing this a little bit backwards compared to what you seem to aim for: you provide arbitrary motifs on the command line and have the program search for and count them, whereas we suggest you scan the sequences and count statistics for all motifs in one go by looking at sub-segments along the sequences.

Have a look at Perls substr function to extract only a certain number of characters (a motif/window) from the whole sequence. With substr you can also use offsets to walk along the sequence, with or without overlap between the windows.

As the Anonymous Monk said, you can then use that short string as a hash key. The first time you find a particular morif you simply put that as the key into the hash and assign the value 1 to it. You then increment the value by 1 with each subsequent hit.

Things get *much* more difficult if you want to accomplish fuzzy matching (ATA??AA), because then you need to cluster motifs based on similarity, specify cutoffs and use substitution models. That is beyond the scope of your original question. If you aim to do that, and already have a fairly good idea of which relaxed motifs you are searching for, then your original approach is easier because you can use regular expressions with wildcards to find matching motifs. As educated_foo said, you might be better off trying to find some program that does that already.

I can wholeheartedly recommend BioPerl over at http://www.bioperl.org/ for bioinformatics programming of this kind as it abstracts away some of the repetitive low-level tasks (such as parsing a FASTA file) and speeds up development. You need to know or be willing to learn object-oriented programming to benefit from it.

Best of luck! Keep us posted about your progress and do not hesitate to ask further questions.

Replies are listed 'Best First'.
Re^2: FASTA riddle.
by MaroonBalloon (Acolyte) on Dec 12, 2009 at 22:59 UTC
    Thanks for all the help!
    Here is a simple coding question I suspect you may be able to help me with.
    Here is a slice from my first subroutine.
    In it, I am going to try to basically put the first line of each fasta sequence into a key, and the coding lines of the fastA sequences into the hash.

    I am trying to use the /$search/gi code near the bottom in order to find how many times my search sequences are found amongst ALL the sequences.
    AKA, if I search for the amino acid "R", it should come up 350 times. However, $hits is instead equal to 55.
    Basically it stops searching once a line is classified as containing "R", and gives me the number of lines which satisfy the condition, rather than the amount of times the search sequence is found.


    while (my $line =<INPUT>) { if ($line =~m/^>/) { if ($seq) { $o{$header} = "$seq"; } $count++; print "\n\n"; print "$line"; $header = "$line"; $seq=(); } else { chomp $line; print $line; $seq .= "$line"; if ($line =~/$search/gi) { $hits++; } } }
      Alternatively, Is it possible to look at the values of a hash and find out how many times my search string is present?
      #input = $search foreach my $key (keys %hash) { $searchline = $hash{$key}; if ($searchline =~ /$search/gi) { $contains ++; } }
      Because this seems also to stop counting once it finds the FIRST $search, making the resulting "$contains" merely equal to the number of values that happen to contain at least one of the $search strings....