tej has asked for the wisdom of the Perl Monks concerning the following question:

This node falls below the community's threshold of quality. You may see it by logging in.

Re: Recognize DNA and amino acid sequence
by ww (Archbishop) on Apr 23, 2011 at 01:59 UTC
    ...AND will dna and AMINO ACID SEQUENCES BE the only things in ALL CAPITAL letters in your data? What does it mean to "tag it"? How is your table structured?

    Perhaps those questions illustrate the inadequacy of your specification.

    Perhaps even more important, read about BioPerl, FASTA, and the vast array of other related options, many of which can be found using Google, site:CPAN, and appropriate search terms.

    If you're still stuck after that, c'mon back with a more detailed question about some particular part of your project where you still have a problem. We're here to help you learn how to use Perl and its modules, but we need a manageable (and well specified) problem... and greatly prefer to see your efforts and help you overcome stumbling blocks than to simply hand you a solution.

Re: Recognize DNA and amino acid sequence
by John M. Dlugosz (Monsignor) on Apr 23, 2011 at 01:49 UTC
    First read the table, or a portion of it. It's easiest if you are sure the sequence of interest won't cross a chunk boundary.

    Then use Perl's pattern matching to find each sequence of interest. By "tagging", do you mean inserting a comment at the point where the sequence was found? Use search-and-replace for that. Matching and replacing is a fundamental Perl primitive and one of the hallmarks of the language.

    Write the results out.
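A minimal sketch of the read / match-and-replace / write steps above. The `<seq>...</seq>` tag, the 10-letter minimum, the `tag_sequences` name, and the sample table are all assumptions made for illustration, since the thread never defines what "tagging" means or how the table is laid out.

```perl
use strict;
use warnings;

# Wrap every run of 10+ DNA letters in a hypothetical <seq>...</seq> tag.
# Both the tag and the minimum length are assumptions, not from the thread.
sub tag_sequences {
    my ($line) = @_;
    $line =~ s{\b([ACGTN]{10,})\b}{<seq>$1</seq>}g;
    return $line;
}

# Sample data standing in for the table; swap in a real file handle.
my $table = "id1\tACGTACGTACGTAC\tsome note\nid2\tNONE\tshort ACGT\n";

open my $in, '<', \$table or die "Cannot open input: $!";
my $output = '';
while (my $line = <$in>) {    # one line at a time, so no match spans a chunk
    $output .= tag_sequences($line);
}
close $in;

# Write the results out.
print $output;
```

Reading line by line only works if each sequence sits on a single line; a sequence wrapped across lines would need to be joined before matching.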

    BTW, I've heard of other people using Perl for DNA stuff, so look in CPAN and see if what you want is already there.

Re: Recognize DNA and amino acid sequence
by InfiniteSilence (Curate) on Apr 23, 2011 at 17:22 UTC

    Solution in three easy steps:

    • Uno: I think you are going to need to clearly describe what you mean by a sequence. For argument's sake I'll say you mean something like this: a run of capital letters drawn from AGCTURYKMSWBDHVN, one after another, followed by a single whitespace character (I borrowed this from here).
    • Dos: You may run into some problems using Perl with extremely large files. Try reading up more about this so you can divide up your problem (either the files themselves, rewriting some things in C and using XS, etc.). A really simple example using the file format from the previous link is here:
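      use strict;
      use warnings;
      my $seqNum = 0;
      my %sequences = ();
      # Lexical filehandle and three-argument open
      open(my $fh, '<', $ARGV[0]) or die $!;
      while (<$fh>) {
          # Collect every run of nucleotide codes on the line
          while (m/\b([AGCTURYKMSWBDHVN]+)\b/g) {
              $sequences{++$seqNum} = $1;
          }
      }
      close($fh);
      for (sort { $a <=> $b } keys %sequences) {
          print qq|$_\t$sequences{$_}\n|;
      }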
    • Tres: Here is the kicker. Just because these letters satisfy the regex doesn't mean that they are necessarily valid sequences. You will need to compare them against a sequence database using a search tool like BLAST. There are modules written in Perl to perform such searches, but you should first become acquainted with BioPerl, a suite of tools built specifically for these kinds of problems.
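As a rough illustration of why a regex match alone proves little, here is a sketch that guesses whether a candidate token is nucleotide or protein purely from its alphabet. The `guess_type` name and the three labels are made up for this example. The heuristic is deliberately naive: a short peptide made only of A, C, G and T would be reported as nucleotide, so treat the result as a first filter before a real database search, not as validation.

```perl
use strict;
use warnings;

# Guess a sequence type from its letters alone (hypothetical helper).
# Assumption: only unambiguous nucleotide letters plus N count as nucleotide.
sub guess_type {
    my ($seq) = @_;
    return 'nucleotide' if $seq =~ /\A[ACGTUN]+\z/;
    # The 20 standard one-letter amino acid codes.
    return 'protein'    if $seq =~ /\A[ACDEFGHIKLMNPQRSTVWY]+\z/;
    return 'unknown';
}

for my $candidate (qw(ACGTTGCA MKVLAAGIW HELLO)) {
    printf "%-10s %s\n", $candidate, guess_type($candidate);
}
```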

    Celebrate Intellectual Diversity

Re: Recognize DNA and amino acid sequence
by planetscape (Chancellor) on Apr 24, 2011 at 06:58 UTC