in reply to BioInformatics - polyA tail search

We are not all bioinformatics specialists, but we know Perl. Could you please let us know (1) what your data files contain, (2) what a polyA tail would look like in that data, (3) what you've tried so far.

How to ask questions the smart way.

[ e d @ h a l l e y . c c ]

Replies are listed 'Best First'.
Re: Re: BioInformatics - polyA tail search
by MiamiGenome (Sexton) on Sep 02, 2003 at 18:17 UTC
    I do apologize for the sparse details.

    A polyA tail can be defined as a string of length 10 or greater containing only 'A' or 'N'.

    For example,




    The files to search contain 80 character lines terminated with a control-M, as follows :


    Thus far, I have used only grep, but I think the control-M may be imbedded in some of the poly A stretches.

    Thank you again!

      This is completely untested, but you can use this as a start. Warning: This could be memory intensive for large files!
      # Open the file and slurp the contents to a string. open FILE, "File_To_Read" || die "Cannot open 'File_To_Read' for readi +ng: $!\n"; my $file = do { $\ = undef; <FILE> }; close FILE; # Remove all the characters we don't care about. $file =~ s/[^ANGTC]//g; # Walk through the string, looking for matches. while ($file =~ /[AN]{10}/g) { print "$1\n"; }

      You're going to have to add the loop around the files, add any letters you want to be allowed into the substitution, etc. You're also going to have to add handling if you don't want to see overlapping sequences. Good luck!

      We are the carpenters and bricklayers of the Information Age.

      The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6

      Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

        In the definition of a ployA tail it says "as a string of length 10 or greater containing only 'A' or 'N' if you erase all unwanted characters then some 'A's and 'N's that weren't together before might come together. Also note that you probably want to match against [AN]{10,} so that if there are more than 10 A's or N's in a row the match does not fail. Also, MiamiGenome wanted the filenames. This modified version of your code might work a little better:
        # if your file extension is not .txt change it to whatever is approria +te while (my $filename=<*.txt>){ # Open the file and slurp the contents to a string. open FILE, $filename || die "Cannot open '$filename' for reading: $! +\n"; my $file = do { $\ = undef; <FILE> }; close FILE; # If a 'polyA' sequence is found print the file name. if ($file =~ /[AN]{10,}/) { print "$filename has a polyA tail sequence\n"; } }

      Color me confused, but

      a) I thought that genome sequences consisted of ACGT.

      b) Your example sequences do not contain any 'N's.


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
      If I understand your problem, I can solve it! Of course, the same can be said for you.

        You are correct, but when the automated sequencers can not unambiguously choose the A,C,T,or G, they assign 'N'.

        My chosen example sequences were from the beginning of a sequence file. PolyA stretches are found at the end.

        Cheers, and thank you in advance!

        Genome sequence do consist of A's, C's, G's and T's, but sequencing a section of DNA can sometimes give ambiguous results, especially near the end of a sequence. If you can't tell what nucleotide occured at what position in the sequence, it's common practice to use N to denote an ambiguous base.

        Lately, sequencing has advanced to the point that while there is still ambiguity in sequencing, the sequencer can narrow down the possible bases. There is a way of denoting this ambiguity with a one-letter code from the IUB ambiguity codes table.

        So, like anything in biology there is ambiguity (sometimes). One might even go as far as to say that there is ambiguity about when there is ambiguity, because most sequences that I work on don't contain ambiguous bases. However, the concept of ambiguity can be very useful in a laboratory setting for reasons that are way beyond the scope of this discussion.

        Hopefully this cleared up some of the confusion. :)

      Are you sure that's what you want? You know your problem-space, so I'm just checking. There are three issues I see:

      1. Fasta doesn't have to be 80 characters per-line
      2. Fasta doesn't have to be ctrl-M delimited
      3. Are you sure you don't want to check strand identity (e.g. look for poly-T sequences)?

      Either way, the simple answer is to parse the file using SeqIO (BioPerl). Extract the sequence, then run a reg-ex to look for what you want. In your case it's $seq =~ /[AN]{10,}/ I think.

Re: Re: BioInformatics - polyA tail search
by Anonymous Monk on Sep 02, 2003 at 18:10 UTC
    If only there was a bioinformatics monks!