http://qs1969.pair.com?node_id=288378


in reply to Re: Re: BioInformatics - polyA tail search
in thread BioInformatics - polyA tail search

Color me confused, but

a) I thought that genome sequences consisted of ACGT.

b) Your example sequences do not contain any 'N's.

?


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
If I understand your problem, I can solve it! Of course, the same can be said for you.

Re: Re: Re: Re: BioInformatics - polyA tail search
by MiamiGenome (Sexton) on Sep 02, 2003 at 18:36 UTC
    You are correct, but when the automated sequencers cannot unambiguously call a base as A, C, G, or T, they assign 'N'.

    My chosen example sequences were from the beginning of a sequence file. PolyA stretches are found at the end.

    Cheers, and thank you in advance!

      Something like this will get you started.

      perl -nle" print qq{$ARGV:($./$+[0]): $1} if m[([AN]{10,})]g;" file*

      This will print lines like

      filename:(10/50): ANNNANANAAN

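      where the first number is the line number and the second is the offset within the line at which the match ends ($+[0]; use $-[0] if you want the start). Note that $. keeps counting across files unless you add close ARGV if eof.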

      On Unix you need to swap the outer "s for 's, and under Win32 you would need to add BEGIN{ @ARGV=map{ glob } @ARGV } to expand the wildcard filespec supplied on the command line.

      See perlrun for the switches used, and perlre for the regex.
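If you would rather have this as a script than a one-liner, here is a minimal sketch of the same idea. The helper name find_polya_runs, the default minimum length of 10, and the sample sequence are all my own assumptions for illustration, not anything from the original post. It reports every run of A/N characters with its start and end offsets and flags a run that reaches the end of the sequence, since that is where the polyA tail is expected.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return [start, end, run] for
# every stretch of at least $min consecutive A or N characters in $seq.
sub find_polya_runs {
    my ($seq, $min) = @_;
    $min = 10 unless defined $min;
    my @runs;
    while ($seq =~ /([AN]{$min,})/g) {
        # $-[0] is the offset where the match starts,
        # $+[0] is the offset just past where it ends.
        push @runs, [ $-[0], $+[0], $1 ];
    }
    return @runs;
}

# Made-up sequence, for illustration only.
my $seq = 'GATTACAGGC' . 'AAAANAAAAAAA';

for my $run (find_polya_runs($seq)) {
    my ($start, $end, $text) = @$run;
    my $at_end = $end == length($seq) ? ' (at end of sequence)' : '';
    print "$start-$end: $text$at_end\n";
}
```

For real data you would first join the lines of each sequence record into one string, because a tail can span a line break in the file.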



Re: Re: Re: Re: BioInformatics - polyA tail search
by biosysadmin (Deacon) on Sep 03, 2003 at 09:27 UTC
    Genome sequences do consist of A's, C's, G's and T's, but sequencing a section of DNA can sometimes give ambiguous results, especially near the end of a sequence. If you can't tell which nucleotide occurred at a given position in the sequence, it's common practice to use N to denote an ambiguous base.

    Lately, sequencing has advanced to the point that while there is still ambiguity in sequencing, the sequencer can narrow down the possible bases. Each narrowed-down set is denoted by a one-letter code from the IUB ambiguity codes table, for example R for "A or G".
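To make the one-letter codes concrete, here is a small sketch of the standard IUB/IUPAC nucleotide table as a Perl hash, plus a reverse lookup that answers "the base is one of these, which code do I write?". The helper name iub_code is hypothetical and only there for illustration.

```perl
use strict;
use warnings;

# Standard IUB/IUPAC nucleotide codes and the bases each one stands for.
# The base lists are kept in alphabetical order so they can serve as keys.
my %IUB = (
    A => 'A',   C => 'C',   G => 'G',   T => 'T',
    R => 'AG',  Y => 'CT',  S => 'CG',  W => 'AT',
    K => 'GT',  M => 'AC',
    B => 'CGT', D => 'AGT', H => 'ACT', V => 'ACG',
    N => 'ACGT',
);

# Reverse table: sorted set of bases -> one-letter code.
my %CODE_FOR = reverse %IUB;

# Hypothetical helper: given the bases a call could be, return the code
# that denotes exactly that set (undef if the set is not in the table).
sub iub_code {
    my @bases = @_;
    my %seen;
    my $key = join '', sort grep { !$seen{$_}++ } map { uc } @bases;
    return $CODE_FOR{$key};
}

print iub_code(qw(G A)), "\n";        # purine: A or G
print iub_code(qw(A C G T)), "\n";    # nothing ruled out
```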

    So, like anything in biology there is ambiguity (sometimes). One might even go as far as to say that there is ambiguity about when there is ambiguity, because most sequences that I work on don't contain ambiguous bases. However, the concept of ambiguity can be very useful in a laboratory setting for reasons that are way beyond the scope of this discussion.

    Hopefully this cleared up some of the confusion. :)

      Another important use of ambiguity is in searching: sometimes the pattern you search for is itself ambiguous. So even though the raw DNA only contains {GCAT}, sometimes you are looking for sequences that, at some position, can have any base from a restricted set like {GC} or {AT}.
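One way to do that kind of search in Perl is to translate the ambiguous query into a regex, turning each ambiguity code into a character class. This is only a sketch under my own assumptions: iub_to_regex is a made-up name, the query GSWC and the test sequences are invented, and the table is the standard IUB/IUPAC one.

```perl
use strict;
use warnings;

# Standard IUB/IUPAC nucleotide codes and the bases each one stands for.
my %IUB = (
    A => 'A',   C => 'C',   G => 'G',   T => 'T',
    R => 'AG',  Y => 'CT',  S => 'CG',  W => 'AT',
    K => 'GT',  M => 'AC',
    B => 'CGT', D => 'AGT', H => 'ACT', V => 'ACG',
    N => 'ACGT',
);

# Hypothetical helper: turn an IUB-coded query into a compiled regex in
# which every ambiguity code becomes a character class, e.g. S -> [CG].
sub iub_to_regex {
    my ($query) = @_;
    my $pattern = join '', map {
        my $bases = $IUB{ uc $_ };
        die "Unknown IUB code '$_'\n" unless defined $bases;
        length($bases) > 1 ? "[$bases]" : $bases;
    } split //, $query;
    return qr/$pattern/i;    # case-insensitive, not anchored
}

# G, then G or C, then A or T, then C.
my $re = iub_to_regex('GSWC');

for my $seq (qw(GGAC GCTC GATC)) {
    printf "%s %s\n", $seq, $seq =~ $re ? 'matches' : 'does not match';
}
```

Note that this matches an ambiguous query against unambiguous sequence. If the sequence itself may contain codes such as N, you would have to decide whether those should count as possible matches and widen the character classes accordingly.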