http://qs1969.pair.com?node_id=288376


in reply to Re: Re: BioInformatics - polyA tail search
in thread BioInformatics - polyA tail search

This is completely untested, but you can use this as a start. Warning: This could be memory intensive for large files!
# Open the file and slurp the contents into a string.
open(my $fh, '<', 'File_To_Read')
    or die "Cannot open 'File_To_Read' for reading: $!\n";
my $file = do { local $/; <$fh> };
close $fh;

# Remove all the characters we don't care about.
$file =~ s/[^ANGTC]//g;

# Walk through the string, looking for matches.
while ($file =~ /([AN]{10})/g) {
    print "$1\n";
}

You're going to have to add the loop around the files, add any letters you want to be allowed into the substitution, etc. You're also going to have to add handling if you don't want to see overlapping sequences. Good luck!

------
We are the carpenters and bricklayers of the Information Age.

The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6

Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

Replies are listed 'Best First'.
Re: Re3: BioInformatics - polyA tail search
by fletcher_the_dog (Friar) on Sep 02, 2003 at 18:44 UTC
    The definition of a polyA tail describes it "as a string of length 10 or greater containing only 'A' or 'N'". If you erase all unwanted characters, then some 'A's and 'N's that weren't together before might come together. Also note that you probably want to match against [AN]{10,} so that if there are more than 10 A's or N's in a row the match does not fail. Also, MiamiGenome wanted the filenames. This modified version of your code might work a little better:
    # If your file extension is not .txt, change it to whatever is appropriate.
    while (my $filename = <*.txt>) {
        # Open the file and slurp the contents into a string.
        open(my $fh, '<', $filename)
            or die "Cannot open '$filename' for reading: $!\n";
        my $file = do { local $/; <$fh> };
        close $fh;

        # If a 'polyA' sequence is found, print the file name.
        if ($file =~ /[AN]{10,}/) {
            print "$filename has a polyA tail sequence\n";
        }
    }
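    To see how erasing characters first can join runs that were separate in the file, here is a small sketch. The sample string and the helper name has_polya are made up for illustration; the substitution is the one from the parent post.

```perl
use strict;
use warnings;

# Hypothetical sample: two runs of five separated by a FASTA-style header line.
my $raw = "GTCAAAAA\n>seq2\nNNNNNGTC";

# Returns 1 if the string contains a run of 10 or more A/N characters, else 0.
sub has_polya {
    my ($seq) = @_;
    return $seq =~ /[AN]{10,}/ ? 1 : 0;
}

# Strip everything except A, N, G, T, C, as in the parent post.
(my $stripped = $raw) =~ s/[^ANGTC]//g;

print "raw:      ", has_polya($raw), "\n";       # runs of 5 are still separate
print "stripped: ", has_polya($stripped), "\n";  # runs of 5 are now adjacent
```

    Whether joining across plain line breaks is wanted depends on the file format, so this only shows the risk, not a fix.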
      Also note that you probably want to match against [AN]{10,} so that if there are more than 10 A's or N's in a row the match does not fail.
      If there are more than ten, then {10} will match just fine.
        I wrote a little test script to test if you were right (and you were), so my question is what use is the upper range indicator? I thought it allowed you to limit the number of times that something matched, but apparently it does not.
        #!/usr/bin/perl
        use strict;

        my $seq = "ANANNNNANANANANANANANANANANA";
        if ($seq =~ /[AN]{10,11}?/) {
            print "I matched\n";
        }
        else {
            print "I did not match!\n";
        }

        __OUTPUT__
        I matched
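        The upper bound does cap how many characters a single match consumes, but an unanchored pattern is free to match a piece of a longer run, which is why the test above still succeeds. A minimal sketch, assuming the goal is to accept only complete runs of 10 or 11 characters; the lookarounds and the second sample string are additions for illustration, not part of the original script.

```perl
use strict;
use warnings;

my $seq = "ANANNNNANANANANANANANANANANA";   # one long run of A/N

# Unanchored: the quantifier limits the match length, not the run length.
my ($greedy) = $seq =~ /([AN]{10,11})/;    # consumes 11 characters
my ($lazy)   = $seq =~ /([AN]{10,11}?)/;   # consumes 10 characters
print length($greedy), " ", length($lazy), "\n";

# Lookarounds require that no A/N sits immediately before or after the match,
# so only a complete run of 10 or 11 characters can match.
my $bounded = qr/(?<![AN])[AN]{10,11}(?![AN])/;
print "long run:  ", ($seq =~ $bounded ? "match" : "no match"), "\n";
print "short run: ", ("GGANANANANANGG" =~ $bounded ? "match" : "no match"), "\n";
```

        Anchoring with ^ and $ would work the same way when the whole string is supposed to be the run.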