hellworld has asked for the wisdom of the Perl Monks concerning the following question:

Greetings, I'm working on a homework assignment and I can't figure out the part where I'm supposed to print all the genes in a text file. The genes start with ATG and end with TGA, TAG or TAA. The genes are also supposed to come after a TATAAT string. I've searched the file I'm supposed to read and found two genes, but the code I've written doesn't print them out. Also, between the TATAAT string and the ATG start codon, at least one of the letters A, C, G, T is present; the use of acgtACGT+ is for that. My code is:
$filename = "input.txt";
open (IN, $filename) or die "Can't open file $filename : $! ";
$text = "";
while($line = <IN>)
{
    $text .= $line;
}
while($text =~ m/(TATAAT[ACGT]+(ATG[ACGT]+(TGA|TAG|TAA)))/g)
#we check + for DNA Strands along with TATA box
#promoters..(TATAAT string)
{
    print "$2\n";
}
Any info would be helpful, thanks.

Replies are listed 'Best First'.
Re: search and print in perl
by CountZero (Bishop) on Jun 01, 2009 at 11:53 UTC
    For a start: put use strict; use warnings; at the beginning of each and every Perl script you write. It will save you much time and annoyance.

    Next, don't use global variables if you can avoid it. Lexical variables (those that start with "my") are the way forward.

    The open operator is best used with lexical variables and with 3 arguments:

    my $filename = 'input.txt';
    open (my $IN, '<', $filename) or die "Can't open file $filename : $! ";
    (BTW: the text input.txt does not need to be interpolated, so put it in single quotes.)

    Now you have to start reading in the contents of the file, and much will depend on the format of your input.txt file. I'm not sure that it is a good idea to concatenate the whole file into one scalar while still keeping the end-of-line characters. Perhaps you can show us a small excerpt of your input.txt file? If you do that, we can continue helping you.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      Is each "line" (defined as anything which is ended by \n) one gene? Or do you have to combine multiple lines to "make-up" one gene?

      CountZero


        No. A gene is like this: it is preceded by the string TATAAT, and after this string there can be one or more of the letters A, C, G, T. Then the string ATG follows, then again an arbitrary number of A, C, G, T letters, and the gene ends with one of the strings TAA, TGA or TAG. For example, a line is TATAATATTACAATGGATCATACAGTTAG ... our gene is the part from ATG to TAG (ATGGATCATACAGTTAG here), but we also have to make sure it is preceded by a TATAAT. I have to print out the genes in the txt file according to these rules.
      Thanks. My input.txt looks like the following: TGGTACGACCGAACGAAAGAAAAAGAACACACACTGACCGGAGGGGTTGAATTGTTTGCCTGGCAC It goes on like this, random ACGT characters for a whole 6 lines.

        Could you provide a sample of a subsequence that is supposed to match (there is no TATAAT in this sample)? Also, may the subsequences in question be split across more than one line? If so, you would need to chomp the input lines to get rid of the newlines in your $text (as you're not accounting for them in the regex).

        Also, if your input is then all on one line, you probably want non-greedy matching ([ACGT]+?) of the parts in between the strings of interest, or else they'll gobble up more than you want...
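To illustrate the greedy versus non-greedy point, here is a minimal sketch. The input string is made up for the illustration (it is not from the poster's file), and the pattern is the one from the question with the inner group made non-capturing; treat both as assumptions.

```perl
use strict;
use warnings;

# Hypothetical one-line input holding two promoter/gene pairs back to back.
my $dna = 'TATAATCCATGAAATAGGGTATAATCCATGCCCTGA';

# Greedy: [ACGT]+ runs as far as it can, so the first TATAAT pairs up
# with the LAST usable ATG and the first gene is swallowed.
my @greedy = $dna =~ /TATAAT[ACGT]+(ATG[ACGT]+(?:TGA|TAG|TAA))/g;

# Non-greedy: [ACGT]+? stops at the first ATG and the first stop codon.
my @lazy = $dna =~ /TATAAT[ACGT]+?(ATG[ACGT]+?(?:TGA|TAG|TAA))/g;

print "greedy: @greedy\n";   # greedy: ATGCCCTGA
print "lazy:   @lazy\n";     # lazy:   ATGAAATAG ATGCCCTGA
```

Note that the non-greedy version stops at the first stop codon it meets, regardless of reading frame; whether that is what the homework wants is not stated in the thread.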

        it goes on like this, Random ACGT characters for whole 6 lines

        If the input is multi-line and you have to match your regexp against the full input (all lines concatenated), then you should strip off the newline characters of each individual line:

        while($line = <IN>) {
            chomp $line;  ## <---- !!!
            $text .= $line;
        }
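A small sketch of why the chomp matters, assuming a gene that is wrapped across two lines. The sample text is invented, and an in-memory filehandle stands in for input.txt so the snippet runs on its own; the non-greedy pattern is likewise an assumption, not the poster's exact regex.

```perl
use strict;
use warnings;

# Hypothetical sample: one gene broken across two lines.
my $sample = "GGTATAATCCATGAAA\nCCCTAGGG\n";

# In-memory filehandle, used here instead of opening input.txt.
open my $in, '<', \$sample or die "Can't open in-memory file: $!";

my ($raw, $joined) = ('', '');
while (my $line = <$in>) {
    $raw .= $line;       # newline kept, as in the original loop
    chomp $line;
    $joined .= $line;    # newline removed
}
close $in;

my $re = qr/TATAAT[ACGT]+?(ATG[ACGT]+?(?:TGA|TAG|TAA))/;
my @from_raw    = $raw    =~ /$re/g;
my @from_joined = $joined =~ /$re/g;

print "raw:    ", scalar(@from_raw), " match(es)\n";   # raw:    0 match(es)
print "joined: @from_joined\n";                        # joined: ATGAAACCCTAG
```

The newline sits inside the gene, and [ACGT] cannot match it, so the unchomped text yields nothing.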

        citromatik

Re: search and print in perl
by Bloodnok (Vicar) on Jun 01, 2009 at 11:59 UTC
    ...working on a homework... - $honesty++.

    Some sample data, outlining both matching and non-matching cases, would help. IMO, your description doesn't quite cut it for me: I'm barely a proficient programmer and I'm most certainly not a geneticist. E.g. in this case, is a gene represented by a single char, a sequence of chars, or both?

    That being said...

    1. This:

    .
    .
    $text = "";
    while($line = <IN>)
    {
        $text .= $line;
    }
    .
    .
    is more usually (and in most cases, better) written as...
    local $/; # Ensure line-ends are ignored
    $text = <IN>;
    .
    .
    2. AFAICT i.e. subject to further details being provided, your RE appears to only capture start & end delimiters.
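A hedged sketch of the slurp idiom from point 1, with the local confined to a do block so $/ is restored afterwards. One thing worth knowing: slurping reads the whole file in one go, but the newlines are still in the text, so they are stripped separately here. The sample data and the in-memory filehandle are assumptions made so the snippet runs without input.txt.

```perl
use strict;
use warnings;

# Hypothetical two-line sample standing in for input.txt.
my $sample = "TATAATCCATGAAA\nCCCTAGGG\n";
open my $in, '<', \$sample or die "Can't open in-memory file: $!";

my $text = do {
    local $/;    # undef only inside this block, restored on exit
    <$in>;       # one read returns the whole file
};
close $in;

my $newlines = ($text =~ tr/\n//);   # count newlines, text unchanged
$text =~ tr/\n\r//d;                 # now delete the line ends

print "removed $newlines newline(s): $text\n";
# removed 2 newline(s): TATAATCCATGAAACCCTAGGG
```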

    A user level that continues to overstate my experience :-))
      Thanks. A gene is like this: it is preceded by the string TATAAT, and after this string there can be one or more of the letters A, C, G, T. Then the string ATG follows, then again an arbitrary number of A, C, G, T letters, and the gene ends with one of the strings TAA, TGA or TAG. For example, a gene is TATAATATTACAATGGATCATACAGTTAG ... our gene is the part from ATG to TAG, but we also have to make sure it is preceded by a TATAAT. I have to print out the genes in the txt file according to these rules.
        Assuming a definition per line i.e. not split over multiple lines, then...
        use warnings;
        use strict;

        local $/;
        my $data = <DATA>;

        while ($data =~ /TATAAT[ACGT]+ATG([ACGT]+)(?:T(?:GA|AA|AG))/cgs) {
            warn $1;
        }
        __DATA__
        TATAATATTACAATGGATCATACAGTTAG
        TATAATATTACAATGGATCATACAGTTAG
        TATAATATT
        ACAATGGATCATACAGTTAG
        produces:
        $ perl tst.pl
        GATCATACAGT at tst.pl line 8, <DATA> chunk 1.
        GATCATACAGT at tst.pl line 8, <DATA> chunk 1.
        $
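The code above prints only the letters between ATG and the stop codon. Going by the original poster's description, the gene seems to include both codons (ATGGATCATACAGTTAG in the example), so a variant that captures the whole gene might look like the sketch below. The non-greedy quantifiers are an assumption on my part, and the sample line is the one from the thread.

```perl
use strict;
use warnings;

# Sample line taken from the thread.
my $data = "TATAATATTACAATGGATCATACAGTTAG\n";

my @genes;
while ($data =~ /TATAAT[ACGT]+?(ATG[ACGT]+?(?:TGA|TAG|TAA))/g) {
    push @genes, $1;    # whole gene, start and stop codon included
}

print "$_\n" for @genes;    # ATGGATCATACAGTTAG
```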
Re: search and print in perl
by wol (Hermit) on Jun 01, 2009 at 14:13 UTC
    Homework.

    Genetic engineering.

    Disturbing?

    --
    use JAPH;
    print JAPH::asString();