in reply to search and print in perl

For a start: put use strict; use warnings; at the beginning of each and every Perl script you write. It will save you much time and annoyance.

Next, don't use global variables if you can avoid them. Lexical variables (those declared with "my") are the way forward.
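As a minimal sketch of what these two pieces of advice buy you: the helper strict_error and the variable names $count and $cuont are made up for this example, and the string eval is only there so that the compile-time error can be caught and printed instead of aborting the script.

```perl
use strict;
use warnings;

# Hypothetical helper: compile a snippet under strict and return the
# error message, or the empty string if it compiled and ran cleanly.
sub strict_error {
    my ($code) = @_;
    my $ok = eval "use strict; use warnings; $code; 1";
    return $ok ? '' : $@;
}

# A lexical variable declared with "my" is accepted.
my $good = strict_error('my $count = 1; $count++');

# A misspelled (hence undeclared) variable is rejected at compile time.
my $bad = strict_error('my $count = 1; $cuont++');

print "declared:   ", ($good eq '' ? 'ok' : $good), "\n";
print "misspelled: $bad";
```

Without use strict, the misspelled $cuont would silently become a new global variable and the typo would go unnoticed.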

The open operator is best used with lexical variables and with 3 arguments:

    my $filename = 'input.txt';
    open(my $IN, '<', $filename) or die "Can't open file $filename: $!";

(BTW: the text input.txt does not need to be interpolated, so put it in single quotes.)

Now you have to start reading in the contents of the file, and much will depend on the format of your input.txt file. I'm not sure that it is a good idea to concatenate the whole file into one scalar while still keeping the end-of-line characters. Perhaps you can show us a small excerpt of your input.txt file? If you do that, we can continue helping you.

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re^2: search and print in perl
by CountZero (Bishop) on Jun 01, 2009 at 12:05 UTC
    Is each "line" (defined as anything that is ended by \n) one gene? Or do you have to combine multiple lines to make up one gene?

    CountZero


      No. A gene looks like this: it is preceded by the string TATAAT, and after this string there can be one or more of the letters A, C, G, T. Then the string ATG follows, then again an arbitrary number of A, C, G, T characters, and the gene ends with one of the strings TAA, TGA or TAG. For example, in the line TATAATATTACAATGGATCATACAGTTAG our gene is the part from ATG through TAG (ATGGATCATACAGTTAG here), but we also have to make sure it is preceded by a TATAAT. I have to print out the genes in the txt file according to these rules.
        Incorporating citromatik's suggestion, we have now:
        use strict;
        use warnings;

        my $filename = 'input.txt';
        open(my $IN, '<', $filename) or die "Can't open file $filename: $!";

        my $text;    # no need to initialize it to the empty string!
        while (my $line = <$IN>) {
            chomp $line;    ## <---- !!!
            $text .= $line;
        }

        while ($text =~ m/TATAAT[ACGT]+?(ATG[ACGT]+?(?:TGA|TAG|TAA))/g) {
            print "$1\n";
        }

        I have added some non-capturing parentheses (the (?: ... ) around TGA|TAG|TAA) and, most importantly, made the + quantifiers non-greedy by adding a ?. Otherwise they would match too much and return only the last gene in your string.
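To see the difference the ? makes, here is a small sketch that runs both variants of the pattern over the same string. The helper find_genes and the test sequence are made up for illustration; only the non-greedy pattern is taken from the script above.

```perl
use strict;
use warnings;

# Return every gene found in $text by the given pattern
# (the pattern must capture the gene in $1).
sub find_genes {
    my ($text, $pattern) = @_;
    my @genes;
    while ($text =~ /$pattern/g) {
        push @genes, $1;
    }
    return @genes;
}

my $lazy   = qr/TATAAT[ACGT]+?(ATG[ACGT]+?(?:TGA|TAG|TAA))/;
my $greedy = qr/TATAAT[ACGT]+(ATG[ACGT]+(?:TGA|TAG|TAA))/;

# Made-up sequence containing two genes, each preceded by TATAAT.
my $text = 'TATAATCCATGCCCTAAGGTATAATCATGGGGTGA';

print "lazy:   @{[ find_genes($text, $lazy) ]}\n";    # ATGCCCTAA ATGGGGTGA
print "greedy: @{[ find_genes($text, $greedy) ]}\n";  # ATGGGGTGA
```

The greedy version starts at the first TATAAT but lets [ACGT]+ run on to the last ATG in the string, so the first gene is swallowed.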

        CountZero


Re^2: search and print in perl
by hellworld (Novice) on Jun 01, 2009 at 11:58 UTC
    Thanks. My input.txt is like the following: TGGTACGACCGAACGAAAGAAAAAGAACACACACTGACCGGAGGGGTTGAATTGTTTGCCTGGCAC It goes on like this, random ACGT characters for the whole 6 lines.

      Could you provide a sample of a subsequence that is supposed to match (there is no TATAAT in this sample)?  Also, may the subsequences in question be split across more than one line? In this case, you would need to chomp the input lines to get rid of the newlines in your $text (as you're not accounting for them in the regex).

      Also, if your input is then all on one line, you probably want non-greedy matching ([ACGT]+?) of the parts in between the strings of interest, or else they'll gobble up more than you want...

      it goes on like this, Random ACGT characters for whole 6 lines

      If the input is multi-line and you have to match your regexp against the full input (all lines concatenated), then you should strip off the newline characters of each individual line:

      while ($line = <IN>) {
          chomp $line;    ## <---- !!!
          $text .= $line;
      }
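A small sketch of why the chomp matters, assuming a gene may straddle a line break. The helper read_sequence and the two-line sample data are made up for this example, and it uses a lexical, in-memory filehandle rather than the bareword IN so that it runs without an input file.

```perl
use strict;
use warnings;

# Concatenate all lines read from a filehandle into one string,
# optionally stripping the trailing newline from each line first.
sub read_sequence {
    my ($fh, $strip_newlines) = @_;
    my $text = '';
    while (my $line = <$fh>) {
        chomp $line if $strip_newlines;
        $text .= $line;
    }
    return $text;
}

# Made-up two-line input in which one gene straddles the line break.
my $data = "TATAATCCATGC\nCCTAA\n";
my $gene = qr/TATAAT[ACGT]+?(ATG[ACGT]+?(?:TGA|TAG|TAA))/;

for my $strip (0, 1) {
    # Opening a reference to a scalar gives an in-memory filehandle.
    open(my $fh, '<', \$data) or die "Can't open in-memory file: $!";
    my $text = read_sequence($fh, $strip);
    close $fh;
    my ($found) = $text =~ $gene;
    printf "chomp %s: %s\n", ($strip ? 'on ' : 'off'), $found // 'no match';
}
```

With the newlines left in, [ACGT]+? cannot cross the "\n" and nothing matches; with chomp the same pattern finds ATGCCCTAA.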

      citromatik