
Dear All,

Thank you very much for your time and suggestions.

I agree with Laurent_R that these DNA files could be very long and that loading them into an array at the beginning could cause memory problems. This is why I am trying to read the file in a while loop and check the conditions there. I tried to use an if condition in the program:

if (($seq =~ /[A|T|G|C]/) && ($lenseq == 19)) {
    print "$seq\n";
}
else {
    print "error log file";
}
# here I want to print those fragments whose length is either less than
# or greater than 19, and the fragments that contain bases other than [ATGC]

All this in a while loop so that I can read huge files without worrying about the memory issues.

Could I get some directions on how to check those conditions, so that the sequences are processed further only if the conditions are true?

Thanks to all of you

Replies are listed 'Best First'.
Re^3: Filter and writing error log file
by choroba (Cardinal) on Jul 23, 2014 at 13:29 UTC
    To check whether a string contains anything other than A, C, T, or G, search for the offending character. So, in your condition, use
    $seq !~ /[^ACTG]/

    Note that | is not needed in a character class (in fact, it matches literally, so avoid it if you don't want to match it).

    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

      Thanks for the suggestions. One question: why do we have to use '^' in the match rather than just

      [ATGC]?
        See perlre. The caret negates the class, so the regular expression matches non-ACTG characters, but I used !~ to negate that. It's like the difference between

        "The sequence doesn't contain invalid characters"

        and

        "The sequence contains valid characters"

        These two are not equivalent, as the second lacks the word "only".
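The difference between the two tests can be sketched with a few made-up sequences. The helper names `has_valid_base` and `only_valid_bases` and the sample strings are assumptions for illustration, not part of the original program.

```perl
use strict;
use warnings;

# Hypothetical helper: true if at least one character is a valid base.
sub has_valid_base {
    my ($seq) = @_;
    return $seq =~ /[ACGT]/ ? 1 : 0;
}

# Hypothetical helper: true if no character outside A, C, G, T occurs.
sub only_valid_bases {
    my ($seq) = @_;
    return $seq !~ /[^ACGT]/ ? 1 : 0;
}

# Made-up sample sequences, not taken from the thread.
for my $seq ('ACGTTGCA', 'ACGTNGCA', 'XYZ') {
    printf "%-8s contains a valid base: %d, only valid bases: %d\n",
        $seq, has_valid_base($seq), only_valid_bases($seq);
}
```

Note that an empty string also passes the second test, since it contains no offending character; the separate length check takes care of that case.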

      I have one loop-related question. I have defined an array of alphabets from ("A" .. "Z"), but after reading a long file the alphabets run out and the program reports errors about uninitialized values.

      My question is: how can I define an array of alphabets that continues with AA, BB, CC, and so on when "A" .. "Z" runs out?

        Alphabet stands for the whole series of letters. Use the word "letter" for a single character like "A" or "Z", please, not alphabet.

        Don't create an array. Just start with

        my $letter = 'A';

        In every iteration, do

        $letter++;

        and let the magic do all the work.
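A minimal sketch of that increment follows. The helper name `next_labels` is made up for the illustration. Note that after 'Z' the sequence continues with AA, AB, AC, and so on, rather than AA, BB, CC.

```perl
use strict;
use warnings;

# Return the first $count labels produced by the string increment,
# starting at 'A'. next_labels is a hypothetical helper name.
sub next_labels {
    my ($count) = @_;
    my $letter = 'A';
    my @labels;
    for (1 .. $count) {
        push @labels, $letter;
        $letter++;    # 'Z' becomes 'AA', 'AZ' becomes 'BA', and so on
    }
    return @labels;
}

my @labels = next_labels(28);
print "@labels[24 .. 27]\n";    # prints "Y Z AA AB"
```

Because the label is a single scalar that grows as needed, there is no array to run out of, which avoids the uninitialized-value warnings.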


        Thanks for explaining the '^' behaviour and how to loop through the letters.

        I have some more doubts and questions about another program of mine. I have written it, but it's very crude, and I would like some help making it more robust.

        Is it OK to post it here and get some help?

        Thanks
Re^3: Filter and writing error log file
by Laurent_R (Canon) on Jul 23, 2014 at 18:00 UTC
    Hi, you could have a number of next statements to discard records that are not good. For example:
    while (<$IN_FILE>) {
        chomp;
        next if /[^ACTG]/;       # removes lines with other letters
        next if length != 19;    # removes lines not 19 char long
        # I just made up the next rule for the example
        next if /(.)\1\1/;       # removes lines where the same letter comes three times in a row
        # etc.
        # now start doing the real processing
        # ...
    }
    The next statement goes directly to the next iteration of the while loop, so that faulty lines are effectively discarded early in the process.
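Since the original question also asks for an error log, the same loop could write each rejected line to a log handle before skipping it. The following is only a sketch under assumptions: `filter_sequences`, the log message format, and the sample data are invented for the example, and in-memory filehandles stand in for the real input, output, and log files.

```perl
use strict;
use warnings;

# filter_sequences is a hypothetical helper: it reads one sequence per line
# from $in, prints accepted lines to $out and rejected lines to $log.
sub filter_sequences {
    my ($in, $out, $log) = @_;
    while (my $seq = <$in>) {
        chomp $seq;
        if ($seq =~ /[^ACTG]/) {
            print {$log} "line $.: invalid character: $seq\n";
            next;
        }
        if (length($seq) != 19) {
            print {$log} "line $.: length ", length($seq), ": $seq\n";
            next;
        }
        print {$out} "$seq\n";    # real processing would start here
    }
    return;
}

# In-memory filehandles keep the sketch self-contained; real code would
# open the input, output, and log files instead.
my $input = join "\n", 'ACGTACGTACGTACGTACG', 'ACGTACGTNCGTACGTACG', 'ACGT', '';
my ($good, $bad) = ('', '');
open my $in,  '<', \$input or die "Cannot open input: $!";
open my $out, '>', \$good  or die "Cannot open output: $!";
open my $log, '>', \$bad   or die "Cannot open log: $!";
filter_sequences($in, $out, $log);
close $_ for $in, $out, $log;
print "accepted:\n$good", "rejected:\n$bad";
```

The file is still read one line at a time, so memory use stays flat regardless of the file size.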
      Thanks Laurent_R for the suggestion :-)