newtoperlprog has asked for the wisdom of the Perl Monks concerning the following question:

Dear All,

I have written a small piece of code which reads in 19bp fragment of DNA string and does some calculations and prints the results.

What I am trying to do is to apply a filter to the code such that if the target sequence (column 3 in sample data file) is less than or greater than 19bp and/or if the fragment contains letters other than A or T or G or C, it should write a log file with the errors showing sequence and error value of length. The program would do calculation only if the target sequence pass the above criteria. Could it be done while reading the file in while loop ( as I have tried below in the code)? Any help is greatly appreciated and I am also attaching the output form the program.

#!/usr/bin/perl use strict; use warnings; my $file = $ARGV[0]; my @file = split(/\./,$file); my $name = $file[0]; my $splitcolumn = $ARGV[1]; # splitcolumn value is 2 if ($#ARGV != 1 ) { print "Usage: Program_name filename column_number\n"; exit; } open (X, ">", $name."_detailed_info.txt"); open (Y, ">", "temp_".$name."_concatenate.dat"); open (ERR, ">",$name."_log.file"); my @alphabets = ("A".."H"); my @numbers = (1..12); my $cellno = 1; my $count=1; my $alphacounter=0; my $numcounter = 0; my $targetlen = 19; my $errseq=1; my $errlen =1; my @seq; #print X "#Name\tTarget_Seq(5'->3')\tTarget_len\tTarget_GC%\tShort_mRN +A(5'->3')\tSense_Strand\tSS_len\tSS_GC%\tAntiSense_Strand(5'->3')\tAn +it_Sense\tAS_len\n"; open (A, "<","$file") or die "Could not open the file, $!"; while(my $line = <A>) { chomp $line; if ($line =~/^\s*$/) {next;} my @temp = split(/\s/,$line); # my $seq = uc($temp[2]); #Target mRNA, length 19 bp my $seq = uc($temp[$splitcolumn]); #Target mRNA, length 19 bp my @tempseq = split (//,$seq); foreach my $tempseq(@tempseq) { if ($tempseq !~/[ATGC]/ig) { print ERR "Error: Target sequence # $errseq: [".$seq."] ha +s bases other than ATGC.\n"; next; } } my $lenseq = length($seq); if ($lenseq < $targetlen || $lenseq > $targetlen) { print ERR "Error: Target sequence # $errseq: [".$seq."] is + of ".length($seq)."-bp.\n"; next; } $errseq++; my $a = ($seq=~tr/A//); my $t = ($seq=~tr/T//); my $g = ($seq=~tr/G//); my $c = ($seq=~tr/C//); my $GCper = sprintf ("%.2f", (($g+$c)/$lenseq)*100); my $ATper = sprintf ("%.2f", (($a+$t)/$lenseq)*100); my $mrna = $seq; $mrna =~ s/T/U/g; $mrna =~ /\w{3}(.*)\w{1}/; my $shortmrna = $1; #Short mRNA, first 3 and last 1 base removed # print "$shortmrna\n"; my $lenshortmrna = length ($shortmrna); my $aa = ($shortmrna=~tr/A//); my $uu = ($shortmrna=~tr/U//); my $gg = ($shortmrna=~tr/G//); my $cc = ($shortmrna=~tr/C//); my $GCCper = sprintf ("%.2f", (($gg+$cc)/$lenshortmrna)*100); my $ATTper = sprintf ("%.2f", (($aa+$uu)/$lenshortmrna)*100); my $antisense = $seq; $antisense =~ tr/ATGCatgc/UACGuacg/; my $reverse = "AA".reverse($antisense); #Antisense strand made fro +m target mRNA, added two A's at beginning. # print "$reverse\n"; my $lenreverse = length($reverse); print X "$name$count\t$seq\t$lenseq\t$GCper\t$shortmrna\tSS\t$lens +hortmrna\t$GCCper\t$reverse\tAS\t$lenreverse\n"; #Detailed informatio +n print print Y "$name$count\tSS\t$shortmrna\n"; #tempfile print print Y "$name$count\tAS\t$reverse\n"; #tempfile print $count++; } close (A); close (X); close (Y);
K-ras 132 caggctcaggacttagcaa GCUCAGGACUUAGCA AAUUGCUAAGUCCU +GAGCCUG 47.6 41.9 19 15 21 K-ras 135 cttagcaagaagttatgaaa AGCAAGAAGUUAUGG AAUCCAUAACUUCU +UGCUAAG 33.3 36.5 19 15 21 K-ras 133 ggcycaggacttagcaaga UCAGGACUUAGCAAG AAUCUUGCUAAGUC +CUGAGCC 47.6 39.2 19 15 21 K-ras 134 caggacttagcaaoooaagtt GACUUAGCAAGAAGU AAAACUUCUUGCUA +AGUCCUG 38.1 36.5 19 15 21
My output: Error: Target sequence # 2: [CTTAGCAAGAAGTTATGAAA] is of 20-bp. Error: Target sequence # 2: [GGCYCAGGACTTAGCAAGA] has bases other than + ATGC. Error: Target sequence # 3: [CAGGACTTAGCAAOOOAAGTT] has bases other th +an ATGC. Error: Target sequence # 3: [CAGGACTTAGCAAOOOAAGTT] has bases other th +an ATGC. Error: Target sequence # 3: [CAGGACTTAGCAAOOOAAGTT] has bases other th +an ATGC. Error: Target sequence # 3: [CAGGACTTAGCAAOOOAAGTT] is of 21-bp.

Replies are listed 'Best First'.
Re: Filter and writing error log file
by Laurent_R (Canon) on Jul 22, 2014 at 21:44 UTC
    Could it be done while reading the file in while loop ( tas I have tried below in the code)?

    Yes, by all means, not only it can be done this way, but this is most often the best way you can do it (i.e. when the current data under examination doesn't need to know about the previous or next data chunk to be validated, examined or used). Especially with DNA files which, as far as I know(I am not a bio guy), can be very large. Using a while loop on your file (i.e. a file iterator) makes it possible for you to read absolutely huge files without ever running into out-of-memory problems (it might take time, but at least you have a very high probability of running your program to the end).

    I am working almost daily with huge files (typically between 3 and 15 GB, sometimes as much as 200 GB). In such cases, slurping the file into memory is just not an option, my program would die. Reading it with an iterator (a 'while (my $line = <$IN>) {' type of construct) is the only solution, and it does not use more memory than the size needed for the longest line.

    I had a relatively similar problem over the last days and found the solution this morning. A proprietary database (with no Perl module/driver), but implementing a protocol similar to SQL. I needed to load a rather large quantity of data into memory, and then process the main table using the data in memory. My first attempt last week ran out of memory and the program crashed. Before trying to load the main table, my original program was already using 155,000 blocks of memory (my best guess is that a memory block is 8 kB, but not sure). Anyway, after having loaded 155,000 blocks, trying to load the main table failed for lack of memory. After some experimentation, I was able to reduce memory consumption (changing hashes of hashes to hashes of strings), but the main improvement was to be able to use an iterator on the main table with a syntax as follows:

    open my $invoice_lines, "-|", $command or die "could not fork $!"; while (my $inv_line = <$invoice_lines>) { #...
    Having done these changes, my program is never using more than 62,000 blocks, so that it can be considered as fairly safe.

    Other comments on your code: your identifiers are very poor, X, Y and A don't say anything about the content. Similarly, $a, $t, $g and $c may seem OK when talking about DNA, I would suggest you use at least two letters for better identification. Also, the $a (and also $b) variable has a special meaning in Perl (used especially for sorting) and should probably be avoided for other purposes.

    Not sure I answered your question, but not sure about what your question really was. I hope that I gave at least some indications.

      Dear All,

      Thank you very much for your time and suggestions.

      I agree Laurent R that these DNA files could be very long and loading them in an array at the beginning could pose a memory problems. This is the reason why I am trying to read the file in a while loop and checking for conditions. I tried to use if condition in the program

      if (($seq =~/[A|T|G|C]/) && ($lenseq == 19)) { print "$seq\n"; } else {print "error log file";} # here I want to print those fragem +nts whose length is either less than or greater than 19 and if the fr +agments contains based other than [ATGC]

      All this in a while loop so that I can read huge files without worrying about the memory issues.

      Could it be possible to get some directions as to how to check those condition and only if the conditions are true the sequences are processed further.

      Thanks to all of you

        To check that a string contains something other than A, C, T, or G, search for the offending character, so in your condition, use
        $seq !~ /[^ACTG]/

        Note that | is not needed in a character class (in fact, it matches literally, so avoid it if you don't want to match it).

        لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
        Hi, you could have a number of next statements to discard records that are not good. For example:
        while (<$IN_FILE>) { chomp; next if /[^ACTG]/; # removes lines with other letters next if length != 19; # removes lines not 19 char long # I just made up the next rule for the example next if /(.)\1\1/; # removes lines where the same letter comes th +ree times in a row # etc. # now start doing the real processing # ... }
        The next statement goes directly to the next iteration of the while loop, so that faulty lines are effectively discarded early in the process.
Re: Filter and writing error log file
by trippledubs (Deacon) on Jul 22, 2014 at 19:45 UTC
    This uses a hash to list each criteria test. Per each line of the file, the third column must pass each test for the code to reach the bottom portion. One test, that each input must be 19 characters long, is inline in the test_criteria hash. The other test, must contain other than atgc characters has been separated into its own subroutine.
    #!/usr/bin/env perl use strict; use warnings; sub not_only_atgc { my $sequence = shift; if ( $sequence =~ /[^atgc]/i ) { return 1; } return 0; } my %test_criteria = ( "Must not be 19 characters long" => sub { return 1 if (length($_[0]) != 19); }, "Must contain other than atgc" => \&not_only_atgc, ); my $lineno = 0; # Read one line at a time open( my $fh, "data.txt" ) or die "$!"; LINE: while ( my $line = <$fh> ) { $lineno++; my @words = split /\s+/, $line; print "Processing line $lineno interested in $words[2]\n"; TEST: for my $criteria_check ( keys %test_criteria ) { if ( $test_criteria{$criteria_check}->( $words[2] ) ) { # Test + returned 1 } else { print "Fail: $criteria_check\n\n"; next LINE; } } ## All checks have passed after this line ## print "$words[2] passes all criteria\n\n"; }
    Updated -- OUTPUT
    Processing line 1 interested in caggctcaggacttagcaa Fail: Must contain other than atgc Processing line 2 interested in cttagcaagaagttatgaaa Fail: Must contain other than atgc Processing line 3 interested in ggcycaggacttagcaaga Fail: Must not be 19 characters long Processing line 4 interested in caggacttagcaaoooaagtt caggacttagcaaoooaagtt passes all criteria
Re: Filter and writing error log file
by Anonymous Monk on Jul 22, 2014 at 19:37 UTC
    Split your script into several subroutines and you'll easily see how to do what you want. Also, perltidy will make your program look less messy.