artu5 has asked for the wisdom of the Perl Monks concerning the following question:

Hi everyone! First of all, I'm sorry for my English, I'm not mothertongue and I hope what I will write is understandable. I'm writing you for an help: I'm studying for the first time in my life bioinformatics and Perl and I wrote this program but there are two things I can't do. Could you help me? Here it is my program:
#!/usr/bin/perl # quality trimming of fastq file print "Write the window length\n"; $kmer=<STDIN>; chomp ($kmer); print "Write the minimum quality score cut-off\n"; $cut_off=<STDIN>; chomp ($cut_off); $file = "lib-pool-fosmid.fastq"; if (open(FASTQ,$file)) { while($header1=<FASTQ>) { $dna=<FASTQ>; $header2=<FASTQ>; $qual=<FASTQ>; @dna=split ('', $dna); @qual=split ('', $qual); @num_value=(); @scores=(); foreach $qscore (@qual) { $num_value=ord($qscore)-33; push(@num_value, $num_value); } foreach $value (@num_value) { if ($value<$cut_off) { pop(@num_value); + }else{ last; } } $sub1=substr($dna,-@num_value); @qscopy=reverse @num_value; foreach $value (@qscopy) { if ($value<$cut_off) { pop(@qscopy); }else{ last; } } $sub2=substr($sub1,0,@qscopy); @sub2=split('',$sub2); @qscopy=reverse @qscopy; @final_values=(); for ($i=0;$i<@qscopy-($kmer-1);$i=$i+$kmer) { @scores=@qscopy[$i..$i+$kmer-1]; $sum=0; foreach $score (@scores) { $sum+=$score; } if (($sum/$kmer)>$cut_off) { push(@final_values,@sub2[$i..$i+$kmer-1]); } } print "$header1"; foreach $base (@final_values) { print "$base"; } if (@final_values != 0){ print "\n\n"; } } }else{ print "Error! Can't find the file\n"; }
First, I don't know how I could write my output in order to still have a fastq format, written in a file and not on the screen. Then, I'm not so conveinced about my output... printed sequences have the same lenght. Maybe there is a problem in the sliding window operation but I don't know where it is. I know if I start with + = good and - = bad, my result should be: +++--++ input +++ output and not +++--++ input +++++ output I don't know how I could stop at the first "bad" window... Could you help me? Thanks a lot Artu5

Replies are listed 'Best First'.
Re: Quality trimming of fastq file
by AnomalousMonk (Archbishop) on Jan 08, 2018 at 20:55 UTC

    I strongly agree with Cristoforo's advice to avoid modifying the size of an array over which one is iterating; see Foreach Loops:

    If any part of LIST is an array, "foreach" will get very confused if you add or remove elements within the loop body, for example with "splice". So don't do that.
    My advice to anyone who describes him- or herself as "... studying for the first time in my life ... Perl ..." is to shun such an unclean practice.

    I don't understand the nature and structure of the data in the array being processed by the two foreach-loops in question. If you could describe this data and what you are trying to achieve (with input-data and output-data examples), perhaps we could suggest a more straightforward (and safer!) way to accomplish the same end. See Short, Self-Contained, Correct Example.

    Update: I wonder if you are aware that foreach iterates over a list (which may contain one or more arrays or consist entirely in an array) from lowest to highest list item index, while pop removes the highest index element from an array? The two foreach-loops in question are testing elements in an array from low to high and removing the highest elements.

    c:\@Work\Perl\monks>perl -wMstrict -le "my @ra = 1 .. 9; foreach my $element (@ra) { printf qq{<$element> }; } print qq{\n}; ;; pop @ra; ;; foreach my $element (@ra) { printf qq{<$element> }; } print qq{\n}; " <1> <2> <3> <4> <5> <6> <7> <8> <9> <1> <2> <3> <4> <5> <6> <7> <8>


    Give a man a fish:  <%-{-{-{-<

Re: Quality trimming of fastq file
by Cristoforo (Curate) on Jan 08, 2018 at 19:44 UTC
    Hi,

    What does the input file look like? Are any of these formats in the form that you have?

    One problem I see is you are altering an array with pop while iterating over the array and this will likely cause problems.

    Your code:

    foreach $value (@num_value) { if ($value<$cut_off) { pop(@num_value); + }else{ last; } } $sub1=substr($dna,-@num_value); @qscopy=reverse @num_value; foreach $value (@qscopy) { if ($value<$cut_off) { pop(@qscopy); }else{ last; } }
    First you iterate over @num_value popping off elements (wrongly I believe) and then you reverse the array into @gscopy and pop again, (wrongly again I believe), if the amount is less than $cut_off.
Re: Quality trimming of fastq file
by NetWallah (Canon) on Jan 08, 2018 at 21:14 UTC
    For a less error-prone approach, you should be using Bio::SeqIO::fastq for fastaq formatted I/O.

                    We're living in a golden age. All you need is gold. -- D.W. Robertson.

Re: Quality trimming of fastq file
by Laurent_R (Canon) on Jan 09, 2018 at 00:09 UTC
    Without seeing your input data (although we can probably guess at least the input format from the link provided by NetWallah) and especially without knowing the output that you want, it is very difficult for us to help you. So, please help us to help you, provide some sample input and desired output. And also explain the desired output if it is not obvious.

    I'll not comment further on the fact that you're popping values from the array being visited by your foreach statement (and the fact that it seems weird that you are using pop in that context where it would seem that you probably want to shift values in that context, but you shouldn't do it anyway), I guess the point has been made clear.

    I don't know how I could write my output in order to still have a fastq format, written in a file and not on the screen.
    If you don't know how to write to a file, then you simply need to open the file at the beginning of your script:
    my $outfile = "result.txt"; open my $OUT, ">", $outfile or die "could not open $outfile $!"; # idi +omatic way to open a file for writing
    This line opens file result.txt for writing (">") and assigns it to the file handle $OUT.

    Once this is done (do that only once), anytime you want to write to this file just add the $OUT file handle to your print statement:

    print $OUT "text to be printed to the file\n";
    At the end, it is good practice to close the file:
    close $OUT;
    although it is often not really necessary (Perl will usually close it for you when exiting, in some cases even before that).

    I hope this helps you getting forward.

    There are many other things to be said about your code, but I will only add one thing. You should have these two lines (called pragmas) at the top of your script:

    use strict; use warnings;
    This will force you to declare each variable (usually with the my keyword). For example, taking the first few lines of your script:
    #!/usr/bin/perl use strict; use warnings; # quality trimming of fastq file print "Write the window length\n"; my $kmer = <STDIN>; chomp $kmer; print "Write the minimum quality score cut-off\n"; my $cut_off = <STDIN>; chomp $cut_off; my $file = "lib-pool-fosmid.fastq";
    You might initially think it is an annoyance to have to declare each variable before you use it, because you'll get a lot of errors when trying to run your program once you've added these two pragmas, but you'll learn pretty soon that this is in fact very helpful: Perl will tell you early (often at compile time) about many errors that you might make and that would otherwise go unnoticed until much later at execution time (and it is generally much better to see the errors early).

    The point is that, if you use these pragmas, Perl will find many of your errors for you and tell you about them, and you'll save a lot of time at the end of the day.

    Update: removed a silly syntax mistake in the file opening statement. Thanks to AnomalousMonk for pointing it out.

Re: Quality trimming of fastq file
by poj (Abbot) on Jan 09, 2018 at 07:55 UTC

    My best guess at what you are trying to do

    #!/usr/bin/perl use strict; # quality trimming of fastq file print "Write the window length\n"; my $kmer = 3;#<STDIN>; chomp ($kmer); print "Write the minimum quality score cut-off\n"; my $cut_off = 21;#<STDIN>; chomp ($cut_off); my $file = "lib-pool-fosmid.fastq"; open FASTQ,'<',$file or die "Could not open $file : $!"; open OUT,'>','trimmed_'.$file; my @seq = (); while (my $header1 = <FASTQ>){ # input 4 line records my @rec = (); $rec[0] = $header1; do { $rec[$_] = <FASTQ> } for (1..3); chomp(@rec); # process @rec = trim(@rec); # output print "\nOutput written to 'trimmed_$file'\n"; print "$_\n" for @rec; print OUT "$_\n" for @rec; } # trim sub trim { my @rec = @_; my @dna = split '', $rec[1]; my @qual = split '', $rec[3]; my @score = map{ ord($_)-33 } @qual; # lower quality from start my $start = 0; for (@score){ last if ($_ >= $cut_off); ++$start; } # lower quality from end my $end = @qual -1; for (reverse @score){ last if ($_ >= $cut_off); --$end; } print "Start=$start End=$end\n"; # trim @dna = @dna[$start..$end]; @qual = @qual[$start..$end]; @score = @score[$start..$end]; print "Trimmed ".join '',@dna,"\n"; # check each window print "\nWindows\n"; my @final_dna; my @final_qual; while (@dna > 0){ my @dna1 = splice @dna,0,$kmer; my @qual1 = splice @qual,0,$kmer; my @score1 = splice @score,0,$kmer; my $sum = 0; $sum += $_ for @score1; my $avg = $sum/@score1; print "@dna1 | @qual1 | @score1 | Sum=$sum Avg=$avg\n"; if ($avg > $cut_off){ push @final_dna,@dna1; push @final_qual,@qual1; } } $rec[1] = join '',@final_dna; $rec[3] = join '',@final_qual; return @rec; } __DATA__ @SEQ_ID GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT + !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
    update : corrected output
    poj