uuuiii has asked for the wisdom of the Perl Monks concerning the following question:

Hi! :) I'm new here and I'm also at my first approaches to Bioinformatics. I found this Perl script and I don't really understand all the passages, in particular in the end. Could someone explain me it, maybe telling me why programmers used these ways to write the script and not others? I know maybe this is not a real good question but I'd really like to understand this script and I don't know how to do it... I have already search for the meanings of the different features in the manual but I can't understand why some of them have been applied here... Thank you very much and I hope you could understand my English!

Update: crossposted here http://www.biostars.org/p/297182/

$file=shift; print "Window length\n"; $kmer=<STDIN>; chomp ($kmer); print "Minimum quality score cut-off\n"; $cut_off=<STDIN>; chomp ($cut_off); open (MYFILE, ">R.txt"); if (open(FASTQ,$file)) { while($header1=<FASTQ>) { $dna=<FASTQ>; $header2=<FASTQ>; $qual=<FASTQ>; @dna=split ('', $dna); @qual=split ('', $qual); @num_value=(); @scores=(); foreach $qscore (@qual) { $num_value=ord($qscore)-33; push(@num_value, $num_value); } foreach $value (@num_value) { if ($value<$cut_off){ pop(@qual); + }else{ last; } } $sub1=substr($dna,-$#qual); $sub11=substr($qual,-$#qual); @qscopy=reverse @num_value; foreach $value (@qscopy) { if ($value<$cut_off){ pop(@qual); }else{ last; } } $sub2=substr($sub1,0,$#qual); $sub22=substr($sub11,0,$#qual); for ($i=0;$i<=$#qual-($kmer-1);$i++) { @scores=@num_value[$i..$i+$kmer-1]; $sum=0; foreach $score (@scores) { $sum+=$score; } if (($sum/$kmer)<$cut_off){ last; } } $sub3=substr($sub2,0,($i-1)); $sub33=substr($sub22,0,($i-1)); print MYFILE "$header1$sub3\n$header2$sub33\n\n"; } }else{ print "Error!\n"; } close MYFILE;

Replies are listed 'Best First'.
Re: explanation of passages of a Perl script
by hippo (Archbishop) on Feb 06, 2018 at 14:24 UTC

    I'm glad you haven't written this yourself because it is not in a good way. There are 2 problems with the layout which don't help with analysis:

    1. Inconsistent indenting and padding
    2. No comments or documentation whatsoever

    We can fix the first by using perltidy. This gives the much more legible:

    $file = shift; print "Window length\n"; $kmer = <STDIN>; chomp ($kmer); print "Minimum quality score cut-off\n"; $cut_off = <STDIN>; chomp ($cut_off); open (MYFILE, ">R.txt"); if (open (FASTQ, $file)) { while ($header1 = <FASTQ>) { $dna = <FASTQ>; $header2 = <FASTQ>; $qual = <FASTQ>; @dna = split ('', $dna); @qual = split ('', $qual); @num_value = (); @scores = (); foreach $qscore (@qual) { $num_value = ord ($qscore) - 33; push (@num_value, $num_value); } foreach $value (@num_value) { if ($value < $cut_off) { pop (@qual); } else { last; } } $sub1 = substr ($dna, -$#qual); $sub11 = substr ($qual, -$#qual); @qscopy = reverse @num_value; foreach $value (@qscopy) { if ($value < $cut_off) { pop (@qual); } else { last; } } $sub2 = substr ($sub1, 0, $#qual); $sub22 = substr ($sub11, 0, $#qual); for ($i = 0; $i <= $#qual - ($kmer - 1); $i++) { @scores = @num_value[$i .. $i + $kmer - 1]; $sum = 0; foreach $score (@scores) { $sum += $score; } if (($sum / $kmer) < $cut_off) { last; } } $sub3 = substr ($sub2, 0, ($i - 1)); $sub33 = substr ($sub22, 0, ($i - 1)); print MYFILE "$header1$sub3\n$header2$sub33\n\n"; } } else { print "Error!\n"; } close MYFILE;

    See how much more structured that appears?. Now, here's what's wrong with the actual code:

    1. No strict
    2. No warnings
    3. No sanitising of arguments or inputs
    4. No declarations of variables
    5. Use of multiple variables ($sub1, $sub2, $sub3) where arrays would be better
    6. Use of C-ish rather than perlish "for" loops
    7. 2-argument form of open where 3-argument would be better
    8. Bareword instead of lexical filehandles (this might be because it's very old code but it's an easy one to update)

    As to why they have written it like this, I could not say. Without comments, documentation, sample inputs and so on it is a guessing game. If there's one point in particular which you don't understand then maybe highlight that?

      Thank you for answering

      I have doubts about all the script because like you I think it is a bit confusing... I'm a new learner of Perl and in general of programming language and as exercise I'm trying to modify this script in order to make it better.

      I understand for what it has been done (it is a quality trimming for a FASTQ file) but if I were the programmer I dont' think I would have written the script like that.

      Maybe this will be a stupid question but I have serious problems to understand the variables (like $sub) that have been used, in particular when we have the feature "substr". I have to find more info in the manual and I will do that for sure... also the variables -$#qual: why I have to write them like that?

      Moreover, I noted the absence of the use strict, use warnings and probably of use diagnostics... In fact I rewrite the script using them and with use script I don't have any problem but when I use warnings and diagnostics I have an error of unitialized value in operation that I can't really resolve (if you prefer I could write you down my script but it is practically the same thing only with "my". I have also changed the way to read the FASTQ file preferring the simple way to write directly in the script the name of the FASTQ file of my interest).

      I'm really sorry to annoy all of you with these problems but I would really like to understand how to work with this script and how I could ameliorate it...

      Thank you very much
        I have serious problems to understand the variables (like $sub) that have been used, in particular when we have the feature "substr".

        substr() is a built-in function in perl. All of the built-in functions are documented and if your machine has the perl documentation installed you can simply run perldoc -f substr to read it. Alternatively, it is available online here.

        also the variables -$#qual: why I have to write them like that?

        $#qual is a special syntax, it returns the index of the last element in the array @qual. See the Arrays: A Tutorial/Reference for a discussion of this and many other aspects of using arrays in perl.

        when I use warnings and diagnostics I have an error of unitialized value in operation that I can't really resolve

        You can find the offending term and then surround it with a suitable "if" statement. Suppose $foo might sometimes be undefined, but you want to print it, you might then use:

        if (defined $foo) { print $foo; }

        or, equivalently

        print $foo if defined $foo;

        Perl is a very big language with lots to learn. I think you would benefit from reading some of the Tutorials here covering the areas that you know less well. Stick at it - you will soon be amazed at what you can achieve.

        Further to hippo's excellent comments, I would add with regard to

        $dna = <FASTQ>; ...; $qual = <FASTQ>; @dna = split ('', $dna); @qual = split ('', $qual);
        and similar from the tidied source that the code uses the same name for variables of different type. IMHO, it's not, in general, good practice to do this; eventually, you'll give yourself a terrible headache. Don't do that. In general, use informative, distinctive names for variables, subroutines, etc.


        Give a man a fish:  <%-{-{-{-<

      Lets see you rewrite that cish foreach loop.... Yeah thought so
        for my $i (0 .. @qual - $kmer) {

        Q.E.D.

Re: explanation of passages of a Perl script
by Eily (Monsignor) on Feb 06, 2018 at 16:20 UTC

    I'd really like to understand this script
    That's probably a bad idea. Either you just want to understand how it works, and you've already seen by yourself that it's probably not the good way to do it, so you shouldn't learn from it ; or you want to use it as a base for custom modifications, and I'm pretty sure that writing from scratch will be easier (there's only 78 lines of code, it's like a lot of work went into it).

    The thread that poj referenced contains a link to the Bio::SeqIO::fastq module, which might be the best starting point for you if you want to reuse something that works. Or if you want to rewrite the whole thing (that's one way to learn), it will be easier for us to help you on specific blocking points rather than this quite obscure script.

Re: explanation of passages of a Perl script
by poj (Abbot) on Feb 06, 2018 at 15:48 UTC
      don't know, I found the script online. I will read this post, maybe I found something useful. Thanks

      Update: yes, I wrote also there in order to have more info. Is that a problem? if it is I will remove the post.

        ... I [crossposted] to have more info. Is that a problem?

        Crossposting is not a problem, but it is considered polite to note and link to the crosspost so that the efforts of those trying to help you are not needlessly duplicated at the far-flung corners of the InterWebs. Please see What shortcuts can I use for linking to other information?

        Update:

        Is that a problem? if it is I will remove the post.
        In general, removing a post is Frowned Upon. Updating a post to add more info is ok — like this update! Please see How do I change/delete my post?


        Give a man a fish:  <%-{-{-{-<

Re: explanation of passages of a Perl script
by Anonymous Monk on Feb 06, 2018 at 14:40 UTC

    The last verb, used quite-frequently here, causes the enclosing for or foreach loop to end immediately.

    Like you, I find the code as-written to be cryptic and confusing. Variable names like $sub2 and $sub22 do not convey any semantic meaning or clues. Simple comments, directed to "the next programmer who will be looking at this," would help enormously. "Talk to me ... tell me what you are doing here, and why. Use variable-names that are suggestive." Use use strict; use warnings; and declare all of the variables that you use. Such niceties might mean nothing to the Perl compiler, but they mean everything to real people.