bugiep has asked for the wisdom of the Perl Monks concerning the following question:

Hi Everyone,

I am running a regex over my FASTA data, which contains multiple protein sequences, each preceded by a description header line. My output should consist of each header line followed by its sequence. When I print to the output file, the first header is missing; all the other header lines are present. Any insight would be very welcome. Below is my Perl script:

    use strict;
    use warnings;

    open IN, "zea_mays.txt";
    open OUT, ">zea_mays1.txt";

    my @peptides;
    my $seq;
    my $flag = 0;

    while(my $line = <IN>){
        chomp($line); #check the chomp function
        if ($line =~ /^>/) {
            if ($flag == 0){ #the first protein entry, which means nothing in the memory to do
                $flag = 1;
                next;
            }
            print "\n";
            print OUT " $line\n ";
        }else {
            $line =~ s/\s//g;
            my @peptides = split(/(?<=[RK](?!P))/,$line);
            print OUT "@peptides\n";
        }
    }
    close IN;
    close OUT;

Replies are listed 'Best First'.
Re: First header missing from FASTA data
by tangent (Parson) on Mar 15, 2012 at 01:42 UTC
    When you say if ($flag == 0) { ... next; } you are telling the program to skip the first header line without printing it. Remove that if block and the first header will be printed as well. NYULMC?
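A minimal sketch of that suggestion, not tangent's own code: the loop below is the original one with the $flag test removed. The sample records in @lines are hypothetical stand-ins for zea_mays.txt, and the results are collected in @out rather than written to a file so the sketch runs on its own.

```perl
use strict;
use warnings;

# Hypothetical sample records standing in for the lines of zea_mays.txt.
my @lines = (
    ">seq1 first protein",
    "MARKPLK",
    ">seq2 second protein",
    "AKRT",
);

my @out;
for my $line (@lines) {
    if ($line =~ /^>/) {
        # No $flag test here, so the first header is kept like every other one.
        push @out, $line;
    }
    else {
        # Work on a copy with all whitespace removed.
        (my $seq = $line) =~ s/\s//g;
        # Split after R or K unless the next residue is P (regex from the original post).
        my @peptides = split /(?<=[RK](?!P))/, $seq;
        push @out, "@peptides";
    }
}

print "$_\n" for @out;
```

With this sample input the sketch prints both headers, each followed by its split sequence ("MAR KPLK" and "AK R T").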
Re: First header missing from FASTA data
by Khen1950fx (Canon) on Mar 15, 2012 at 03:13 UTC
    I edited your script and used <code></code> tags---never leave home without them:).
    #!/usr/bin/perl -l
    use strict;
    use warnings;

    my(@lines) = '/tmp/zea_mays.txt';
    my $log = '/tmp/zea_mays.log';
    die $! unless open IN, '<', @lines;
    die $! unless open OUT, '>', $log;

    my $peptides;
    my $seq;
    my $flag = 0;

    while ( defined( my $line = readline \*IN ) ) {
        do {
            foreach $line (@lines) {
                chomp $line;
                if ( $line =~ /^>/ ) {
                    next if $flag > 0;
                    print "\n";
                    print OUT "$line";
                }
                else {
                    $line =~ s/\s//g;
                    my (@peptides) = split( /(?<=RK(?!P))/, $line );
                    print OUT "@peptides";
                }
            }
        };
    }
    close IN;
    close OUT;

      Not a very good translation of:

      use strict;
      use warnings;

      open IN, "zea_mays.txt";
      open OUT, ">zea_mays1.txt";

      my @peptides;
      my $seq;
      my $flag = 0;

      while(my $line = <IN>){
          chomp($line); #check the chomp function
          if ($line =~ /^>/) {
              if ($flag == 0){ #the first protein entry, which means nothing in the memory to do
                  $flag = 1;
                  next;
              }
              print "\n";
              print OUT " $line\n ";
          }else {
              $line =~ s/\s//g;
              my @peptides = split(/(?<=[RK](?!P))/,$line);
              print OUT "@peptides\n";
          }
      }
      close IN;
      close OUT;


      my(@lines) = '/tmp/zea_mays.txt';

      Why store a scalar value in an array?

      while ( defined( my $line = readline \*IN ) ) { do { foreach $line (@lines) {

      And then here you loop through that single value, a file name, and ignore the lines read from the file. And what's up with the superfluous do block?
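For comparison, here is a hedged sketch of a read loop without those two problems: the file name sits in a scalar, and the loop body works on the line that was actually read. The temporary file and its contents are hypothetical, created only so the example runs anywhere; $file, $in and @seen are illustrative names, not taken from either script.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical stand-in for /tmp/zea_mays.txt so the sketch is self-contained.
my ($tmp_fh, $file) = tempfile(UNLINK => 1);
print {$tmp_fh} ">seq1\nMARK\n>seq2\nAKR\n";
close $tmp_fh or die "Cannot close $file: $!";

# A single file name belongs in a scalar, not an array.
open my $in, '<', $file or die "Cannot open $file: $!";

my @seen;
# Each iteration handles one line read from the file; no inner foreach or do block.
while (my $line = <$in>) {
    chomp $line;
    push @seen, $line;
}
close $in;

print "$_\n" for @seen;
```

The lexical handle $in and the three-argument open are a matter of style here; the point is only that $line comes from the file and is used directly.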



      my (@peptides) = split( /(?<=RK(?!P))/, $line );

      The original used the character class [RK] but you are using the string RK?
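A small sketch of why that matters, using a made-up sequence: [RK] matches a single R or a single K, while RK only matches the two letters together, so the two patterns split in different places.

```perl
use strict;
use warnings;

# Hypothetical sequence chosen to contain K, R, RK and KP.
my $seq = "AKBRKCKPD";

# Character class: split after any R or K not followed by P.
my @by_class = split /(?<=[RK](?!P))/, $seq;

# Literal string: split only after the pair "RK" not followed by P.
my @by_string = split /(?<=RK(?!P))/, $seq;

print "class:  @by_class\n";    # class:  AK BR K CKPD
print "string: @by_string\n";   # string: AKBRK CKPD
```

The character class yields four pieces, the literal string only two, because "RK" occurs just once in this sequence.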

Re: First header missing from FASTA data
by bugiep (Initiate) on Mar 15, 2012 at 12:10 UTC
    Many thanks, guys, I appreciate your assistance. I solved the problem as suggested by 'tangent'. Khen1950fx, I shall study the way you approach the problem; it might come in handy sometime.