Re^18: how to read input from a file, one section at a time?

I would suggest developing a separate program to clean up your data and leaving the working program alone. It is always preferable to have clean data to process, and I think two separate steps are easier to debug. For example:

#!/usr/bin/perl
# cleanup.pl
use strict;
use warnings;

print 'PLEASE ENTER THE FILENAME OF THE PROTEIN SEQUENCE: ';
chomp( my $prot_filename = <STDIN> );

open my $PROTFILE, '<', $prot_filename 
  or die "Cannot open '$prot_filename' because: $!";

my $out_filename = 'cleaned_'.$prot_filename;  
open my $OUTFILE, '>', $out_filename 
  or die "Cannot open '$out_filename' because: $!";
  
$/ = ''; # Set paragraph mode

my %fasta_seen;  # sequences seen so far
my $header;
my $count_in  = 0;  # initialised so the final printf does not warn on an empty file
my $count_out = 0;
while ( my $record = <$PROTFILE> ) {
  ++$count_in; 
  if ( $record =~ s/^>(.*)//m ){
    $header = $1;
    # skip fragments
    next if $header =~ /\(Fragments\)/i;
  };  
    
  # Remove comment line(s)
  $record =~ s/^\s*#.*//mg; 

  # trim trailing spaces  
  $record =~ s/\s+$//;

  # skip duplicated
  if ( $fasta_seen{ $record }++ ){
    print $OUTFILE "\n";
  } else {
    print $OUTFILE '>'.$header.$record."\n\n";  # restore the '>' removed with the header
    ++$count_out;
  }
}
close $OUTFILE;
close $PROTFILE;
printf "%d records read from %s\n",$count_in,$prot_filename;
printf "%d records written to %s\n",$count_out,$out_filename;
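To see what paragraph mode and the %seen-style hash do without touching real files, here is a minimal sketch that reads from an in-memory string. The sample data and the sub name unique_records are made up for illustration, and it assumes the '>' marker should be kept on each header line.

```perl
use strict;
use warnings;

# Hypothetical sample data: seq2 repeats the sequence of seq1
my $sample = ">seq1 first\nMKV\nLLA\n\n\n>seq2 second\nMKV\nLLA\n\n>seq3 third\nGGH\n";

# Return the records whose sequence has not been seen before
sub unique_records {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    local $/ = '';    # paragraph mode: records are separated by blank lines
    my ( %seen, @kept );
    while ( my $record = <$fh> ) {
        # Strip the header line and remember its text (without the '>')
        my $header = $record =~ s/^>(.*)\n?//m ? $1 : '';
        $record =~ s/\s+$//;          # trim trailing white space
        next if $seen{$record}++;     # skip sequences already seen
        push @kept, ">$header\n$record";
    }
    close $fh;
    return @kept;
}

print "$_\n\n" for unique_records($sample);
```

Using local $/ inside the sub keeps the paragraph-mode setting from leaking into the rest of the program.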

I'm sure an acknowledgement of perlmonks.org would be appreciated by the community here.

poj

Re^19: how to read input from a file, one section at a time? by davi54 (Sexton) on Apr 02, 2019 at 21:11 UTC
Thank you so much Poj. Will do. :)
Re^19: how to read input from a file, one section at a time? by davi54 (Sexton) on Apr 03, 2019 at 16:49 UTC
Hi, can someone tell me what mistake I'm making in the following code? To be precise, in the line `printf "%d duplicate records written\n",$name,$para,$out_filename;`. I get these errors:

Argument "sdAb_1193_LgLlama" isn't numeric in printf at ../duplicate.pl line 35, <$PROTFILE> chunk 165.
Redundant argument in printf at ../duplicate.pl line 35, <$PROTFILE> chunk 1164.

#!/usr/bin/perl
# cleanup.pl
use strict;
use warnings;

print 'Enter protein sequence filename: ';
chomp( my $prot_filename = <STDIN> );

open my $PROTFILE, '<', $prot_filename
  or die "Cannot open '$prot_filename' because: $!";

my $out_filename = 'duplicates_entries_in_'.$prot_filename;
open my $OUTFILE, '>', $out_filename
  or die "Cannot open '$out_filename' because: $!";

$/ = ''; # Set paragraph mode

my $name;
my %fasta_seen;

FASTA_RECORD: while ( my $para = <$PROTFILE> ) {

  # Remove fasta header line
  if ( $para =~ s/^>(.*)//m ){
    $name = $1;
  };

  # Remove comment line(s)
  $para =~ s/^\s*#.*//mg;

  # Trim trailing white space
  $para =~ s/\s+$//;

  # next FASTA_RECORD if $fasta_seen{ $para }++;
  if ( $fasta_seen{ $para }++ ){
    printf "%d duplicate records written\n",$name,$para,$out_filename;
    next FASTA_RECORD;
  }
}
print "\n";

2019-04-07 Athanasius added code tags and removed break tags within code
Re^20: how to read input from a file, one section at a time? by poj (Abbot) on Apr 03, 2019 at 17:08 UTC
`printf "%d duplicate records written\n",$name,$para,$out_filename;`

See printf and sprintf. The format string needs one conversion for each argument: %s for strings, %d for integers.

while ( my $para = <$PROTFILE> ) {

  # Remove fasta header line
  if ( $para =~ s/^>(.*)//m ){
    $name = $1;
  };

  # Remove comment line(s)
  $para =~ s/^\s*#.*//mg;

  # Trim trailing white space
  $para =~ s/\s+$//;

  # next FASTA_RECORD if $fasta_seen{ $para }++;
  if ( $fasta_seen{ $para }++ ){
    printf "duplicate record %s %s \nwritten to %s\n",$name,$para,$out_filename;
    print $OUTFILE '>'.$name.$para."\n\n";
  }
}

poj
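As a small standalone illustration of the point about format conversions, the sketch below builds a message with sprintf using one conversion per argument. The values of $count and $out_filename are hypothetical; only the record name is taken from the warning quoted in the question.

```perl
use strict;
use warnings;

# Hypothetical values standing in for the variables in the script
my $name         = 'sdAb_1193_LgLlama';
my $out_filename = 'duplicates.txt';
my $count        = 3;

# One conversion per argument: %d for the integer, %s for each string.
# Passing $name to %d is what triggers "isn't numeric", and passing more
# arguments than conversions is what triggers "Redundant argument".
my $message = sprintf "%d duplicate record(s) for %s written to %s",
    $count, $name, $out_filename;

print "$message\n";
```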
Re^21: how to read input from a file, one section at a time? by davi54 (Sexton) on Apr 03, 2019 at 17:24 UTC
Ohh.. okay.. still learning.. Thanks.. :)