Beefy Boxes and Bandwidth Generously Provided by pair Networks
go ahead... be a heretic
 
PerlMonks  

Re^16: how to read input from a file, one section at a time?

by poj (Abbot)
on Apr 02, 2019 at 18:05 UTC ( [id://1232038]=note: print w/replies, xml ) Need Help??


in reply to Re^15: how to read input from a file, one section at a time?
in thread how to read input from a file, one section at a time?

I suspect the problem is trailing whitespace (carriage returns, line feeds, spaces etc) on your sequences

# Remove comment line(s) $para =~ s/^\s*#.*//mg; # Trim trailing white space $para =~ s/\s+$//; # add this ..
poj

Replies are listed 'Best First'.
Re^17: how to read input from a file, one section at a time?
by davi54 (Sexton) on Apr 02, 2019 at 18:38 UTC
    Hi,
    That fixed the issue. Thank you so much.
    I have a related question. Is there a way to remove all the duplicate entries and/or replace them with a newline character? Duplicates are skewing my data.
    Also, there are some entries which have incomplete sequences and have the word "(Fragments)" in the header. Can you suggest a way to delete all those entries with the word (Fragment) in their header?
    Thanks again for all your help Poj. I will be presenting this data at a conference in Winnipeg next month. I would be happy to acknowledge your help in this work. Please let me know if you want me to do that.

      I would suggest developing a separate program to clean up your data and leave the working program alone. It's always preferable to have 'clean data' to process and 2 steps I think are easier to debug. For example

      #!/usr/bin/perl # cleanup.pl use strict; use warnings; print 'PLEASE ENTER THE FILENAME OF THE PROTEIN SEQUENCE: '; chomp( my $prot_filename = <STDIN> ); open my $PROTFILE, '<', $prot_filename or die "Cannot open '$prot_filename' because: $!"; my $out_filename = 'cleaned_'.$prot_filename; open my $OUTFILE, '>', $out_filename or die "Cannot open '$out_filename' because: $!"; $/ = ''; # Set paragraph mode my %fasta_seen; # sequences seen so far my $header; my $count_in; my $count_out; while ( my $record = <$PROTFILE> ) { ++$count_in; if ( $record =~ s/^>(.*)//m ){ $header = $1; # skip fragments next if $header =~ /\(Fragments\)/i; }; # Remove comment line(s) $record =~ s/^\s*#.*//mg; # trim trailing spaces $record =~ s/\s+$//; # skip duplicated if ( $fasta_seen{ $record }++ ){ print $OUTFILE "\n"; } else { print $OUTFILE $header.$record."\n\n"; ++$count_out; } } close $OUTFILE; close $PROTFILE; printf "%d records read from %s\n",$count_in,$prot_filename; printf "%d records written to %s\n",$count_out,$out_filename;

      I'm sure acknowledge of perlmonks.org would be appreciated by the community here.

      poj
        Thank you so much Poj. Will do.
        :)
        Hi, can someone tell me what's the mistake I'm doing in the following code? To be precise, in line- 'printf "%d duplicate records written\n",$name,$para,$out_filename;'?
        I get an error:
        Argument "sdAb_1193_LgLlama" isn't numeric in printf at ../duplicate.p +l line 35, <$PROTFILE> chunk 165. 'Redundant argument in printf at ../duplicate.pl line 35, <$PROTFILE> +chunk 1164.'
        #!/usr/bin/perl # cleanup.pl use strict; use warnings; print 'Enter protein sequence filename: '; chomp( my $prot_filename = <STDIN> ); open my $PROTFILE, '<', $prot_filename or die "Cannot open '$prot_filename' because: $!"; my $out_filename = 'duplicates_entries_in_'.$prot_filename; open my $OUTFILE, '>', $out_filename or die "Cannot open '$out_filename' because: $!"; $/ = ''; # Set paragraph mode my $name; my %fasta_seen; FASTA_RECORD: while ( my $para = <$PROTFILE> ) { # Remove fasta header line if ( $para =~ s/^>(.*)//m ){ $name = $1; }; # Remove comment line(s) $para =~ s/^\s*#.*//mg; # Trim trailing white space $para =~ s/\s+$//; # next FASTA_RECORD if $fasta_seen{ $para }++; if ( $fasta_seen{ $para }++ ){ printf "%d duplicate records written\n",$name,$para,$out_filename; next FASTA_RECORD; } } print "\n";

        2019-04-07 Athanasius added code tags and removed break tags within code

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1232038]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others having a coffee break in the Monastery: (2)
As of 2024-04-25 06:20 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found