Re^9: how to read input from a file, one section at a time?

From poj's code:

my $name;
while ( my $para = <$PROTFILE> ) {
    # Remove fasta header line
    if ( $para =~ s/^>(.*)//m ){
      $name = $1;
    };
    ...
}

A quick and dirty and UNTESTED modification to do what I think you want:

my $name;
my %name_seen;  # fasta headers seen so far

FASTA_RECORD:
while ( my $para = <$PROTFILE> ) {
    # Remove fasta header line
    if ( $para =~ s/^>(.*)//m ){
      $name = $1;
      next FASTA_RECORD if $name_seen{ $name }++;
    };
    ...
}

Warning: The requirement to "... get rid of duplicate entries ..." is ambiguous. If more than one entry has the same header (i.e., $name), which one counts as the duplicate (or which ones, if there are more than two)? The first? The last? The code modification above keeps the first entry with a given $name and ignores every later one. It might also be wise to trim leading and trailing whitespace from $name before any further processing (also untested):
$name = $1;
$name =~ s{ \A \s+ | \s+ \z }{}xmsg;
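
To make the first-versus-last question concrete, here is a rough sketch of both policies. The helper names trim_name and dedupe_by_name are made up for illustration, and the sketch assumes that records are separated by blank lines and can be read in paragraph mode, which may not match your actual file.

```perl
use strict;
use warnings;

# Hypothetical helper: strip leading/trailing whitespace from a header.
sub trim_name {
    my ($name) = @_;
    $name =~ s{ \A \s+ | \s+ \z }{}xmsg;
    return $name;
}

# Hypothetical helper: keep one record per header name.
# $keep is 'first' or 'last'; output order follows first appearance.
# Returns a list of [ name, sequence ] pairs.
sub dedupe_by_name {
    my ($fh, $keep) = @_;
    local $/ = "";    # paragraph mode: ASSUMES blank-line-separated records
    my (%record_for, @order);
    while ( my $para = <$fh> ) {
        next unless $para =~ s/^>(.*)//m;    # skip paragraphs with no header
        my $name = trim_name($1);
        my $seen = exists $record_for{$name};
        push @order, $name unless $seen;
        next if $seen && $keep eq 'first';   # 'first': ignore later entries
        $para =~ s/\s+//g;                   # what is left is the sequence
        $record_for{$name} = $para;          # 'last': later entries overwrite
    }
    return map { [ $_, $record_for{$_} ] } @order;
}

# Made-up sample data; the header "sp|A" appears twice.
my $fasta = ">  sp|A  \nMKV\nLLA\n\n>sp|B\nGGH\n\n>sp|A\nTTT\n";
for my $policy (qw(first last)) {
    open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";
    print "keep $policy:\n";
    print "  $_->[0]\t$_->[1]\n" for dedupe_by_name($fh, $policy);
    close $fh;
}
```

Because the name is trimmed before it is used as a hash key, ">  sp|A  " and ">sp|A" count as the same header here. Whether that is what you want depends on your data.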

Give a man a fish: <%-{-{-{-<

Re^10: how to read input from a file, one section at a time? by davi54 (Sexton) on Apr 02, 2019 at 15:32 UTC
Hi, my apologies for not being clear. Multiple proteins can have different headers but identical sequence information. When I say duplicate entries, I mean the actual sequence, not the header. I want the script to read the input file, identify whether more than one entry has the same sequence information, and print those entries. Does that help? Again, sorry for the confusion and thank you for your help.
Re^11: how to read input from a file, one section at a time? by poj (Abbot) on Apr 02, 2019 at 15:43 UTC
Try

my %fasta_seen;

FASTA_RECORD:
while ( my $para = <$PROTFILE> ) {
    # Remove fasta header line
    if ( $para =~ s/^>(.*)//m ){
      $name = $1;
    };
    # Remove comment line(s)
    $para =~ s/^\s*#.*//mg;
    next FASTA_RECORD if $fasta_seen{ $para }++;
    …

This may not be a sensible solution if your sequences are very long, in which case consider using a message digest such as Digest::MD5.

poj
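
A minimal sketch of the digest idea, untested against real data: key the hash on an MD5 digest of the sequence instead of the sequence itself, and remember every header that shares a digest so the groups can be reported. The helper name group_by_sequence is hypothetical, and paragraph-mode reading (blank-line-separated records) is an assumption about the input.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: group header names by the MD5 digest of their
# sequence. Returns a list of array refs, one per distinct sequence,
# in order of first appearance.
sub group_by_sequence {
    my ($fh) = @_;
    local $/ = "";    # paragraph mode: ASSUMES blank-line-separated records
    my (%names_for, @order);
    while ( my $para = <$fh> ) {
        next unless $para =~ s/^>(.*)//m;    # remove and capture the header
        my $name = $1;
        $para =~ s/^\s*#.*//mg;              # remove comment line(s)
        $para =~ s/\s+//g;                   # join the sequence lines
        my $key = md5_hex($para);            # fixed-size key, however long the sequence
        push @order, $key unless $names_for{$key};
        push @{ $names_for{$key} }, $name;
    }
    return map { $names_for{$_} } @order;
}

# Made-up sample data: P1 and P3 carry the same sequence, wrapped differently.
my $fasta = ">P1\nMKV\nLLA\n\n>P2\nGGH\n\n>P3\nMKVLLA\n";
open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";
for my $group ( group_by_sequence($fh) ) {
    next unless @$group > 1;                 # report only shared sequences
    print "Same sequence: @$group\n";
}
close $fh;
```

Two different sequences could in principle produce the same MD5 digest. If that matters, compare the actual sequences within a group before treating them as duplicates.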
Re^12: how to read input from a file, one section at a time? by davi54 (Sexton) on Apr 02, 2019 at 16:02 UTC
And how do I print the duplicate entries?
Re^13: how to read input from a file, one section at a time? by poj (Abbot) on Apr 02, 2019 at 16:18 UTC
Re^14: how to read input from a file, one section at a time? by davi54 (Sexton) on Apr 02, 2019 at 17:40 UTC
Re^14: how to read input from a file, one section at a time? by davi54 (Sexton) on Apr 02, 2019 at 16:38 UTC