in reply to replace FASTA sequence headers

Hello onlyIDleft,

There are a number of problems with your code. For example, lines such as

%head_seqs = @headers_seqs;

make no sense. You need to read up on perl data structures to understand arrays and hashes.

I think the following script does what you need:

use strict; use warnings; open(my $in1, '<', $ARGV[0]) or die "Can't read from multifasta file '$ARGV[0]': $!"; my @multifasta = <$in1>; close $in1; open(my $in2, '<', $ARGV[1]) or die "Can't read from header replacement file '$ARGV[1]': $!"; my %headers; while (<$in2>) { my ($orig, $new) = split; warn "Duplicate replacement: $orig --> $new\n" if exists $headers{ +$orig}; $headers{$orig} = $new; } close $in2; my $destination = $ARGV[0] . '_headers-replaced.fasta'; open (my $out, '>', $destination) or die "Can't write to file '$destination': $!"; foreach my $line (@multifasta) { if ($line =~ /^\>/) # it is a header { foreach my $key (keys %headers) { if ($line =~ /$key/) { $line =~ s/$key/$headers{$key}/; last; } } } print $out $line; } close $out;

HTH,

Athanasius <°(((><contra mundum

Replies are listed 'Best First'.
Re^2: replace FASTA sequence headers
by jwkrahn (Abbot) on Jun 10, 2012 at 20:38 UTC
    lines such as
    %head_seqs = @headers_seqs;
    make no sense.

    It makes perfect sense.    Copying values between lists, arrays and hashes is a normal idiom in Perl.

    if ($line =~ /$key/) { $line =~ s/$key/$headers{$key}/; last; }

    Your $key values may contain regular expression meta-characters so you should use quotemeta on them.

    Your loop exits on the first match but not necessarily the correct match.    You should anchor the patterns.

      Thanks for your replies jwkrahn.

      Actually, EACH of my values in the hash of new / old headers have # character in them, some of them have other special characters such as _ etc

      I was under the impression that because my keys are all letter/number strings, that there should not be any problem with having special characters in the values, BUT not in the keys themselves

      Am I missing something here about needing "quotemeta" for the hash's values as well?

      I hope I am explaining the header pairs properly

      Here is an example below:

      cluster14621 \t Helitron#element_1

        Am I missing something here about needing "quotemeta" for the hash's values as well?

        I was suggesting quotemeta not because of the hash, but because of their use in a regular expression.

Re^2: replace FASTA sequence headers
by Anonymous Monk on Jun 10, 2012 at 18:14 UTC

    Thanks a lot Athanasius

    There are so many ideas that are new to me here, from just one script of yours! Thanks :)