Re: Converting Uniprot File to a Fasta File in Perl

Your regex uses the alternation operator | to match one of several patterns, so on any given line only one of the five capture groups captures something. The others are undef, which is why you're getting that warning. Here's one way to get closer to what you want:

while (defined( my $line = <UNIPROT> )) {
    if ($line =~ /^(AC|OS|OX|ID|GN)\s+(.*)/) {
        print "<$1> $2\n";
    }
}

__END__

<ID> ARF1_PLAFA              Reviewed;         181 AA.
<AC> Q94650; O02502; O02593;
<GN> Name=ARF1; Synonyms=ARF, PLARF;
<OS> Plasmodium falciparum.
<OX> NCBI_TaxID=5833;
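To see the undef behaviour in isolation, here is a minimal sketch. The pattern is hypothetical (I'm assuming the original regex gave each alternative its own capture group, and I've cut it down to three alternatives), and `defined_groups` is a helper name made up for this example.

```perl
use strict;
use warnings;

# Hypothetical pattern: every alternative carries its own capture group.
my $re = qr/^ID\s+(.*)|^AC\s+(.*)|^OS\s+(.*)/;

# Return the 1-based numbers of the capture groups that actually matched.
sub defined_groups {
    my ($line) = @_;
    # In list context a successful match returns all groups; groups that
    # did not take part in the match come back as undef.
    my @groups = $line =~ $re;
    return map { defined $groups[$_] ? $_ + 1 : () } 0 .. $#groups;
}

my @hit = defined_groups("AC   Q94650; O02502; O02593;");
print "defined groups: @hit\n";
```

Only the group belonging to the alternative that matched is defined, so printing all of $1, $2, $3 on every line is what triggers the "uninitialized value" warning. A single group in front, as in `(AC|OS|OX|ID|GN)`, avoids the problem.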

Since the file is processed line-by-line, I've renamed your variable from $lines to $line. If I were writing this code, here's how I might have written it:

#!/usr/bin/env perl
use warnings;
use strict;

my $filename = "uniprotfile";
open my $ufh, "<", $filename
    or die "open $filename: $!";

while (<$ufh>) {
    chomp;
    my ($id,$content) = /^(AC|OS|OX|ID|GN)\s+(.*)/
        or next;
    if ($id eq 'AC') {
        my ($first) = $content=~/^([^;]+)/
            or die "couldn't parse '$content'";
        print "AC: $first\n";
        ...
    }
    elsif ($id eq 'OS') {
        ...
    }
    ...
}
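The `my (...) = /.../ or ...` idiom used above may look odd at first. Here is a small sketch of why it works; `first_accession` is a made-up helper name, and the sample string is the AC line from the output shown earlier.

```perl
use strict;
use warnings;

# Return the first accession from the content of an AC line, or undef
# if the content cannot be parsed.
sub first_accession {
    my ($content) = @_;
    # A list assignment evaluated in scalar context yields the number of
    # values on its right-hand side. A failed match returns an empty list,
    # so the assignment is false and the "or" branch runs.
    my ($first) = $content =~ /^([^;]+)/
        or return;
    return $first;
}

my $ok  = first_accession("Q94650; O02502; O02593;");
my $bad = first_accession(";");
print defined $ok  ? "matched: $ok\n"  : "no match\n";
print defined $bad ? "matched: $bad\n" : "no match\n";
```

Because `or` binds more loosely than `=`, the match and the assignment happen first and the fallback (`next`, `die`, or `return`) only fires when nothing matched.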

Re^2: Converting Uniprot File to a Fasta File in Perl by pearllearner315 (Acolyte) on Feb 27, 2017 at 19:06 UTC
For the last group, which belongs to the "SQ" line, how would I capture the multi-line sequence into a variable? I thought of using `$line =~ /^SQ\s+(.*)/` again, but that regex would also capture the whitespace between the sequence blocks.
Re^3: Converting Uniprot File to a Fasta File in Perl by poj (Abbot) on Feb 27, 2017 at 19:43 UTC
Use a flag to capture the multiple lines. Remove the spaces with a regex.

use Data::Dumper;    # provides Dumper

my %hash = ();
my $seq;
my $flag = 0;
while (<$ufh>) {
    chomp;
    if ( /^(AC|OS|OX|ID|GN|SQ)\s+(.*)/ ) {
        print "<$1> <$2>\n";
        $hash{$1} = $2;
        $flag = 1 if /SQ/;
    }
    elsif (/^K\s+/) {
        $flag = 0;
    }
    elsif ($flag == 1) {
        s/ +//g;    # remove spaces
        $seq .= $_."\n"
    }
}
print Dumper \%hash;
print $seq;

poj
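To make the flag idea concrete, here is a self-contained sketch under a few assumptions. The record is read from an in-memory string, and the SQ line and residues are made-up placeholders, not the real ARF1_PLAFA sequence. This version resets the flag on the `//` line that terminates a UniProt flat-file record, which is a different test from the one in the code above. The variable names and the two-field FASTA header are my own choices for illustration.

```perl
use strict;
use warnings;

# Illustrative record fragment; the SQ line and residues are placeholders.
my $record = <<'END_RECORD';
ID   ARF1_PLAFA              Reviewed;         181 AA.
AC   Q94650; O02502; O02593;
OS   Plasmodium falciparum.
SQ   SEQUENCE   20 AA;
     AAAAAAAAAA CCCCCCCCCC
//
END_RECORD

open my $fh, '<', \$record or die "open in-memory record: $!";

my %field;
my $seq    = '';
my $in_seq = 0;    # the flag: true while we are between the SQ line and //

while (my $line = <$fh>) {
    chomp $line;
    if ($line =~ m{^//}) {                      # end of record
        $in_seq = 0;
    }
    elsif ($line =~ /^(ID|AC|OS|SQ)\s+(.*)/) {  # a header line we care about
        $field{$1} = $2;
        $in_seq = 1 if $1 eq 'SQ';              # residues start on the next line
    }
    elsif ($in_seq) {                           # a sequence line
        $line =~ s/\s+//g;                      # drop all whitespace
        $seq .= $line;
    }
}
close $fh;

# Build a FASTA entry: header line, then the sequence in 60-column rows.
my ($acc) = $field{AC} =~ /^([^;]+)/;
(my $organism = $field{OS}) =~ s/\.$//;
my $fasta = ">$acc | $organism\n";
$fasta .= "$1\n" while $seq =~ /(.{1,60})/g;
print $fasta;
```

The flag is just a piece of state that remembers which part of the record the loop is in: lines are only appended to the sequence while it is set. `Dumper` comes from the core Data::Dumper module and prints a data structure such as `%hash` in readable form, which is handy for checking what was captured.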
Re^4: Converting Uniprot File to a Fasta File in Perl by pearllearner315 (Acolyte) on Feb 27, 2017 at 23:38 UTC
Hi poj! Thank you so much for your help! Could you explain what the flag is actually doing? I'm reading the code and having difficulty understanding it. Also, what is Dumper? I'm trying to format my code so that I get this type of output once I parse the headers and sequence:

>NM_012514 | Rattus norvegicus | breast cancer 1 (Brca1) | mRNA
CGCTGGTGCAACTCGAAGACCTATCTCCTTCCCGGGGGGGCTTCTCCGGCATTTAGGCCT
CGGCGTTTGGAAGTACGGAGGTTTTTCTCGGAAGAAAGTTCACTGGAAGTGGAAGAAATG
GATTTATCTGCTGTTCGAATTCAAGAAGTACAAAATGTCCTTCATGCTATGCAGAAAATC
TTGGAGTGTCCAATCTGTTTGGAACTGATCAAAGAACCGGTTTCCACACAGTGCGACCAC
ATATTTTGCAAATTTTGTATGCTGAAACTCCTTAACCAGAAGAAAGGACCTTCCCAGTGT
CCTTTGTGTAAGAATGAGATAACCAAAAGGAGCCTACAAGGAAGTGCAAGG
Re^5: Converting Uniprot File to a Fasta File in Perl by huck (Prior) on Feb 27, 2017 at 23:55 UTC
Re^5: Converting Uniprot File to a Fasta File in Perl by AnomalousMonk (Archbishop) on Feb 28, 2017 at 01:52 UTC