Converting Uniprot File to a Fasta File in Perl

pearllearner315 has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks, I have an uniprot file that I need to run through and parse certain lines. Those certain lines have values that I need to construct a fasta format file. Here is an example uniprot file:


ID   ARF1_PLAFA              Reviewed;         181 AA.
AC   Q94650; O02502; O02593;
DT   15-JUL-1998, integrated into UniProtKB/Swiss-Prot.
DT   23-JAN-2007, sequence version 3.
DT   25-NOV-2008, entry version 52.
DE   RecName: Full=ADP-ribosylation factor 1;
GN   Name=ARF1; Synonyms=ARF, PLARF;
OS   Plasmodium falciparum.
OC   Eukaryota; Alveolata; Apicomplexa; Aconoidasida; Haemosporida;
OC   Plasmodium; Plasmodium (Laverania).
OX   NCBI_TaxID=5833;
RN   [1]
RP   NUCLEOTIDE SEQUENCE [GENOMIC DNA].
RC   STRAIN=T9/96; TISSUE=Blood;
RX   MEDLINE=97112480; PubMed=8954160;
RX   DOI=10.1111/j.1432-1033.1996.0104r.x;
RA   Stafford W.H., Stockley R.W., Ludbrook S.B., Holder A.A.;
RT   "Isolation, expression and characterization of the gene for an AD
+P-
RT   ribosylation factor from the human malaria parasite, Plasmodium
RT   falciparum.";
RL   Eur. J. Biochem. 242:104-113(1996).
RN   [2]
RP   NUCLEOTIDE SEQUENCE [MRNA].
RX   MEDLINE=97237566; PubMed=9084044; DOI=10.1016/S0166-6851(96)02803
+-4;
RA   Truong R.M., Francis S.E., Chakrabarti D., Goldberg D.E.;
RT   "Cloning and characterization of Plasmodium falciparum ADP-
RT   ribosylation factor and factor-like genes.";
RL   Mol. Biochem. Parasitol. 84:247-253(1997).
CC   -!- FUNCTION: GTP-binding protein that functions as an allosteric
CC       activator of the cholera toxin catalytic subunit, an ADP-
CC       ribosyltransferase. Involved in protein trafficking; may modu
+late
CC       vesicle budding and uncoating within the Golgi apparatus (By
CC       similarity).
CC   -!- SUBCELLULAR LOCATION: Golgi apparatus (By similarity).
CC   -!- SIMILARITY: Belongs to the small GTPase superfamily. Arf fami
+ly.
CC   -----------------------------------------------------------------
+------
CC   Copyrighted by the UniProt Consortium, see http://www.uniprot.org
+/terms
CC   Distributed under the Creative Commons Attribution-NoDerivs Licen
+se
CC   -----------------------------------------------------------------
+------
DR   EMBL; Z80359; CAB02498.1; -; Genomic_DNA.
DR   EMBL; U57370; AAB63304.1; -; mRNA.
DR   HSSP; P32889; 1RRF.
DR   SMR; Q94650; 6-179.
DR   GO; GO:0005794; C:Golgi apparatus; IEA:UniProtKB-KW.
DR   GO; GO:0005525; F:GTP binding; IEA:InterPro.
DR   GO; GO:0015031; P:protein transport; IEA:UniProtKB-KW.
DR   GO; GO:0007264; P:small GTPase mediated signal transduction; IEA:
+InterPro.
DR   GO; GO:0016192; P:vesicle-mediated transport; IEA:UniProtKB-KW.
DR   InterPro; IPR006688; ARF.
DR   InterPro; IPR006689; ARF/SAR.
DR   InterPro; IPR001806; Ras_trnsfrmng.
DR   InterPro; IPR005225; Small_GTP_bd.
DR   PANTHER; PTHR11711; ARF/SAR; 1.
DR   Pfam; PF00025; Arf; 1.
DR   PRINTS; PR00449; RASTRNSFRMNG.
DR   PRINTS; PR00328; SAR1GTPBP.
DR   SMART; SM00177; ARF; 1.
DR   TIGRFAMs; TIGR00231; small_GTP; 1.
DR   PROSITE; PS01019; ARF; 1.
PE   2: Evidence at transcript level;
KW   ER-Golgi transport; Golgi apparatus; GTP-binding; Lipoprotein;
KW   Myristate; Nucleotide-binding; Protein transport; Transport.
FT   INIT_MET      1      1       Removed (Potential).
FT   CHAIN         2    181       ADP-ribosylation factor 1.
FT                                /FTId=PRO_0000207447.
FT   NP_BIND      24     31       GTP (By similarity).
FT   NP_BIND      67     71       GTP (By similarity).
FT   NP_BIND     126    129       GTP (By similarity).
FT   LIPID         2      2       N-myristoyl glycine (Potential).
SQ   SEQUENCE   181 AA;  20912 MW;  18013B069BEA2413 CRC64;
 MGLYVSRLFN RLFQKKDVRI LMVGLDAAGK TTILYKVKLG EVVTTIPTIG FNVETVEFRN
 ISFTVWDVGG QDKIRPLWRH YYSNTDGLIF VVDSNDRERI DDAREELHRM INEEELKDAI
 ILVFANKQDL PNAMSAAEVT EKLHLNTIRE RNWFIQSTCA TRGDGLYEGF DWLTTHLNNA
K
[download]

I need to use regex and select the values of the "AC" line, "OS" line, "OX" line, "ID" line, "GN" line, "SQ" line and construct the fasta format which should look like this. The first line of the fasta format consists of the values from the line headings parsed from the uniprot file and are separated by "|". Here is an example of a fasta file:

 
>NM_012514 | Rattus norvegicus | breast cancer 1 (Brca1) | mRNA
 CGCTGGTGCAACTCGAAGACCTATCTCCTTCCCGGGGGGGCTTCTCCGGCATTTAGGCCT
 CGGCGTTTGGAAGTACGGAGGTTTTTCTCGGAAGAAAGTTCACTGGAAGTGGAAGAAATG
 GATTTATCTGCTGTTCGAATTCAAGAAGTACAAAATGTCCTTCATGCTATGCAGAAAATC
 TTGGAGTGTCCAATCTGTTTGGAACTGATCAAAGAACCGGTTTCCACACAGTGCGACCAC
 ATATTTTGCAAATTTTGTATGCTGAAACTCCTTAACCAGAAGAAAGGACCTTCCCAGTGT
 CCTTTGTGTAAGAATGAGATAACCAAAAGGAGCCTACAAGGAAGTGCAAGG
[download]

some code I have so far:

 
#!/usr/bin/perl
use warnings;
use strict;

unless (open(UNIPROT, "<", "uniprotfile")) {
 die "Unable to open uniprot file", $!;
}

while (<UNIPROT>) {
 my $lines = $_; 
 if ($lines =~ /^AC\s+(.*)\;|^OS\s+(.*)|^OX\s+(.*)|^ID\s+(.*)|^GN\s+(.
+*)/)
    print  $1, $2, $3, $4, $5, "\n"; 
 }
[download]

I just printed $1, $2, $3, $4, and $5 just to see if i was able to capture the values that the regex matched. However I keep getting this output when I try printing:


Use of uninitialized value $1 in print at ./file.pl line 11, <UNIPROT>
+ line 1.
Use of uninitialized value $2 in print at ./file.pl line 11, <UNIPROT>
+ line 1.
Use of uninitialized value $3 in print at ./file.pl line 11, <UNIPROT>
+ line 1.
Use of uninitialized value $5 in print at ./file.pl line 11, <UNIPROT>
+ line 1.
CERU_HUMAN     STANDARD;      PRT;  1065 AA.
Use of uninitialized value $2 in print at ./file.pl line 11, <UNIPROT>
+ line 2.
Use of uninitialized value $3 in print at ./file.pl line 11, <UNIPROT>
+ line 2.
Use of uninitialized value $4 in print at ./file.pl line 11, <UNIPROT>
+ line 2.
Use of uninitialized value $5 in print at ./file.pl line 11, <UNIPROT>
+ line 2.
P00450; Q14063
Use of uninitialized value $1 in print at ./file.pl line 11, <UNIPROT>
+ line 7.
Use of uninitialized value $2 in print at ./file.pl line 11, <UNIPROT>
+ line 7.
Use of uninitialized value $3 in print at ./file.pl line 11, <UNIPROT>
+ line 7.
Use of uninitialized value $4 in print at ./file.pl line 11, <UNIPROT>
+ line 7.
CP.
Use of uninitialized value $1 in print at ./file.pl line 11, <UNIPROT>
+ line 8.
Use of uninitialized value $3 in print at ./file.pl line 11, <UNIPROT>
+ line 8.
Use of uninitialized value $4 in print at ./file.pl line 11, <UNIPROT>
+ line 8.
Use of uninitialized value $5 in print at ./file.pl line 11, <UNIPROT>
+ line 8.
Homo sapiens (Human).
Use of uninitialized value $1 in print at ./file.pl line 11, <UNIPROT>
+ line 11.
Use of uninitialized value $2 in print at ./file.pl line 11, <UNIPROT>
+ line 11.
Use of uninitialized value $4 in print at ./file.pl line 11, <UNIPROT>
+ line 11.
Use of uninitialized value $5 in print at ./file.pl line 11, <UNIPROT>
+ line 11.
NCBI_TaxID=9606;
[download]

I'm not exactly sure where i'm making the mistake..does the "or |" part mess up the loop? Thank you for your help!

Comment on Converting Uniprot File to a Fasta File in Perl Select or Download Code

Replies are listed 'Best First'.
Re: Converting Uniprot File to a Fasta File in Perl by haukex (Archbishop) on Feb 27, 2017 at 16:24 UTC
Your regex uses the alternation operator `\|` to match one of multiple patterns, so for each line, only one of the five capture groups will capture something, the others will be `undef`, which is why you're getting that warning. Here's one way to get closer to what you want: `while (defined( my $line = <UNIPROT> )) { if ($line =~ /^(AC\|OS\|OX\|ID\|GN)\s+(.)/) { print "<$1> $2\n"; } } __END__ <ID> ARF1_PLAFA Reviewed; 181 AA. <AC> Q94650; O02502; O02593; <GN> Name=ARF1; Synonyms=ARF, PLARF; <OS> Plasmodium falciparum. <OX> NCBI_TaxID=5833;` [download] Since the file is processed line-by-line, I've renamed your variable from `$lines` to `$line`. If I were writing this code, here's how I might have written it: `#!/usr/bin/env perl use warnings; use strict; my $filename = "uniprotfile"; open my $ufh, "<", $filename or die "open $filename: $!"; while (<$ufh>) { chomp; my ($id,$content) = /^(AC\|OS\|OX\|ID\|GN)\s+(.)/ or next; if ($id eq 'AC') { my ($first) = $content=~/^([^;]+)/ or die "couldn't parse '$content'"; print "AC: $first\n"; ... } elsif ($id eq 'OS') { ... } ... }` [download]	[reply] [d/l] [select]
Re^2: Converting Uniprot File to a Fasta File in Perl by pearllearner315 (Acolyte) on Feb 27, 2017 at 19:06 UTC
for the last group which will belong to the "SQ" line, how would i capture the multi line sequence into a variable? I would think to use `$line =~ /^SQ\s+(.*)/` again but that regex would capture the multiple white spaces in between the sequence.	[reply] [d/l]
Re^3: Converting Uniprot File to a Fasta File in Perl by poj (Abbot) on Feb 27, 2017 at 19:43 UTC
Use a flag to capture the multiple lines. Remove the spaces with a regex. `my %hash=(); my $seq; my $flag = 0; while (<$ufh>) { chomp; if ( /^(AC\|OS\|OX\|ID\|GN\|SQ)\s+(.*)/ ){ print "<$1> <$2>\n"; $hash{$1} = $2; $flag = 1 if /SQ/; } elsif (/^K\s+/){ $flag = 0; } elsif ($flag == 1){ s/ +//g; # remove spaces $seq .= $_."\n" } } print Dumper \%hash; print $seq;` [download] poj	[reply] [d/l]
Re^4: Converting Uniprot File to a Fasta File in Perl by pearllearner315 (Acolyte) on Feb 27, 2017 at 23:38 UTC
Re^5: Converting Uniprot File to a Fasta File in Perl by huck (Prior) on Feb 27, 2017 at 23:55 UTC
Re^5: Converting Uniprot File to a Fasta File in Perl by AnomalousMonk (Archbishop) on Feb 28, 2017 at 01:52 UTC
Re: Converting Uniprot File to a Fasta File in Perl by genio (Beadle) on Feb 27, 2017 at 16:23 UTC
Hi! You have a few things that could be improved in your Perl (don't use bareword file handles) and a few things in your expression as well: `#!/usr/bin/perl use warnings; use strict; open my $fh, '<', 'uniprotfile' or die "Can't open uniprot file: $!"; while (defined(my $line = <$fh>)) { chomp $line; if ($line =~ /^(AC\|OS\|OX\|ID\|GN)\s+(.*)/) { print $1, $2, "\n"; } }` [download] Instead of repeating yourself in the expression, simplify it a bit.	[reply] [d/l]
Re: Converting Uniprot File to a Fasta File in Perl by erix (Prior) on Feb 27, 2017 at 21:42 UTC
Unless this is a programming exercise, it will be easiest to just parse out the accessions (the first one, the 'primary accession number', suffices, the rest are old ones, kept for 'backward compatibility') and use those primary accessions to query the uniprot.org website. To get the fasta for an accession, say 'Q94650', just construct a url like: `http://www.uniprot.org/uniprot/Q94650.fasta` and go get the fasta with wget or indeed perl LWP. It'll return: `>sp\|Q94650\|ARF1_PLAFA ADP-ribosylation factor 1 OS=Plasmodium falcipar +um GN=ARF1 PE=1 SV=3 MGLYVSRLFNRLFQKKDVRILMVGLDAAGKTTILYKVKLGEVVTTIPTIGFNVETVEFRN ISFTVWDVGGQDKIRPLWRHYYSNTDGLIFVVDSNDRERIDDAREELHRMINEEELKDAI ILVFANKQDLPNAMSAAEVTEKLHLNTIRERNWFIQSTCATRGDGLYEGFDWLTTHLNNA K` [download]	[reply] [d/l] [select]
Re: Converting Uniprot File to a Fasta File in Perl by haukex (Archbishop) on Feb 28, 2017 at 10:43 UTC
I was considering posting another example of a parser, but having researched the question a bit more, I can only second what erix wrote! Also, my understanding is that there are already existing modules to handle these things, such as BioPerl, which appears to have multiple existing parsers for this format, such as Bio::SeqIO::swiss. This is my first time trying out BioPerl so probably there's an even better way, but this seems to work: use warnings; use strict; use HTTP::Tiny; use Bio::SeqIO; my $filename = '/tmp/Q94650.txt'; HTTP::Tiny->new->mirror( 'http://www.uniprot.org/uniprot/Q94650.txt', $filename)->{success} or die "Failed to fetch"; my $stream = Bio::SeqIO->new( -file => $filename, -format => 'swiss'); while ( my $seq = $stream->next_seq() ) { # just some examples... print $seq->accession_number, "\n"; print $seq->species->species, "\n"; my ($gene) = $seq->annotation->get_Annotations('gene_name'); print $gene->findval('Name'), "\n"; } [download] Or using the included bp_seqconvert utility: `$ wget -q http://www.uniprot.org/uniprot/Q94650.txt $ bp_seqconvert.pl --from swiss --to fasta <Q94650.txt >ARF1_PLAFA RecName: Full=ADP-ribosylation factor 1; Short=plARF; MGLYVSRLFNRLFQKKDVRILMVGLDAAGKTTILYKVKLGEVVTTIPTIGFNVETVEFRN ISFTVWDVGGQDKIRPLWRHYYSNTDGLIFVVDSNDRERIDDAREELHRMINEEELKDAI ILVFANKQDLPNAMSAAEVTEKLHLNTIRERNWFIQSTCATRGDGLYEGFDWLTTHLNNA K` [download]	[reply] [d/l] [select]
Re: Converting Uniprot File to a Fasta File in Perl by Cow1337killr (Monk) on Feb 28, 2017 at 14:48 UTC
It is considered poor etiquette to cross-post a question without providing a link to the other forum(s). For example, see Cross-posting Policy? I assume this was posted by you, also: https://www.biostars.org/p/238376/ However, if you discover a cure for the common cold, I am sure we will all forgive you.	[reply]


Do you know where your variables are?
	PerlMonks