comment on

Hi all, I am trying to parse and process a fasta file. Basically I need to 'simplify' the headers. I want to extract the GI (or protein_id, if there isn't a GI) from the current header, and then create a new file where just the GI or protein_id is the header. It is mostly working fine, except that I am not capturing the "protein_id". I guess I am having problems with my regex

here are some example headers

>1001585.MDS_0001 protein_id="YP_004377784.1" product="chromosomal rep
+lication initiation protein" GI="330500915" GeneID="10459818"
>1001585.MDS_0002 protein_id="YP_004377785.1" product="DNA polymerase 
+III subunit beta" GI="330500916" GeneID="10454784"
>1001585.MDS_0003 protein_id="YP_004377786.1" product="recombination p
+rotein F" GI="330500917" GeneID="10454785"
>1001585.MDS_0004 protein_id="YP_004377787.1" product="DNA gyrase subu
+nit B" GI="330500918" GeneID="10454786"
>1001585.MDS_0005 protein_id="YP_004377788.1" GI="330500919" GeneID="1
+0454787"
>1001585.MDS_0006 protein_id="YP_004377789.1" GI="330500920" GeneID="1
+0454788"
>1001585.MDS_0007 protein_id="YP_004377790.1" GI="330500921" GeneID="1
+0454789"
>1001585.MDS_0008 protein_id="YP_004377791.1" GI="330500922" GeneID="1
+0454790"
>1001585.MDS_0009 protein_id="YP_004377792.1" product="ABC transporter
+ permease" GI="330500923" GeneID="10454791"
>1001585.MDS_0010 protein_id="YP_004377793.1" product="ABC transporter
+ ATP-binding protein" GI="330500924" GeneID="10454792"
>245014.CK3_35030 protein_id="CBL42879.1" product="Predicted transcrip
+tion factor, homolog of eukaryotic MBF1"
>245014.CK3_35040 protein_id="CBL42880.1" product="Bacterial protein o
+f unknown function (DUF961)."
[download]

here is a portion of the code I am using to extract either the GI or if there isn't a GI then the "protein_id", and use that as the new header. It seems to work for getting the GI, but it isn't recognizing any of the headers with just a protein_id

use strict;
use warnings;
use Data::Dumper;

my $in_fasta=$ARGV[0];
open(IN,$in_fasta) or die "cannot open $in_fasta";
my $out_fasta="$in_fasta.gi_header";
open(OUT,">",$out_fasta);
my $err_file="$in_fasta.headers_wNOgi";
open(ERR,">",$err_file);

my $header_count=0;
my $gi_count=0; my $protid_count=0;
my %giTogeneid;
while (my $line=<IN>){
                if ($line =~ /^>/ && $line =~ /^>.+GI="(\w+)"/){
                        my $gi=$1;
                        $gi_count ++;
                        my @header_columns=split(/\s+/,$line);
                        my $SM_geneid=$header_columns[0];
                        $SM_geneid=~s/^>//g;
                        $giTogeneid{$gi}=$SM_geneid;
                        print OUT ">gi$gi\n";
                }
                elsif ($line =~ /^>/ &&  $line !~ /^>.+GI="(\w+)"/ && 
+ $line !~ /^>.+protein_id="(\w+)"/){
                        print ERR "$line\n";
                }
                elsif ($line =~ /^>/ && $line =~ /^>.+protein_id="(\w+
+)"/){
                        my $protid=$1;
                        $protid_count++;
                        print OUT ">$protid\n";
                        my @header_columns=split(/\s+/,$line);
                        my $SM_geneid=$header_columns[0];
                        $SM_geneid=~s/^>//g;
                        $giTogeneid{$protid}=$SM_geneid;
                }

                elsif ($line !~ /^>/){
                        print OUT $line;
                }
}
close(IN);
print "number of headers seen\t$header_count\n";
print "number of gis seen\t$gi_count\n";
print "number of protids seen\t$protid_count\n";
close(OUT);
[download]

In reply to help with regex by AWallBuilder

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.