AWallBuilder has asked for the wisdom of the Perl Monks concerning the following question:
Hi all, I am trying to parse and process a fasta file. Basically I need to 'simplify' the headers. I want to extract the GI (or protein_id, if there isn't a GI) from the current header, and then create a new file where just the GI or protein_id is the header. It is mostly working fine, except that I am not capturing the "protein_id". I guess I am having problems with my regex
here are some example headers
>1001585.MDS_0001 protein_id="YP_004377784.1" product="chromosomal rep +lication initiation protein" GI="330500915" GeneID="10459818" >1001585.MDS_0002 protein_id="YP_004377785.1" product="DNA polymerase +III subunit beta" GI="330500916" GeneID="10454784" >1001585.MDS_0003 protein_id="YP_004377786.1" product="recombination p +rotein F" GI="330500917" GeneID="10454785" >1001585.MDS_0004 protein_id="YP_004377787.1" product="DNA gyrase subu +nit B" GI="330500918" GeneID="10454786" >1001585.MDS_0005 protein_id="YP_004377788.1" GI="330500919" GeneID="1 +0454787" >1001585.MDS_0006 protein_id="YP_004377789.1" GI="330500920" GeneID="1 +0454788" >1001585.MDS_0007 protein_id="YP_004377790.1" GI="330500921" GeneID="1 +0454789" >1001585.MDS_0008 protein_id="YP_004377791.1" GI="330500922" GeneID="1 +0454790" >1001585.MDS_0009 protein_id="YP_004377792.1" product="ABC transporter + permease" GI="330500923" GeneID="10454791" >1001585.MDS_0010 protein_id="YP_004377793.1" product="ABC transporter + ATP-binding protein" GI="330500924" GeneID="10454792" >245014.CK3_35030 protein_id="CBL42879.1" product="Predicted transcrip +tion factor, homolog of eukaryotic MBF1" >245014.CK3_35040 protein_id="CBL42880.1" product="Bacterial protein o +f unknown function (DUF961)."
here is a portion of the code I am using to extract either the GI or if there isn't a GI then the "protein_id", and use that as the new header. It seems to work for getting the GI, but it isn't recognizing any of the headers with just a protein_id
use strict; use warnings; use Data::Dumper; my $in_fasta=$ARGV[0]; open(IN,$in_fasta) or die "cannot open $in_fasta"; my $out_fasta="$in_fasta.gi_header"; open(OUT,">",$out_fasta); my $err_file="$in_fasta.headers_wNOgi"; open(ERR,">",$err_file); my $header_count=0; my $gi_count=0; my $protid_count=0; my %giTogeneid; while (my $line=<IN>){ if ($line =~ /^>/ && $line =~ /^>.+GI="(\w+)"/){ my $gi=$1; $gi_count ++; my @header_columns=split(/\s+/,$line); my $SM_geneid=$header_columns[0]; $SM_geneid=~s/^>//g; $giTogeneid{$gi}=$SM_geneid; print OUT ">gi$gi\n"; } elsif ($line =~ /^>/ && $line !~ /^>.+GI="(\w+)"/ && + $line !~ /^>.+protein_id="(\w+)"/){ print ERR "$line\n"; } elsif ($line =~ /^>/ && $line =~ /^>.+protein_id="(\w+ +)"/){ my $protid=$1; $protid_count++; print OUT ">$protid\n"; my @header_columns=split(/\s+/,$line); my $SM_geneid=$header_columns[0]; $SM_geneid=~s/^>//g; $giTogeneid{$protid}=$SM_geneid; } elsif ($line !~ /^>/){ print OUT $line; } } close(IN); print "number of headers seen\t$header_count\n"; print "number of gis seen\t$gi_count\n"; print "number of protids seen\t$protid_count\n"; close(OUT);
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re: help with regex
by JavaFan (Canon) on May 07, 2012 at 12:20 UTC | |
|
Re: help with regex
by Anonymous Monk on May 07, 2012 at 17:05 UTC |