help with regex

AWallBuilder has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I am trying to parse and process a fasta file. Basically I need to 'simplify' the headers. I want to extract the GI (or protein_id, if there isn't a GI) from the current header, and then create a new file where just the GI or protein_id is the header. It is mostly working fine, except that I am not capturing the "protein_id". I guess I am having problems with my regex

here are some example headers

>1001585.MDS_0001 protein_id="YP_004377784.1" product="chromosomal rep
+lication initiation protein" GI="330500915" GeneID="10459818"
>1001585.MDS_0002 protein_id="YP_004377785.1" product="DNA polymerase 
+III subunit beta" GI="330500916" GeneID="10454784"
>1001585.MDS_0003 protein_id="YP_004377786.1" product="recombination p
+rotein F" GI="330500917" GeneID="10454785"
>1001585.MDS_0004 protein_id="YP_004377787.1" product="DNA gyrase subu
+nit B" GI="330500918" GeneID="10454786"
>1001585.MDS_0005 protein_id="YP_004377788.1" GI="330500919" GeneID="1
+0454787"
>1001585.MDS_0006 protein_id="YP_004377789.1" GI="330500920" GeneID="1
+0454788"
>1001585.MDS_0007 protein_id="YP_004377790.1" GI="330500921" GeneID="1
+0454789"
>1001585.MDS_0008 protein_id="YP_004377791.1" GI="330500922" GeneID="1
+0454790"
>1001585.MDS_0009 protein_id="YP_004377792.1" product="ABC transporter
+ permease" GI="330500923" GeneID="10454791"
>1001585.MDS_0010 protein_id="YP_004377793.1" product="ABC transporter
+ ATP-binding protein" GI="330500924" GeneID="10454792"
>245014.CK3_35030 protein_id="CBL42879.1" product="Predicted transcrip
+tion factor, homolog of eukaryotic MBF1"
>245014.CK3_35040 protein_id="CBL42880.1" product="Bacterial protein o
+f unknown function (DUF961)."
[download]

here is a portion of the code I am using to extract either the GI or if there isn't a GI then the "protein_id", and use that as the new header. It seems to work for getting the GI, but it isn't recognizing any of the headers with just a protein_id

use strict;
use warnings;
use Data::Dumper;

my $in_fasta=$ARGV[0];
open(IN,$in_fasta) or die "cannot open $in_fasta";
my $out_fasta="$in_fasta.gi_header";
open(OUT,">",$out_fasta);
my $err_file="$in_fasta.headers_wNOgi";
open(ERR,">",$err_file);

my $header_count=0;
my $gi_count=0; my $protid_count=0;
my %giTogeneid;
while (my $line=<IN>){
                if ($line =~ /^>/ && $line =~ /^>.+GI="(\w+)"/){
                        my $gi=$1;
                        $gi_count ++;
                        my @header_columns=split(/\s+/,$line);
                        my $SM_geneid=$header_columns[0];
                        $SM_geneid=~s/^>//g;
                        $giTogeneid{$gi}=$SM_geneid;
                        print OUT ">gi$gi\n";
                }
                elsif ($line =~ /^>/ &&  $line !~ /^>.+GI="(\w+)"/ && 
+ $line !~ /^>.+protein_id="(\w+)"/){
                        print ERR "$line\n";
                }
                elsif ($line =~ /^>/ && $line =~ /^>.+protein_id="(\w+
+)"/){
                        my $protid=$1;
                        $protid_count++;
                        print OUT ">$protid\n";
                        my @header_columns=split(/\s+/,$line);
                        my $SM_geneid=$header_columns[0];
                        $SM_geneid=~s/^>//g;
                        $giTogeneid{$protid}=$SM_geneid;
                }

                elsif ($line !~ /^>/){
                        print OUT $line;
                }
}
close(IN);
print "number of headers seen\t$header_count\n";
print "number of gis seen\t$gi_count\n";
print "number of protids seen\t$protid_count\n";
close(OUT);
[download]

Comment on help with regex Select or Download Code

Replies are listed 'Best First'.
Re: help with regex by JavaFan (Canon) on May 07, 2012 at 12:20 UTC
While you're including sample input, you're not showing what the intended output is. So, I had to guess. Perhaps the following will work for you: #!/usr/bin/perl use 5.010; use strict; use warnings; while (<DATA>) { my %chunks = /(\S+)="([^"]+)"/g; my $header = delete $chunks{GI} \|\| delete $chunks{protein_id} or n +ext; print ">$header"; print ' ', $_, '="', $chunks{$_}, '"' for keys %chunks; print "\n"; } __DATA__ >1001585.MDS_0001 protein_id="YP_004377784.1" product="chromosomal rep +lication initiation protein" GI="330500915" GeneID="10459818" >1001585.MDS_0002 protein_id="YP_004377785.1" product="DNA polymerase +III subunit beta" GI="330500916" GeneID="10454784" >1001585.MDS_0003 protein_id="YP_004377786.1" product="recombination p +rotein F" GI="330500917" GeneID="10454785" >1001585.MDS_0004 protein_id="YP_004377787.1" product="DNA gyrase subu +nit B" GI="330500918" GeneID="10454786" >1001585.MDS_0005 protein_id="YP_004377788.1" GI="330500919" GeneID="1 +0454787" >1001585.MDS_0006 protein_id="YP_004377789.1" GI="330500920" GeneID="1 +0454788" >1001585.MDS_0007 protein_id="YP_004377790.1" GI="330500921" GeneID="1 +0454789" >1001585.MDS_0008 protein_id="YP_004377791.1" GI="330500922" GeneID="1 +0454790" >1001585.MDS_0009 protein_id="YP_004377792.1" product="ABC transporter + permease" GI="330500923" GeneID="10454791" >1001585.MDS_0010 protein_id="YP_004377793.1" product="ABC transporter + ATP-binding protein" GI="330500924" GeneID="10454792" >245014.CK3_35030 protein_id="CBL42879.1" product="Predicted transcrip +tion factor, homolog of eukaryotic MBF1" >245014.CK3_35040 protein_id="CBL42880.1" product="Bacterial protein o +f unknown function (DUF961)." [download] Which gives as output: >330500915 protein_id="YP_004377784.1" product="chromosomal replicatio +n initiation protein" GeneID="10459818" >330500916 protein_id="YP_004377785.1" product="DNA polymerase III sub +unit beta" GeneID="10454784" >330500917 protein_id="YP_004377786.1" product="recombination protein +F" GeneID="10454785" >330500918 protein_id="YP_004377787.1" product="DNA gyrase subunit B" +GeneID="10454786" >330500919 protein_id="YP_004377788.1" GeneID="10454787" >330500920 protein_id="YP_004377789.1" GeneID="10454788" >330500921 protein_id="YP_004377790.1" GeneID="10454789" >330500922 protein_id="YP_004377791.1" GeneID="10454790" >330500923 protein_id="YP_004377792.1" product="ABC transporter permea +se" GeneID="10454791" >330500924 protein_id="YP_004377793.1" product="ABC transporter ATP-bi +nding protein" GeneID="10454792" >CBL42879.1 product="Predicted transcription factor, homolog of eukary +otic MBF1" >CBL42880.1 product="Bacterial protein of unknown function (DUF961)." [download] It removes the GI or protein_id from the line, and doesn't keep the order of the fields. It also assumes all values are enclosed by double quotes (they all do in the input). It also assumes no GI or protein_id is 0.	[reply] [d/l] [select]
Re: help with regex by Anonymous Monk on May 07, 2012 at 17:05 UTC
A search for "fasta" at http://search.cpan.org produced 286 hits today. Will none of these apply to you?	[reply]