File Parsing issue

cowboyrocks has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,
I am trying to parse a large file and filter according the fields I require.
My code looks like this:-

#!/usr/bin/perl

$fn = $ARGV[0];
open(FH, "$fn") || die("cannot open:$!");

          $\ = ' '; 
while(<FH>)
 {     
   
           $\ = "\n" if eof;
           chomp($_);
           $_ =~ s/\(|\)//xmg;
           if($_ =~ /LOCUS\s+(\S+)/) { $gi=$1; }
           elsif($_ =~ /\A\s+(\w+)\s+(\S+)\.\.(\S+)/) { print "\t\t$gi
+\t$1\t$2\t$3\n"; }
           elsif($_ =~ /gene\=\"(\S+)\"/) { print "\t\t$gi\t\t\t\t\tge
+ne=$1\n"; }
           elsif($_ =~ /\/transcript_id\=\"(\S+)\"/) {print "\t\t$gi\t
+\t\t\t\tTranscriptID=$1\n";}

 }
[download]

Input looks like below:-

     gene         348634..348822
                     /gene="LOC100129305"
                     /note="Derived by automated computational analysi
+s using
                     gene prediction method: GNOMON. Supporting eviden
+ce
                     includes similarity to: 1 Protein"
                     /db_xref="GeneID:100129305"
     mRNA         348634..348822
                     /gene="LOC100129305"
                     /product="similar to C21orf94 protein"
                     /note="Derived by automated computational analysi
+s using
                     gene prediction method: GNOMON. Supporting eviden
+ce
                     includes similarity to: 1 Protein"
                     /transcript_id="XM_001714760.1"
                     /db_xref="GI:169215265"
                     /db_xref="GeneID:100129305"
[download]

My ouput is coming like this(tab separated):-

           gene   348634   348822   
                                 gene=LOC100129305
           gene   348634   348822
                                 gene=LOC100129305
                                 TranscriptID=XM_001714760.1
[download]

Whereas I want it to come like like:-

           gene   348634   348822   gene=LOC100129305
           gene   348634   348822   gene=LOC100129305    TranscriptID=
+XM_001714760.1
[download]

Thanks in advance
cowboy :)

Comment on File Parsing issue Select or Download Code

Replies are listed 'Best First'.
Re: File Parsing issue by ig (Vicar) on Mar 19, 2009 at 06:32 UTC
You are very close to having what you want. The following slightly modified version of your program may help. #!/usr/bin/perl use warnings; use strict; { my $gi = ""; while (<DATA>) { chomp($_); $_ =~ s/$\|$//xmg; if ( $_ =~ /LOCUS\s+(\S+)/ ) { $gi = $1; } elsif ( $_ =~ /\A\s+(\w+)\s+(\S+)\.\.(\S+)/ ) { print "\n\t\t$gi\t$1\t$2\t$3"; } elsif ( $_ =~ /gene\=\"(\S+)\"/ ) { print "\tgene=$1"; } elsif ( $_ =~ /\/transcript_id\=\"(\S+)\"/ ) { print "\tTranscriptID=$1"; } } print "\n"; } __DATA__ gene 348634..348822 /gene="LOC100129305" /note="Derived by automated computational analysi +s using gene prediction method: GNOMON. Supporting eviden +ce includes similarity to: 1 Protein" /db_xref="GeneID:100129305" mRNA 348634..348822 /gene="LOC100129305" /product="similar to C21orf94 protein" /note="Derived by automated computational analysi +s using gene prediction method: GNOMON. Supporting eviden +ce includes similarity to: 1 Protein" /transcript_id="XM_001714760.1" /db_xref="GI:169215265" /db_xref="GeneID:100129305" [download] This produces `gene 348634 348822 gene=LOC100129305 mRNA 348634 348822 gene=LOC100129305 + TranscriptID=XM_001714760.1` [download] update: removed the setting of $\ as it is not necessary. update2: I used perltidy to reformat your program. You might like to give it a try.	[reply] [d/l] [select]
Re^2: File Parsing issue by cowboyrocks (Novice) on Mar 19, 2009 at 07:12 UTC
Oh great :) Thats brilliant and thanks for the tip - perltidy. cowboy	[reply]
Re: File Parsing issue by bichonfrise74 (Vicar) on Mar 19, 2009 at 17:24 UTC
Hi ig, I wrote the code in this manner... What is the pros / cons of using your code over this? Or do you think it is just the same? `#!/usr/bin/perl use strict; while( <DATA> ) { if ( /\s+\w+\s+\d+\.\.\d+/ ... /\"GeneID:\d+\"/ ) { print "$1 $2 $3" if ( /\s+(\w+)\s+(\d+)\.\.(\d+)/ ); print " $1" if ( /\/(gene=.)/ ); print " $1" if ( /\/(transcript_id=.)/ ); print "\n" if ( /\"GeneID:\d+\"/ ); } }` [download]	[reply] [d/l]
Re^2: File Parsing issue by ig (Vicar) on Mar 19, 2009 at 19:51 UTC
I made fairly minimal change to the original code, to help the OP see how close they were to a solution. My idea of optimum would almost certainly be different and others would have other very good ideas - as you have. Details of the results aside, what you have written is quite appealing: I find it easier to read and understand than the original. The code you wrote is not quite the same: for example you don't display the information following LOCUS, of which there is no example in the sample data provided, and you don't remove the quotes from the gene and transcript_id values. The input data looks like an excerpt from a GeneBank flat file record. These files are quite complex and rather than writing my own software I would probably turn to BioPerl for modules to read these files.	[reply]