cowboyrocks has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,
I am trying to parse a large file and filter according the fields I require.
My code looks like this:-
#!/usr/bin/perl $fn = $ARGV[0]; open(FH, "$fn") || die("cannot open:$!"); $\ = ' '; while(<FH>) { $\ = "\n" if eof; chomp($_); $_ =~ s/\(|\)//xmg; if($_ =~ /LOCUS\s+(\S+)/) { $gi=$1; } elsif($_ =~ /\A\s+(\w+)\s+(\S+)\.\.(\S+)/) { print "\t\t$gi +\t$1\t$2\t$3\n"; } elsif($_ =~ /gene\=\"(\S+)\"/) { print "\t\t$gi\t\t\t\t\tge +ne=$1\n"; } elsif($_ =~ /\/transcript_id\=\"(\S+)\"/) {print "\t\t$gi\t +\t\t\t\tTranscriptID=$1\n";} }
Input looks like below:-
gene 348634..348822 /gene="LOC100129305" /note="Derived by automated computational analysi +s using gene prediction method: GNOMON. Supporting eviden +ce includes similarity to: 1 Protein" /db_xref="GeneID:100129305" mRNA 348634..348822 /gene="LOC100129305" /product="similar to C21orf94 protein" /note="Derived by automated computational analysi +s using gene prediction method: GNOMON. Supporting eviden +ce includes similarity to: 1 Protein" /transcript_id="XM_001714760.1" /db_xref="GI:169215265" /db_xref="GeneID:100129305"
My ouput is coming like this(tab separated):-
gene 348634 348822 gene=LOC100129305 gene 348634 348822 gene=LOC100129305 TranscriptID=XM_001714760.1
Whereas I want it to come like like:-
gene 348634 348822 gene=LOC100129305 gene 348634 348822 gene=LOC100129305 TranscriptID= +XM_001714760.1
Thanks in advance
cowboy :)

Replies are listed 'Best First'.
Re: File Parsing issue
by ig (Vicar) on Mar 19, 2009 at 06:32 UTC

    You are very close to having what you want. The following slightly modified version of your program may help.

    #!/usr/bin/perl use warnings; use strict; { my $gi = ""; while (<DATA>) { chomp($_); $_ =~ s/\(|\)//xmg; if ( $_ =~ /LOCUS\s+(\S+)/ ) { $gi = $1; } elsif ( $_ =~ /\A\s+(\w+)\s+(\S+)\.\.(\S+)/ ) { print "\n\t\t$gi\t$1\t$2\t$3"; } elsif ( $_ =~ /gene\=\"(\S+)\"/ ) { print "\tgene=$1"; } elsif ( $_ =~ /\/transcript_id\=\"(\S+)\"/ ) { print "\tTranscriptID=$1"; } } print "\n"; } __DATA__ gene 348634..348822 /gene="LOC100129305" /note="Derived by automated computational analysi +s using gene prediction method: GNOMON. Supporting eviden +ce includes similarity to: 1 Protein" /db_xref="GeneID:100129305" mRNA 348634..348822 /gene="LOC100129305" /product="similar to C21orf94 protein" /note="Derived by automated computational analysi +s using gene prediction method: GNOMON. Supporting eviden +ce includes similarity to: 1 Protein" /transcript_id="XM_001714760.1" /db_xref="GI:169215265" /db_xref="GeneID:100129305"

    This produces

    gene 348634 348822 gene=LOC100129305 mRNA 348634 348822 gene=LOC100129305 + TranscriptID=XM_001714760.1

    update: removed the setting of $\ as it is not necessary.

    update2: I used perltidy to reformat your program. You might like to give it a try.

      Oh great :) Thats brilliant and thanks for the tip - perltidy.
      cowboy
Re: File Parsing issue
by bichonfrise74 (Vicar) on Mar 19, 2009 at 17:24 UTC
    Hi ig, I wrote the code in this manner... What is the pros / cons of using your code over this? Or do you think it is just the same?
    #!/usr/bin/perl use strict; while( <DATA> ) { if ( /\s+\w+\s+\d+\.\.\d+/ ... /\"GeneID:\d+\"/ ) { print "$1 $2 $3" if ( /\s+(\w+)\s+(\d+)\.\.(\d+)/ ); print " $1" if ( /\/(gene=.*)/ ); print " $1" if ( /\/(transcript_id=.*)/ ); print "\n" if ( /\"GeneID:\d+\"/ ); } }

      I made fairly minimal change to the original code, to help the OP see how close they were to a solution. My idea of optimum would almost certainly be different and others would have other very good ideas - as you have. Details of the results aside, what you have written is quite appealing: I find it easier to read and understand than the original.

      The code you wrote is not quite the same: for example you don't display the information following LOCUS, of which there is no example in the sample data provided, and you don't remove the quotes from the gene and transcript_id values.

      The input data looks like an excerpt from a GeneBank flat file record. These files are quite complex and rather than writing my own software I would probably turn to BioPerl for modules to read these files.