cowboyrocks has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,
I need some clue about how to format this sample case. How can I run the if loop in reverse? I guess that may solve the problem ... :-)
My input looks like this:-
NT_113797 CDS 122829 123323 - gene=LOC644591 ProteinID=XP_932799.1
NT_113798 CDS 4457 4636 -
NT_077932 CDS 9894 9928 -
NT_077932 CDS 65297 65828 +
NT_077932 CDS 89196 89690 - gene=LOC653505 ProteinID=BJDND993
My output looks like this:-
NT_113797 CDS 122829 123323 -
NT_113798 CDS 4457 4636 - gene=LOC644591
NT_077932 CDS 9894 9928 - gene=LOC644591
NT_077932 CDS 65297 65828 + gene=LOC644591
NT_077932 CDS 89196 89690 - gene=LOC644591
I want it to be like this:-
NT_113797 CDS 122829 123323 - gene=LOC644591
NT_113798 CDS 4457 4636 - gene=LOC653505
NT_077932 CDS 9894 9928 - gene=LOC653505
NT_077932 CDS 65297 65828 + gene=LOC653505
NT_077932 CDS 89196 89690 - gene=LOC653505
My code looks something like this:-
#!/usr/bin/perl
use warnings;
use strict;

my $fn = $ARGV[0];
open(FH, "$fn") || die("cannot open:$!");
{
    my $geneName = "";
    while(<FH>) {
        if($_ =~ /\A(\S+)\t(\S+)\t(\d+)\t(\d+)\t(\S)\s+$/) {
            print "\n$_ $geneName";
        }
        if($_ =~ /\A(\S+)\t(\S+)\t(\d+)\t(\d+)\t(\S)\s+(\S+)\s+(\S+)\s+/) {
            $geneName = $6;
        }
    }
}
Thanks in advance
cowboy :-)

Replies are listed 'Best First'.
Re: Parsing help
by GrandFather (Saint) on Apr 01, 2009 at 04:04 UTC

    Since it's not clear from your sample data where the tabs ought to be (and it doesn't matter for demonstration purposes anyway), I've changed the sample code to use spaces instead and to get its input from the __DATA__ section:

    use strict;
    use warnings;

    my @partsList;

    push @partsList, [split] while <DATA>;

    my $geneName = '';
    $geneName = $_->[5] ||= $geneName for reverse @partsList;
    print join ("\t", @{$_}[0 .. 5]), "\n" for @partsList;

    __DATA__
    NT_113797 CDS 122829 123323 - gene=LOC644591 ProteinID=XP_932799.1
    NT_113798 CDS 4457 4636 -
    NT_077932 CDS 9894 9928 -
    NT_077932 CDS 65297 65828 +
    NT_077932 CDS 89196 89690 - gene=LOC653505 ProteinID=BJDND993

    Prints:

    NT_113797 CDS 122829 123323 - gene=LOC644591
    NT_113798 CDS 4457 4636 - gene=LOC653505
    NT_077932 CDS 9894 9928 - gene=LOC653505
    NT_077932 CDS 65297 65828 + gene=LOC653505
    NT_077932 CDS 89196 89690 - gene=LOC653505
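
The reverse pass above can also be read as a small back-fill routine. The following is a minimal sketch of that idea, assuming whitespace-separated records with the gene tag (when present) in the sixth field. The name `backfill_genes` is hypothetical and not part of the code above.

```perl
use strict;
use warnings;

# Hypothetical helper: takes record lines and returns array refs of fields.
# A record with no sixth field inherits it from the nearest record below
# that has one.
sub backfill_genes {
    my @records = map { [ split ' ', $_ ] } @_;
    my $gene = '';    # stays empty for trailing records with no gene below them
    for my $rec (reverse @records) {
        # Keep the record's own gene if it has one; otherwise take the
        # most recent gene seen while walking up from the end.
        $gene = $rec->[5] ||= $gene;
    }
    return @records;
}

my @filled = backfill_genes(
    "NT_113797 CDS 122829 123323 - gene=LOC644591 ProteinID=XP_932799.1",
    "NT_113798 CDS 4457 4636 -",
    "NT_077932 CDS 89196 89690 - gene=LOC653505 ProteinID=BJDND993",
);
print join("\t", @{$_}[0 .. 5]), "\n" for @filled;
```

In this run the middle record picks up gene=LOC653505 from the record below it, which is the behaviour the original question asks for.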

    True laziness is hard work

      A very elegant solution, but it has the drawback of having to load the whole file into memory and traverse the whole list 3 times (1 - reading, 2 - filling field #5 and 3 - printing).

      Since the size of this kind of genomic file may be an issue, here is another version that is a bit more resource-friendly:

      use strict;
      use warnings;

      my @acc = ();    # records seen since the last line that carried a gene name

      while (<DATA>) {
          my @recs = split;
          push @acc, [@recs];
          if (my $geneName = $recs[5]) {
              # a gene name closes the group: print every buffered record with it
              # (the newline is kept outside join to avoid a trailing tab)
              print join ("\t", @{$_}[0 .. 4], $geneName), "\n" for @acc;
              @acc = ();
          }
      }

      __DATA__
      NT_113797 CDS 122829 123323 - gene=LOC644591 ProteinID=XP_932799.1
      NT_113798 CDS 4457 4636 -
      NT_077932 CDS 9894 9928 -
      NT_077932 CDS 65297 65828 +
      NT_077932 CDS 89196 89690 - gene=LOC653505 ProteinID=BJDND993

      Outputs the desired result:

      NT_113797 CDS 122829 123323 - gene=LOC644591
      NT_113798 CDS 4457 4636 - gene=LOC653505
      NT_077932 CDS 9894 9928 - gene=LOC653505
      NT_077932 CDS 65297 65828 + gene=LOC653505
      NT_077932 CDS 89196 89690 - gene=LOC653505
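
One case the buffering loop above does not cover is input that ends with records that have no gene line after them: they stay in the buffer and are never printed. Below is a minimal sketch of the same buffering idea with a final flush. The subroutine name `fill_buffered`, the choice to print leftover records without a gene, and the last sample record are all assumptions made for illustration and do not come from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: reads records from $in and prints them to $out,
# holding in memory only the records seen since the last gene-bearing line.
sub fill_buffered {
    my ($in, $out) = @_;
    my @pending;
    while (my $line = <$in>) {
        my @fields = split ' ', $line;
        next unless @fields >= 5;    # skip blank or short lines
        push @pending, [ @fields[0 .. 4] ];
        if (defined(my $gene = $fields[5])) {
            # A gene tag closes the group: print every buffered record with it.
            print {$out} join("\t", @$_, $gene), "\n" for @pending;
            @pending = ();
        }
    }
    # Assumption: records left at end of input are printed without a gene.
    print {$out} join("\t", @$_), "\n" for @pending;
    return;
}

# The final record is invented here to exercise the flush.
my $sample = join "\n",
    "NT_113797 CDS 122829 123323 - gene=LOC644591 ProteinID=XP_932799.1",
    "NT_113798 CDS 4457 4636 -",
    "NT_077932 CDS 89196 89690 - gene=LOC653505 ProteinID=BJDND993",
    "NT_077932 CDS 90000 90100 +",
    "";
open my $in, '<', \$sample or die "cannot open in-memory input: $!";
fill_buffered($in, \*STDOUT);
```

Passing the output filehandle in keeps the routine streaming: nothing beyond the current group is held in memory, which is the point of the buffered version.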

      Hope this helps

      citromatik

Re: Parsing help
by Anonymous Monk on Apr 01, 2009 at 03:39 UTC
    FWIW, that seems like the wrong approach to take. You should be using something that actually knows how to parse that record format (BioPerl ...).

    Now to your code: gene=LOC653505 comes after gene=LOC644591, so you have to parse the file twice. For example:

    #!/usr/bin/perl --
    use strict;
    use warnings;

    sub scan_gene {
        my $fh = shift;
        my $tell = tell $fh;
        my @gene;
        while(readline $fh){
            push @gene, $1 if /gene=(\S+)/;
        }
        seek $fh, $tell, 0;
        return @gene;
    }

    my @gene = scan_gene(\*DATA);
    my $geneix = 0;
    #warn "@gene";
    while(<DATA>){
        my( $before, $sign, $after ) = split /([+-])/, $_, 2;
        print "$before $sign $gene[$geneix]\n";
        $geneix++ if index($after, $gene[$geneix]) != -1;
    #    warn "before $before\nsign $sign\nafter $after";
    }

    __DATA__
    NT_113797 CDS 122829 123323 - gene=LOC644591 ProteinID=XP_932799.1
    NT_113798 CDS 4457 4636 -
    NT_077932 CDS 9894 9928 -
    NT_077932 CDS 65297 65828 +
    NT_077932 CDS 89196 89690 - gene=LOC653505 ProteinID=BJDND993