mocnii has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

My file looks something like this (repeated many times for different genes with indicated tabs):

//
SPECIES\tCiona intestinalis
DEV_STAGE\tEarly tailbud
PREDICTED_GENE\tci0100148006
GENE_NAME\tci0100148006\Ms4a4b\Ms4a4c\Ms4a4d\
ORIGINAL_ANNOTATION/COMMENTS\tConversion to Aniseed...
AUTHORS\tSatou Y, Takatori N,...
REFERENCES\tDevelopment. 2001 ;128(15):2893-904
URL_OF_ORIGINAL_ANNOTATION\thttp://ghost.zool.kyoto-u.ac.jp/indexr1.html
ANISEED_ANNOTATION\tSTAINED_REGION:\thead endoderm\tSTAINED_MOL:\tci0100148006
ANISEED_ANNOTATION\tSTAINED_REGION:\ttail nerve cord\tSTAINED_MOL:\tci0100148006
IN_SITU_URL\thttp://aniseed-ibdm.univ-mrs.fr/insitu.php?id=2605647
//

and my code is here:

use strict;
my $ish = $ARGV[0];
open (ISH, "<", $ish) || die "$!";
open (OUT, ">", "out.txt") || die "$!";
{
    $/ = "//";
    while (<ISH>){
        m/DEV_STAGE\t(.*?)\n/g;
        my $a= $1;
        my $b;
        if (m/PREDICTED_GENE\t(.*?)\n/g){
            $b= $1;
        }
        else{
            $b = '#';
        }
        while (m/\tSTAINED_REGION:\t(.*?)\tSTAINED_MOL:\t(.*?)\n/g) {
            my $c = $1;
            my $x = $2;
            print OUT "$b\t$a\t$c\t$x\n";
        }
    }
}

The problem is that I only get the last two columns (STAINED_REGION and STAINED_MOL) and can't get the first two (PREDICTED_GENE and DEV_STAGE). Could you point out what I'm doing wrong?

Found code that works :)

use strict;
my $ish = $ARGV[0];
open (ISH, "<", $ish) || die "$!";
open (OUT, ">", "out.txt") || die "$!";
{
    $/ = "//\n";
    while (<ISH>){
        m/DEV_STAGE\t(.*?)\n/g;
        my $a= $1;
        my $b;
        if (m/PREDICTED_GENE\t(.*?)\n/g){
            $b= $1;
        }
        else{
            $b = '#';
        }
        while (m/\tSTAINED_REGION:\t(.*?)\tSTAINED_MOL:\t(.*?)\n/g) {
            my $c = $1;
            my $x = $2;
            print OUT "$b\t$a\t$c\t$x\n";
        }
    }
}

Replies are listed 'Best First'.
Re: Extract from text file
by trizen (Hermit) on Oct 12, 2011 at 17:43 UTC
    The problem is with the http links inside the block content. Your blocks run from // to //, but 'http://' contains a double slash too, so each record ends prematurely at the first URL. The fix is to change $/ to: $/ = "//\n";
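A minimal sketch of that premature split, assuming the data holds real tab characters where the post shows \t. The sample record is shortened from the one in the question, and split_records is a hypothetical helper that reads from an in-memory filehandle instead of a file:

```perl
use strict;
use warnings;

# One shortened record in the thread's format (real tabs, real newlines).
my $data = join "",
    "//\n",
    "DEV_STAGE\tEarly tailbud\n",
    "PREDICTED_GENE\tci0100148006\n",
    "URL_OF_ORIGINAL_ANNOTATION\thttp://ghost.zool.kyoto-u.ac.jp/indexr1.html\n",
    "ANISEED_ANNOTATION\tSTAINED_REGION:\thead endoderm\tSTAINED_MOL:\tci0100148006\n",
    "//\n";

# Hypothetical helper: read $text in chunks ending at $separator.
sub split_records {
    my ($text, $separator) = @_;
    open(my $fh, "<", \$text) or die "Cannot open in-memory file: $!";
    local $/ = $separator;      # input record separator for this read only
    my @chunks = <$fh>;
    close $fh;
    return @chunks;
}

my @by_slashes = split_records($data, "//");
my @by_line    = split_records($data, "//\n");

# Show which fields end up together in each chunk when splitting on "//".
for my $i (0 .. $#by_slashes) {
    my $has_stage  = $by_slashes[$i] =~ /DEV_STAGE/      ? "yes" : "no";
    my $has_region = $by_slashes[$i] =~ /STAINED_REGION/ ? "yes" : "no";
    print "chunk $i: DEV_STAGE=$has_stage STAINED_REGION=$has_region\n";
}
printf "'//' gives %d chunks, '//\\n' gives %d\n",
    scalar @by_slashes, scalar @by_line;
```

With this sample, splitting on "//" cuts the record inside 'http://', so the DEV_STAGE line and the STAINED_REGION line land in different chunks. Splitting on "//\n" keeps the whole record in one chunk.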
Re: Extract from text file
by pvaldes (Chaplain) on Oct 12, 2011 at 18:45 UTC

    Maybe you are overwriting $1; your loop looks strange to me. Something like this might help:

    while (<ISH>){
        my $a;
        my $b = '#';
        my $c;
        my $x;
        if (m/DEV_STAGE\t(.*?)\n/g) {
            $a= $1;
            print OUT "a is: ",$a
        }
        elsif (m/PREDICTED_GENE\t(.*?)\n/g){
            $b =$1;
            print OUT "b is: ",$b
        }
        elsif (m/\tSTAINED_REGION:\t(.*?)\tSTAINED_MOL:\t(.*?)\n/g) {
            $c = $1;
            print OUT "c is: ",$c;
            $x = $2;
            print OUT "x is: ",$x;
        }
        else {}
    }
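On the $1 concern: capture variables keep the value from the last successful match, so reading $1 after an unchecked match can pick up a stale value. A small sketch of that effect and of the list-context idiom that avoids it (the sample strings are made up for illustration):

```perl
use strict;
use warnings;

# A record chunk that has no DEV_STAGE line.
my $record = "PREDICTED_GENE\tci0100148006\nSTAINED_MOL:\tci0100148006\n";

# An earlier successful match sets $1 ...
"some gene" =~ /(\w+) gene/;

# ... and this failing match leaves it untouched, so $1 is stale here.
$record =~ /DEV_STAGE\t(.*?)\n/;
my $stale = $1;

# In list context a failed match returns an empty list,
# so the variable stays undef and a default can be applied.
my ($stage) = $record =~ /DEV_STAGE\t(.*?)\n/;
$stage = '#' unless defined $stage;

my ($gene) = $record =~ /PREDICTED_GENE\t(.*?)\n/;
$gene = '#' unless defined $gene;

print "stale \$1: $stale\n";
print "stage: $stage, gene: $gene\n";
```

Dropping the /g flag on these single-value matches also avoids depending on the order of the fields, since a scalar-context /g match resumes from where the previous one stopped.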

    But your real problem is probably this:

    open (OUT, ">", "out.txt")

    You make a first pass through the loop and print to OUT.

    Then you examine the second line, make a second pass and OVERWRITE the output file with the current values; $a is lost now.

    Then you make a third pass and OVERWRITE AGAIN, $b is lost... etc.

    Solution: use '>>' instead of '>' and put each print inside its own if-elsif block.
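For reference, a minimal sketch of what the two open modes do, written against a temporary file so it does not touch a real out.txt. '>' truncates the file once, at the moment it is opened; a handle opened before a loop is not truncated again on each pass, and prints on it accumulate. '>>' keeps whatever the file already holds. read_lines is a hypothetical helper:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Temporary file, removed automatically when the program exits.
my (undef, $path) = tempfile(UNLINK => 1);

# Hypothetical helper: return the lines currently in the file.
sub read_lines {
    my ($file) = @_;
    open(my $in, "<", $file) or die "Cannot read $file: $!";
    my @lines = <$in>;
    close $in;
    return @lines;
}

# '>' truncates at open time; three prints on the same handle all stay.
open(my $out, ">", $path) or die "Cannot write $path: $!";
print {$out} "row $_\n" for 1 .. 3;
close $out;
my @after_write = read_lines($path);

# '>>' appends to the existing content.
open($out, ">>", $path) or die "Cannot append to $path: $!";
print {$out} "row 4\n";
close $out;
my @after_append = read_lines($path);

# Opening again with '>' is what discards the earlier rows.
open($out, ">", $path) or die "Cannot write $path: $!";
print {$out} "row 5\n";
close $out;
my @after_reopen = read_lines($path);

printf "%d, %d, %d lines\n",
    scalar @after_write, scalar @after_append, scalar @after_reopen;
```

One practical consequence: with '>>', output from earlier runs of a script stays in the file, so the choice depends on whether each run should start from an empty file.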

Re: Extract from text file
by flexvault (Monsignor) on Oct 12, 2011 at 19:48 UTC

    I started this before the other PMs gave an answer, so this is just another way to do it!

    use strict;
    use warnings;

    my $ish = $ARGV[0];
    if ( ! defined $ish ) { die "1: $!"; }   ## It helps to know which die?
    open (ISH, "<", $ish) || die "2: $!";
    open (OUT, ">", "out.txt") || die "3: $!";
    open (my $LOG, ">", "log.txt") || die "4: $!";

    # $/ = "//";
    my $no = 0;
    our %hash = ();
    while (<ISH>) {
        my $var = $_;
        $no++;
        print $LOG "$no\t$var";   ## This is to help know that you are reading and what it is
        chomp($var);
        if ( $var eq "//" ) {
            if ( %hash ) {
                my ( $a, $b, $c, $x ) = Process_Hash();
                ## You could do more work on %hash here or move the sub here
                print OUT "$b\t$a\t$c\t$x\n";
            }
            %hash = ();   ## Clear %hash for next sequence
            next;
        }
        ## Note: If your data contains real tabs (0x09), then make the '\\' a '\'
        my ( $key, $value ) = split(/\\t/,$var );
        # print $LOG "\t$var => |$key|$value|\n";
        $hash{ $key } = $value;
    }
    if ( %hash ) {
        my ( $a, $b, $c, $x ) = Process_Hash();
        ## You could do more work on %hash here or move the sub here
        print OUT "$b\t$a\t$c\t$x\n";
    }
    exit;   # optional but I like to see it and be able to search on.

    ## Process the element of %hash
    sub Process_Hash {
        my ( $a, $b, $c, $x ) = ( "", "#", "for-you", "for-you" );
        if ( defined $hash{"DEV_STAGE"} ) { $a = $hash{"DEV_STAGE"}; }
        if ( defined $hash{"PREDICTED_GENE"} ) { $b = $hash{"PREDICTED_GENE"}; }
        # while (m/\tSTAINED_REGION:\\t(.*?)\tSTAINED_MOL:\t(.*?)\n/g)
        return( $a, $b, $c, $x );
    }
    1;
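The script above leaves the STAINED_REGION part as "for-you". One possible way to fill it in, sketched under the assumption that the data uses real tabs: since ANISEED_ANNOTATION appears more than once per record, a plain `$hash{$key} = $value` would keep only the last one, so this version stores every value in an array. parse_record and record_rows are hypothetical names, not part of the original script:

```perl
use strict;
use warnings;

# Parse one record's lines ("KEY<tab>value...") into a hash of array refs,
# so a repeated key such as ANISEED_ANNOTATION keeps every value.
sub parse_record {
    my (@lines) = @_;
    my %record;
    for my $line (@lines) {
        my $text = $line;
        chomp $text;
        next if $text eq "//" or $text eq "";
        my ($key, $value) = split /\t/, $text, 2;   # split on the first tab only
        next unless defined $value;
        push @{ $record{$key} }, $value;
    }
    return \%record;
}

# Build one output row per annotation: gene, stage, region, molecule.
sub record_rows {
    my ($record) = @_;
    my $stage = $record->{DEV_STAGE}      ? $record->{DEV_STAGE}[0]      : "";
    my $gene  = $record->{PREDICTED_GENE} ? $record->{PREDICTED_GENE}[0] : "#";
    my @rows;
    for my $annotation (@{ $record->{ANISEED_ANNOTATION} || [] }) {
        if ($annotation =~ /^STAINED_REGION:\t(.*?)\tSTAINED_MOL:\t(.*)$/) {
            push @rows, join("\t", $gene, $stage, $1, $2);
        }
    }
    return @rows;
}

my @lines = (
    "//\n",
    "DEV_STAGE\tEarly tailbud\n",
    "PREDICTED_GENE\tci0100148006\n",
    "ANISEED_ANNOTATION\tSTAINED_REGION:\thead endoderm\tSTAINED_MOL:\tci0100148006\n",
    "ANISEED_ANNOTATION\tSTAINED_REGION:\ttail nerve cord\tSTAINED_MOL:\tci0100148006\n",
    "//\n",
);
my @rows = record_rows(parse_record(@lines));
print "$_\n" for @rows;
```

For the sample lines this produces two rows, one for "head endoderm" and one for "tail nerve cord", each carrying the gene and stage of the record.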

    If you start using hashes, you'll find that they save you a lot of code and help solve some very difficult problems. But there is no right or wrong, as long as it works correctly. Also, I left in the script the log that helped me figure out what you were trying to do. Using a log will help you write solid code. Delete it or comment it out when the script works the way you want. I usually declare a '$Debug' variable with a debug level. Once it works, I just declare

    my $Debug = 0; ## 0-Clean 1-light debugging 2- ...

    That way, if I'm working on or adding to the script, I can set $Debug to 4 and have a log of what's going on. (Note: This is my technique and not necessarily what others would do.)
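A minimal sketch of how such a $Debug level might gate the log output. debug_log is a hypothetical helper, and the log goes to an in-memory string here where the script above would use log.txt:

```perl
use strict;
use warnings;

my $Debug = 1;   ## 0-Clean 1-light debugging 2- ...

# Hypothetical helper: write the message only if $Debug is at least $level.
# Returns 1 if something was written, 0 otherwise.
sub debug_log {
    my ($level, $fh, @message) = @_;
    return 0 if $Debug < $level;
    print {$fh} "[debug $level] ", @message, "\n";
    return 1;
}

# In-memory log for the sketch; a real script would open a log file.
my $log = "";
open(my $LOG, ">", \$log) or die "Cannot open in-memory log: $!";

my $wrote_light   = debug_log(1, $LOG, "record 1 read");   # written at level 1
my $wrote_verbose = debug_log(4, $LOG, "raw line dump");   # skipped below level 4
close $LOG;

print $log;
```

Setting $Debug to 0 silences every call without touching the rest of the script, and raising it to 4 turns the verbose messages back on.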

    Good Luck!

    "Well done is better than well said." - Benjamin Franklin