Extract from text file

mocnii has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

My file looks something like this (repeated many times for different genes with indicated tabs):

//
SPECIES\tCiona intestinalis
DEV_STAGE\tEarly tailbud
PREDICTED_GENE\tci0100148006
GENE_NAME\tci0100148006\Ms4a4b\Ms4a4c\Ms4a4d\
ORIGINAL_ANNOTATION/COMMENTS\tConversion to Aniseed... 
AUTHORS\tSatou Y, Takatori N,...
REFERENCES\tDevelopment. 2001 ;128(15):2893-904
URL_OF_ORIGINAL_ANNOTATION\thttp://ghost.zool.kyoto-u.ac.jp/indexr1.html
ANISEED_ANNOTATION\tSTAINED_REGION:\thead endoderm\tSTAINED_MOL:\tci0100148006
ANISEED_ANNOTATION\tSTAINED_REGION:\ttail nerve cord\tSTAINED_MOL:\tci0100148006
IN_SITU_URL\thttp://aniseed-ibdm.univ-mrs.fr/insitu.php?id=2605647
//

and my code is here:

use strict;

my $ish = $ARGV[0];

open (ISH, "<", $ish) || die "$!";
open (OUT, ">", "out.txt") || die "$!";
{
$/ = "//";

while (<ISH>){

    m/DEV_STAGE\t(.*?)\n/g;
    my $a= $1;
    my $b;
    if (m/PREDICTED_GENE\t(.*?)\n/g){
    
    $b= $1;
}
    else{
        $b = '#';
    }
    
    while (m/\tSTAINED_REGION:\t(.*?)\tSTAINED_MOL:\t(.*?)\n/g) {
    
    my $c = $1;
    my $x = $2;
    print OUT "$b\t$a\t$c\t$x\n";

}
}    
}

The problem is that I only get the last two columns (STAINED_REGION and STAINED_MOL); the first two columns (PREDICTED_GENE and DEV_STAGE) are missing. Could you point out what I'm doing wrong?

Found code that works :)

use strict;

my $ish = $ARGV[0];

open (ISH, "<", $ish) || die "$!";
open (OUT, ">", "out.txt") || die "$!";
{
$/ = "//\n";

while (<ISH>){

  m/DEV_STAGE\t(.*?)\n/g;
  my $a= $1;
  my $b;
  if (m/PREDICTED_GENE\t(.*?)\n/g){
  
  $b= $1;
}
  else{
    $b = '#';
  }
  
  while (m/\tSTAINED_REGION:\t(.*?)\tSTAINED_MOL:\t(.*?)\n/g) {
  
  my $c = $1;
  my $x = $2;
  print OUT "$b\t$a\t$c\t$x\n";

}
}
}

Re: Extract from text file by trizen (Hermit) on Oct 12, 2011 at 17:43 UTC
The problem is the http links inside each block. Your records run from // to //, but 'http://' also contains a double slash, so every URL ends the record prematurely. The fix is to change $/ to: `$/ = "//\n";`
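
To see the effect described above in isolation, here is a minimal sketch that reads the same sample text with both separators. The helper name `read_records` and the shortened sample record are made up for illustration, and the sample assumes real tab characters rather than a literal backslash followed by t.

```perl
use strict;
use warnings;

# Hypothetical helper: read $text through an in-memory filehandle and
# return the non-empty records produced by the given record separator.
sub read_records {
    my ($text, $separator) = @_;
    local $/ = $separator;    # restored automatically when the sub returns
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    my @records;
    while (my $record = <$fh>) {
        chomp $record;        # strips the trailing separator, if present
        push @records, $record if $record =~ /\S/;
    }
    close $fh;
    return @records;
}

# Shortened sample record (assumption: fields are separated by real tabs).
my $sample = join '',
    "//\n",
    "DEV_STAGE\tEarly tailbud\n",
    "PREDICTED_GENE\tci0100148006\n",
    "URL_OF_ORIGINAL_ANNOTATION\thttp://ghost.zool.kyoto-u.ac.jp/indexr1.html\n",
    "ANISEED_ANNOTATION\tSTAINED_REGION:\thead endoderm\tSTAINED_MOL:\tci0100148006\n",
    "//\n";

my @loose  = read_records($sample, "//");      # also splits inside "http://"
my @strict = read_records($sample, "//\n");    # splits only on a "//" line end

printf "separator //   gives %d fragment(s)\n", scalar @loose;
printf "separator //\\n gives %d record(s)\n",  scalar @strict;
```

With this sample the bare `//` separator yields two fragments: the first holds DEV_STAGE and PREDICTED_GENE, the second holds the STAINED_REGION line. That matches the symptom in the question, where the stained columns were printed without the gene and stage. With `"//\n"` the whole block arrives as one record.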
Re: Extract from text file by pvaldes (Chaplain) on Oct 12, 2011 at 18:45 UTC
Maybe you are overwriting $1; your loop seems strange to me:

    while (<ISH>){
        my $a;
        my $b = '#';
        my $c;
        my $x;
        if (m/DEV_STAGE\t(.*?)\n/g) {
            $a = $1;
            print OUT "a is: ", $a
        }
        elsif (m/PREDICTED_GENE\t(.*?)\n/g) {
            $b = $1;
            print OUT "b is: ", $b
        }
        elsif (m/\tSTAINED_REGION:\t(.*?)\tSTAINED_MOL:\t(.*?)\n/g) {
            $c = $1;
            print OUT "c is: ", $c;
            $x = $2;
            print OUT "x is: ", $x;
        }
        else {}
    }

but your real problem is probably this: `open (OUT, ">", "out.txt")`

You make a first pass through the loop and print to OUT. Then you examine the second line, make a second pass and OVERWRITE the output file with the current values; $a is lost now. Then you make a third pass and OVERWRITE AGAIN; $b is lost... etc.

Solution: use '>>' instead of '>' and put each print inside its own if-elsif block.
Re: Extract from text file by flexvault (Monsignor) on Oct 12, 2011 at 19:48 UTC
I started this before the other PMs gave an answer, so this is just another way to do it!

    use strict;
    use warnings;

    my $ish = $ARGV[0];
    if ( ! defined $ish ) { die "1: $!"; }    ## It helps to know which die?

    open (ISH, "<", $ish) || die "2: $!";
    open (OUT, ">", "out.txt") || die "3: $!";
    open (my $LOG, ">", "log.txt") || die "4: $!";

    # $/ = "//";
    my $no = 0;
    our %hash = ();
    while (<ISH>) {
        my $var = $_; $no++;
        print $LOG "$no\t$var";    ## This is to help know that you are reading and what it is
        chomp($var);
        if ( $var eq "//" ) {
            if ( %hash ) {
                my ( $a, $b, $c, $x ) = Process_Hash();
                ## You could do more work on %hash here or move the sub here
                print OUT "$b\t$a\t$c\t$x\n";
            }
            %hash = ();    ## Clear %hash for next sequence
            next;
        }
        ## Note: If your data contains real tabs (0x09), then make the '\\' a '\'
        my ( $key, $value ) = split(/\\t/,$var );
        # print $LOG "\t$var => |$key|$value|\n";
        $hash{ $key } = $value;
    }
    if ( %hash ) {
        my ( $a, $b, $c, $x ) = Process_Hash();
        ## You could do more work on %hash here or move the sub here
        print OUT "$b\t$a\t$c\t$x\n";
    }
    exit;    # optional but I like to see it and be able to search on.

    ## Process the element of %hash
    sub Process_Hash {
        my ( $a, $b, $c, $x ) = ( "", "#", "for-you", "for-you" );
        if ( defined $hash{"DEV_STAGE"} ) { $a = $hash{"DEV_STAGE"}; }
        if ( defined $hash{"PREDICTED_GENE"} ) { $b = $hash{"PREDICTED_GENE"}; }
        # while (m/\tSTAINED_REGION:\\t(.*?)\tSTAINED_MOL:\t(.*?)\n/g)
        return( $a, $b, $c, $x );
    }
    1;

If you start using hashes, you'll find that they save you a lot of code and help solve some very difficult problems. But there is no right or wrong, as long as it works correctly.

Also, I left in the script the log that helped me figure out what you were trying to do. Using a log will help you generate solid code. Delete it or comment it out when the script works the way you want. I usually declare a '$Debug' variable with a debug value. Once it works, I just declare `my $Debug = 0; ## 0-Clean 1-light debugging 2- ...`

That way, if I am working on or adding to the script, I can set $Debug to 4 and have a log of what's going on. (Note: This is my technique and not necessarily what others would do.)

Good Luck!

"Well done is better than well said." - Benjamin Franklin
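
The hash approach above leaves the two stained columns as "for-you" placeholders. One possible way to fill them in is sketched below, not as the poster's method but as an illustration: since ANISEED_ANNOTATION can repeat within a record, its values are collected in an array instead of a single hash slot. The names `parse_record`, `rows_for`, and the `stains` key are hypothetical, the data is assumed to contain real tab characters, and the `//` defined-or operator needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one record (the text between two "//" lines)
# into a hash reference. Assumes fields are separated by real tabs.
sub parse_record {
    my ($record) = @_;
    my %gene = ( stains => [] );
    for my $line ( split /\n/, $record ) {
        my ( $key, $value ) = split /\t/, $line, 2;
        next unless defined $value;    # skip lines without a tab
        if ( $key eq 'ANISEED_ANNOTATION' ) {
            # This key repeats, so keep every region/molecule pair.
            if ( $value =~ /^STAINED_REGION:\t(.*?)\tSTAINED_MOL:\t(.*)$/ ) {
                push @{ $gene{stains} }, [ $1, $2 ];
            }
        }
        else {
            $gene{$key} = $value;
        }
    }
    return \%gene;
}

# Hypothetical helper: one output row per stained region, in the column
# order used in the question (gene, stage, region, molecule).
sub rows_for {
    my ($gene) = @_;
    my $id    = $gene->{PREDICTED_GENE} // '#';
    my $stage = $gene->{DEV_STAGE}      // '';
    return map { join "\t", $id, $stage, @$_ } @{ $gene->{stains} };
}

my $record = join "\n",
    "DEV_STAGE\tEarly tailbud",
    "PREDICTED_GENE\tci0100148006",
    "ANISEED_ANNOTATION\tSTAINED_REGION:\thead endoderm\tSTAINED_MOL:\tci0100148006",
    "ANISEED_ANNOTATION\tSTAINED_REGION:\ttail nerve cord\tSTAINED_MOL:\tci0100148006";

my @rows = rows_for( parse_record($record) );
print "$_\n" for @rows;
```

Each record is passed in as a single string, so this pairs naturally with reading the file using `$/ = "//\n"` and chomping each record before calling `parse_record`.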