Re: tab delimited extraction, formatting the output

zzgulu, I'm very happy to continue helping. Two things, though: 1) This thread has gotten long enough, and pushed so far to the right, that I'm responding to your original message. 2) It seems your current code is not in sync with your original data example, so it would be helpful to re-post both.

To your last question, how to print one value only if you have another value: assemble a record as you process the lines. When you reach the end-of-record marker, print the record if it has all of its elements, then clear it so the next record starts empty.

I mocked up the following, but again, it doesn't work with your data sample. Either the data or the expressions need to change.

use strict;
use warnings;

my $file = "z.txt";
open my $fh, "<", $file or die "Unable to open $file: $!";
my %record;
while (<$fh>) {
    chomp;
    if (/EOU/){
        if (exists $record{'u_val'} &&
            exists $record{'p_val'} &&
            exists $record{'m_val'} )
        {
           print "$record{'u_val'}\n",
                 "\t$record{'p_val'}",
                 "\t$record{'m_val'}\n";
        }
        %record = ();
    } elsif (s/\bProcessing\s\d+\.tx\.\d+: //) {
        $record{'u_val'} = $_;
    } elsif (s/\bPhrase: //) {
        s/\"//g;
        $record{'p_val'} = $_;
    } elsif (/\s\s/) {
        $record{'m_val'} = $_;
    }
}
close $fh;
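
For anyone who wants to try the record-assembly idea without the original file, here is a minimal sketch that runs on a few made-up lines. The sample data, the `complete_records` name, and the anchored `/^EOU$/` and `/^\s\s/` patterns are assumptions for illustration only; they are not the real input format.

```perl
use strict;
use warnings;

# Hypothetical sample: the second record has no mapping line.
my @lines = (
    'Processing 00000000.tx.1: Pulmonary embolism',
    'Phrase: "Pulmonary embolism"',
    '  1000 PULMONARY EMBOLISM',
    'EOU',
    'Processing 00000000.tx.2: at the time',
    'Phrase: "at the time"',
    'EOU',
);

# Build one record at a time and return only the complete ones.
sub complete_records {
    my @input = @_;              # copy, because s/// below edits in place
    my (%record, @out);
    for (@input) {
        if (/^EOU$/) {
            # Keep the record only if all three fields were seen.
            if (3 == grep { exists $record{$_} } qw(u_val p_val m_val)) {
                push @out, "$record{u_val}\t$record{p_val}\t$record{m_val}";
            }
            %record = ();        # the next record starts empty
        } elsif (s/\bProcessing\s\d+\.tx\.\d+: //) {
            $record{u_val} = $_;
        } elsif (s/\bPhrase: //) {
            s/"//g;
            $record{p_val} = $_;
        } elsif (/^\s\s/) {
            $record{m_val} = $_;
        }
    }
    return @out;
}

print "$_\n" for complete_records(@lines);
```

Only the first record is printed here, because the second never gets an `m_val` before its end-of-record marker.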

Re^2: tab delimited extraction, formatting the output by zzgulu (Novice) on Feb 16, 2009 at 18:15 UTC
Thank you hbm for your great tips. Here is the code and output that I get for a sample record. My question was how can I eliminate the second phrase ("at the time") from the output since there is no mapping available for that phrase.

#!/usr/bin/perl
use strict;
use warnings;

my $file = "regular.txt.out";
open my $fh, "<", $file or die "Unable to open $file: $!";
my($u_value, $p_value, $mc_value) = (undef) x 3;
while (my $line=<$fh>) {
    chomp $line;
    if ($line=~/\n\n\n/){
        ($u_value, $p_value, $mc_value) = (undef) x 3;
        print "\n";
    } elsif ($line=~/\bProcessing\s/) {
        $line=~s/\bProcessing\s\d+\.tx\.\d+: //;
        $u_value = $line;
        print "\n$u_value\n";
        undef $p_value;
    } elsif ($line=~/\bPhrase/) {
        $line=~s/\bPhrase: //;
        $line=~s/\"//g;
        if ($p_value) { print "\n" . ' ' x length $u_value;}
        $p_value=$line;
        print "\t$p_value";
        undef $mc_value;
    } elsif ($line=~/\s\s/ ) {
        if ($mc_value) { print "\n\t" . ' ' x length $p_value;}
        $mc_value=$line;
        print "\t$mc_value";
    } else {
    }
}
close $fh;

Sample text to process:

Processing 00000000.tx.3: Pulmonary embolism at the time of hip replacement
Phrase: "Pulmonary embolism"
Meta Mapping (1000)
(\s\s)1000 D0076131:PULMONARY EMBOLISM Disease or Syndrome
Phrase: "at the time"
Meta Candidates (0): <none>
Meta Mappings: <none>
Phrase: "of hip replacement"
Meta Mapping (1000)
(\s\s)1000 D0554893:HIP REPLACEMENT (STATUS POST HIP REPLACEMENT) Finding

Output of the processed block is:

Pulmonary embolism at the time of hip replacement
<tab>Pulmonary embolism<tab>1000 D0076131:PULMONARY EMBOLISM Disease or Syndrome
<tab>at the time
<tab>of hip replacement<tab>1000 D0554893:HIP REPLACEMENT (STATUS POST HIP REPLACEMENT) Finding
Re^3: tab delimited extraction, formatting the output by hbm (Hermit) on Feb 16, 2009 at 21:34 UTC
Somewhere along the way, your record separator changed from "EOU" to "\n\n\n". The first, presumably being unique in the data, was simple to test for. The latter is more difficult since you are processing the file one line at a time. I suppose you could keep a counter to see if three lines in a row contain nothing but a newline; but it would be easier to change the record-separator variable (`$/`) to "\n\n\n". I've done that below.

Also, '(\s\s)' is a glaring string to have in your data! You do know that all those characters have special meaning in a regular expression: `/(\s\s)/` matches and captures two whitespace characters. So to match them literally, you need to escape them. I've done that below too. (In my previous reply, I did think to myself that `/\s\s/` was not a very definitive expression, but I kept it because you had it.)

I think the following does what you want. Note two assumptions: 1) That you always want to print the $u_val, even if there isn't a following $p_val/$m_val pair. 2) That $m_val always follows $p_val.

use strict;
use warnings;

my $file = "z.txt";
open my $fh, "<", $file or die "Unable to open $file: $!";
my ($p_val, $m_val);
{
    local $/ = "\n\n\n";              # double quotes, so \n means newline
    while (<$fh>) {                   # read until three newlines
        foreach (split/\n/) {         # split it into individual lines
            if (s/\bProcessing\s\d+\.tx\.\d+: //) {
                print "$_\n";         # print the u_value immediately
            } elsif (s/\bPhrase: //) {
                s/"//g;               # note: no need to escape "
                $p_val = $_;          # keep it
            } elsif (/\(\\s\\s\)/) {  # a literal "(\s\s)"
                $m_val = $_;          # keep it
                if (defined $p_val) {
                    print "\t$p_val\t$m_val\n";   # print if we have both
                }
                ($p_val, $m_val) = (undef)x2;
            }
        }
    }
}
close $fh;
__END__
Pulmonary embolism at the time of hip replacement
	Pulmonary embolism	(\s\s)1000 D0076131:PULMONARY EMBOLISM Disease or Syndrome
	of hip replacement	(\s\s)1000 D0554893:HIP REPLACEMENT (STATUS POST HIP REPLACEMENT) Finding
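
The two ideas in this reply, a multi-character record separator and matching `(\s\s)` literally, can be tried in isolation. The following is only a sketch: the in-memory data, the `read_records` helper, and the use of `quotemeta` to build the escaped pattern are illustrative choices, not part of the original script. Note that `$/` needs a double-quoted string for `\n` to mean a newline; in single quotes it would be a literal backslash followed by `n`.

```perl
use strict;
use warnings;

# Hypothetical data: two records separated by two blank lines.
my $data = "Phrase: \"a\"\n(\\s\\s)1000 X\n\n\nPhrase: \"b\"\n(\\s\\s)900 Y\n";

# Read whole records from a string through an in-memory filehandle.
sub read_records {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
    local $/ = "\n\n\n";       # three real newlines end a record
    my @records = <$fh>;
    close $fh;
    chomp @records;            # chomp strips the current $/ from each record
    return @records;
}

my @records = read_records($data);

# quotemeta backslash-escapes every non-word character, so the
# parentheses and backslashes are matched literally.
my $marker = quotemeta '(\s\s)';

for my $rec (@records) {
    my @mapped = grep { /^$marker/ } split /\n/, $rec;
    print "$_\n" for @mapped;
}
```

Writing the pattern by hand as `/\(\\s\\s\)/` is equivalent; `quotemeta` just avoids counting backslashes.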
Re^4: tab delimited extraction, formatting the output by zzgulu (Novice) on Feb 16, 2009 at 23:15 UTC
Thank you hbm. Yes, my record has changed. I started with a tab-delimited file with an "EOF." record separator, but it didn't work out, since the script processed only up to 14,000 lines and then exited without any error. Plus, I was missing one of the components in my original text. So I started working on another output format from the same text. In the new format, records are separated by two blank lines, so I thought I would use \n\n\n as the separator.

For some reason, when I tried to post my question, spaces were being eliminated in my post, so by typing (\s\s) I was trying to indicate that in my real record there are two spaces before my mapping starts. Sorry, my bad!

Thank you again for all your help and explanation. Is there any book you recommend for a Perl beginner like me to start with? I started with "Perl Programming for Medicine and Biology", but I think some basic (and fundamental) concepts are not covered in that book.
Re^5: tab delimited extraction, formatting the output by Anonymous Monk on Feb 17, 2009 at 14:00 UTC
Re^6: tab delimited extraction, formatting the output by hbm (Hermit) on Feb 17, 2009 at 16:58 UTC
Re^5: tab delimited extraction, formatting the output by hbm (Hermit) on Feb 17, 2009 at 14:20 UTC