Re^2: tab delimited extraction, formatting the output

Thank you, hbm, for your great tips. Here is the code and the output that I get for a sample record. My question is how I can eliminate the second phrase ("at the time") from the output, since no mapping is available for that phrase.

#!/usr/bin/perl
use strict;
use warnings;
 
my $file = "regular.txt.out";
open my $fh, "<", $file or die "Unable to open $file: $!";
my($u_value, $p_value, $mc_value) = (undef) x 3;
 
while (my $line = <$fh>) {
    chomp $line;
    if ($line =~ /\n\n\n/) {
        ($u_value, $p_value, $mc_value) = (undef) x 3;

        print "\n";
    } elsif ($line =~ /\bProcessing\s/) {
        $line =~ s/\bProcessing\s\d+\.tx\.\d+: //;
        $u_value = $line;
        print "\n$u_value\n";
        undef $p_value;
    } elsif ($line =~ /\bPhrase/) {
        $line =~ s/\bPhrase: //;
        $line =~ s/\"//g;
        if ($p_value) {
            print "\n" . ' ' x length $u_value;
        }
        $p_value = $line;
        print "\t$p_value";
        undef $mc_value;
    } elsif ($line =~ /\s\s/) {
        if ($mc_value) {
            print "\n\t" . ' ' x length $p_value;
        }
        $mc_value = $line;
        print "\t$mc_value";
    }
}
close $fh;

Sample text to process:

Processing 00000000.tx.3: Pulmonary embolism at the time of hip replacement

Phrase: "Pulmonary embolism"
Meta Mapping (1000)
(\s\s)1000 D0076131:PULMONARY EMBOLISM Disease or Syndrome

Phrase: "at the time"
Meta Candidates (0): <none>
Meta Mappings: <none>

Phrase: "of hip replacement"
Meta Mapping (1000)
(\s\s)1000 D0554893:HIP REPLACEMENT (STATUS POST HIP REPLACEMENT) Finding

Output for the processed block:

Pulmonary embolism at the time of hip replacement
<tab>Pulmonary embolism<tab>1000 D0076131:PULMONARY EMBOLISM Disease or Syndrome
<tab>at the time
<tab>of hip replacement<tab>1000 D0554893:HIP REPLACEMENT (STATUS POST HIP REPLACEMENT) Finding

Re^3: tab delimited extraction, formatting the output by hbm (Hermit) on Feb 16, 2009 at 21:34 UTC
Somewhere along the way, your record separator changed from "EOU" to "\n\n\n". The first, presumably being unique in the data, was simple to test for. The latter is more difficult because you are processing the file one line at a time, so a single line can never contain three newlines. I suppose you could keep a counter to see whether three lines in a row contain nothing but a newline, but it would be easier to change the record-separator variable (`$/`) to "\n\n\n". I've done that below.

Also, '(\s\s)' is a glaring string to have in your data! You do know that all those characters have special meaning in a regular expression: `/(\s\s)/` matches and captures two whitespace characters. So to match them literally, you need to escape them. I've done that below too. (In my previous reply, I did think to myself that `/\s\s/` was not a very definitive expression, but I kept it because you had it.)

I think the following does what you want. Note two assumptions:

1) That you always want to print the $u_val, even if there isn't a following $p_val/$m_val pair.
2) That $m_val always follows $p_val.

use strict;
use warnings;

my $file = "z.txt";
open my $fh, "<", $file or die "Unable to open $file: $!";
my ($p_val, $m_val);

{
    local $/ = "\n\n\n";               # double quotes, so these are real newlines
    while (<$fh>) {                    # read until three newlines
        foreach (split /\n/) {         # split it into individual lines
            if (s/\bProcessing\s\d+\.tx\.\d+: //) {
                print "$_\n";          # print the u_value immediately
            } elsif (s/\bPhrase: //) {
                s/"//g;                # note: no need to escape "
                $p_val = $_;           # keep it
            } elsif (/\(\\s\\s\)/) {   # literal "(\s\s)", metacharacters escaped
                $m_val = $_;           # keep it
                if (defined $p_val) {
                    print "\t$p_val\t$m_val\n";    # print if we have both
                }
                ($p_val, $m_val) = (undef) x 2;
            }
        }
    }
}
close $fh;
__END__
Pulmonary embolism at the time of hip replacement
    Pulmonary embolism    (\s\s)1000 D0076131:PULMONARY EMBOLISM Disease or Syndrome
    of hip replacement    (\s\s)1000 D0554893:HIP REPLACEMENT (STATUS POST HIP REPLACEMENT) Finding
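For anyone who wants to try the two ideas in this reply in isolation, here is a minimal sketch. It assumes records are separated by two blank lines; `read_records` is a hypothetical helper name, and the sample text is made up for illustration. It reads from an in-memory filehandle so no input file is needed, and it uses `\Q...\E` (the inline form of `quotemeta`) as an alternative to escaping each metacharacter by hand.

```perl
use strict;
use warnings;

# Hypothetical helper: split a string into records separated by two blank lines.
sub read_records {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
    local $/ = "\n\n\n";    # must be double-quoted; '\n\n\n' would be six literal characters
    my @records;
    while (my $record = <$fh>) {
        chomp $record;      # chomp strips the current value of $/, not just one newline
        push @records, $record if length $record;
    }
    close $fh;
    return @records;
}

# Made-up sample input: two records separated by two blank lines.
my $text = "Processing 1.tx.1: first\nPhrase: \"a\"\n\n\nProcessing 2.tx.1: second\n";
my @records = read_records($text);
print scalar(@records), " records\n";

# \Q...\E escapes every regex metacharacter in the interpolated string.
my $marker = '(\s\s)';
my $line   = '(\s\s)1000 D0076131:PULMONARY EMBOLISM Disease or Syndrome';
print "literal match\n"     if     $line =~ /^\Q$marker\E/;
print "no unescaped match\n" unless $line =~ /^$marker/;
```

Without `\Q...\E`, the interpolated pattern is `/^(\s\s)/`, which looks for two whitespace characters at the start of the line rather than the literal text.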
Re^4: tab delimited extraction, formatting the output by zzgulu (Novice) on Feb 16, 2009 at 23:15 UTC
Thank you, hbm. Yes, my record format has changed. I started with a tab-delimited file with "EOF." as the record separator, but it didn't work out, since the script processed only up to 14,000 lines and then exited without any error. Plus, I was missing one of the components in my original text. So I started working on another output format from the same text. In the new format, records are separated by two blank lines, so I thought to use \n\n\n as the separator.

For some reason, when I tried to post my question, the spaces were being eliminated from my post, so by typing (\s\s) I was trying to indicate that in my real records there are two spaces before each mapping starts. Sorry, my bad!

Thank you again for all your help and explanation. Is there any book you recommend for a Perl beginner like me to start with? I started with "Perl Programming for Medicine and Biology", but I think some basic (and fundamental) concepts are not covered in that book.
Re^5: tab delimited extraction, formatting the output by Anonymous Monk on Feb 17, 2009 at 14:00 UTC
It never ends! I was validating the data and noticed that there are occasions where more than one mapping exists, as below. I thought I could simply add /g to (/\s\s/), but apparently that doesn't do the job. Any suggestions?

Phrase: "of hemorrhage"
Meta Mapping (1000)
<2 spaces>1000 D0046004:HAEMORRHAGE (BLEEDING) Finding
Meta Mapping (1000)
<2 spaces>1000 D0046011:HAEMORRHAGE NOT OTHERWISE SPECIFIED (HEMORRHAGE NOT OTHERWISE SPECIFIED) Finding
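One possible way to handle several mappings per phrase is sketched below; it is not taken from any reply in this thread. The `/g` modifier only repeats a match within a single string, whereas here each mapping sits on its own line, and the loop in Re^3 resets $p_val after the first mapping, so the second mapping finds no phrase to pair with. The sketch keeps the current phrase until the next "Phrase:" line and emits one output line per mapping. It assumes mapping lines start with exactly two spaces in the real data; `format_record` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one record (a multi-line string) into output lines.
sub format_record {
    my ($record) = @_;
    my @out;
    my $phrase;
    for my $line (split /\n/, $record) {
        if ($line =~ s/^Processing\s\d+\.tx\.\d+: //) {
            push @out, $line;                   # the utterance itself
        } elsif ($line =~ /^Phrase: "(.*)"/) {
            $phrase = $1;                       # remember it until the next phrase
        } elsif (defined $phrase and $line =~ /^\s{2}(\S.*)/) {
            push @out, "\t$phrase\t$1";         # one line per mapping
        }
    }
    return @out;
}

# Sample built from lines quoted in this thread (two leading spaces assumed).
my $record = join "\n",
    'Phrase: "at the time"',
    'Meta Candidates (0): <none>',
    'Meta Mappings: <none>',
    '',
    'Phrase: "of hemorrhage"',
    'Meta Mapping (1000)',
    '  1000 D0046004:HAEMORRHAGE (BLEEDING) Finding',
    'Meta Mapping (1000)',
    '  1000 D0046011:HAEMORRHAGE NOT OTHERWISE SPECIFIED (HEMORRHAGE NOT OTHERWISE SPECIFIED) Finding';

my @lines = format_record($record);
print "$_\n" for @lines;
```

Because a phrase is printed only alongside a mapping, phrases such as "at the time" that have no mapping produce no output at all.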
Re^6: tab delimited extraction, formatting the output by hbm (Hermit) on Feb 17, 2009 at 16:58 UTC
Re^7: tab delimited extraction, formatting the output by zzgulu (Novice) on Feb 17, 2009 at 18:25 UTC
Re^5: tab delimited extraction, formatting the output by hbm (Hermit) on Feb 17, 2009 at 14:20 UTC
I do recommend O'Reilly's Perl Cookbook. I have the second edition, and its chapter titles include: Strings; Numbers; Dates and Times; Arrays; Hashes; Pattern Matching; File Access; Subroutines; References and Records; Database Access; Process Management; CGI Programming; and XML. It provides many examples and has a robust 30-page index, and certainly covers all the questions you've raised in this thread.