
Somewhere along the way, your record separator changed from "EOU" to "\n\n\n". The first, presumably being unique in the data, was simple to test for. The latter is more difficult to detect, since you are processing the file one line at a time. I suppose you could keep a counter to see if three lines in a row contain nothing but a newline, but it would be easier to set the input record separator variable ($/) to "\n\n\n". Note the double quotes: in single quotes, '\n\n\n' is six literal characters (three backslashes and three n's), not three newlines. I've done that below.
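To see the effect of $/ in isolation, here is a minimal sketch. The sample text and the read_records helper are made up for illustration (they are not from your script), and it reads from an in-memory filehandle so it runs without a data file:

```perl
use strict;
use warnings;

# Hypothetical sample data: two records separated by "\n\n\n".
my $data = "first line\nsecond line\n\n\nthird line\n";

# Illustrative helper (an assumption, not part of the original script):
# read $text in chunks delimited by $sep and return the chunks.
sub read_records {
    my ($text, $sep) = @_;
    open my $fh, "<", \$text or die "Unable to open in-memory file: $!";
    local $/ = $sep;        # restored automatically when the sub returns
    my @records = <$fh>;
    close $fh;
    chomp @records;         # chomp strips the current $/ from each record
    return @records;
}

my @good = read_records($data, "\n\n\n");   # double quotes: three real newlines
my @bad  = read_records($data, '\n\n\n');   # single quotes: six literal characters

printf "double-quoted separator: %d record(s)\n", scalar @good;
printf "single-quoted separator: %d record(s)\n", scalar @bad;
```

With the single-quoted separator, the literal text never occurs in the data, so the whole input comes back as one record.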

Also, '(\s\s)' is a glaring string to have in your data! All of those characters have special meaning in a regular expression: /(\s\s)/ matches and captures two consecutive whitespace characters. So to match the string literally, you need to escape the parentheses and the backslashes. I've done that below too. (In my previous reply, I did think to myself that /\s\s/ was not a very specific expression, but I kept it because you had it.)
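A small sketch of the difference, using one line shaped like your data and one made-up line with no marker in it. The variable names are mine, for illustration only. \Q...\E (or quotemeta) is an alternative to escaping by hand:

```perl
use strict;
use warnings;

# One line shaped like the data, and one hypothetical line without the marker.
my $with_marker = 'Pulmonary embolism      (\s\s)1000 D0076131:PULMONARY EMBOLISM';
my $no_marker   = 'two  spaces but no marker';
my $marker      = '(\s\s)';   # the literal six-character string in the data

# Unescaped: a capture group around two whitespace characters.
my $unescaped_hits_plain = ($no_marker =~ /(\s\s)/) ? 1 : 0;

# Escaped by hand: matches only the literal string.
my $escaped_hits_plain  = ($no_marker   =~ /\(\\s\\s\)/) ? 1 : 0;
my $escaped_hits_marker = ($with_marker =~ /\(\\s\\s\)/) ? 1 : 0;

# \Q...\E escapes every metacharacter in the interpolated variable.
my $quoted_hits_plain  = ($no_marker   =~ /\Q$marker\E/) ? 1 : 0;
my $quoted_hits_marker = ($with_marker =~ /\Q$marker\E/) ? 1 : 0;

print "unescaped matches the plain line:  $unescaped_hits_plain\n";
print "escaped matches the plain line:    $escaped_hits_plain\n";
print "escaped matches the marker line:   $escaped_hits_marker\n";
print "\\Q..\\E matches the plain line:     $quoted_hits_plain\n";
print "\\Q..\\E matches the marker line:    $quoted_hits_marker\n";
```

The unescaped pattern gives a false positive on any line with two adjacent whitespace characters, which is why the escaped form is the safer test here.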

I think the following does what you want. Note two assumptions: 1) That you always want to print the $u_val, even if there isn't a following $p_val/$m_val pair. 2) That $m_val always follows $p_val.

use strict;
use warnings;

my $file = "z.txt";
open my $fh, "<", $file or die "Unable to open $file: $!";
my ($p_val, $m_val);
{ local $/ = "\n\n\n";              # double quotes: three real newlines
  while (<$fh>) {                   # read until three newlines
     foreach (split/\n/) {          # split it into individual lines
        if (s/\bProcessing\s\d+\.tx\.\d+: //) {
            print "$_\n";           # print the u_value immediately
        } elsif (s/\bPhrase: //) {
            s/"//g;                 # note: no need to escape "
            $p_val = $_;            # keep it
        } elsif (/\(\\s\\s\)/) {
            $m_val = $_;            # keep it
            if (defined $p_val) {
               print "\t$p_val\t$m_val\n";  # print if we have both
            }
            ($p_val, $m_val) = (undef)x2;
        }
     }
  }
}
close $fh;
__END__
Pulmonary embolism at the time of hip replacement
        Pulmonary embolism      (\s\s)1000 D0076131:PULMONARY EMBOLISM Disease or Syndrome
        of hip replacement      (\s\s)1000 D0554893:HIP REPLACEMENT (STATUS POST HIP REPLACEMENT) Finding

In reply to Re^3: tab delimited extraction, formatting the output by hbm
in thread tab delimited extraction, formatting the output by zzgulu
