tab delimited extraction, formatting the output

zzgulu has asked for the wisdom of the Perl Monks concerning the following question:

I do appreciate your comments/help on this
I have a table that is multi line tab delimitted with records separate by 'EOU'. like below
I want to extract utterance(u),related phrase(s)(p), and related mapped concept(s)(mc)

00000000.tx.1 1 u CHIEF COMPLAINT:
00000000.tx.1 2 p CHIEF COMPLAINT
00000000.tx.2 1 u Mental status change over the last 5 days.
00000000.tx.2 2 p Mental status change
00000000.tx.2 3 c 1 1
00000000.tx.2 4 m 1 1 1000
00000000.tx.2 5 mc 1 1 1 1 1000 MENTAL STATUS CHANGE ALTERED MENTAL STATUS mental,status,change mobd
00000000.tx.2 6 p over the last 5 days
00000000.tx.2 7 c 0 0
00000000.tx.2 8 m 0 0
'EOU'.

my questions are
1. using local $/="'EOU'." in order to introduce my block generates incorrect ouput. How should I use this correctly?
2. The output of my code currently generates

CHIEF COMPLAINT:
CHIEF COMPLAINT

Mental status change over the last 5 days.
Mental status change
over the last 5 days

how can I change the output to generate this format?

CHIEF COMPLAINT: <tab> CHIEF COMPLAINT

Mental status change over the last 5 days. <tab> Mental staus change
................................................................<tab>over the last 5 days

in other words how can I generate an output like this
u<tab>p
and if there are multiple phrases per utterence I get multiple line of phrases

u<tab>p
...........p
...........p

3.How can I add another joined "aggregated fields" infront of "p" if an "mc" field is present.
I mean by adding this line to the code

$field3=join("\t",@data[7,8,9,10,11])
[download]

I could get an output like this
u<tab>p
...........p
...........p<tab>mc

Thanks again for your help
code

#!/usr/bin/perl<br>
$file = "fielded.txt";
open IN, "<", $file  or die "Unable to open $file: $!";

while($line=<IN>)
{
   @data=split(/\t/,$line);
   if ($data[2] eq "u")
   {$field1=$data[3]; print "\n$field1";}
   if ($data[2] eq "p")
   {$field2=$data[3]; print "$field2\n";}
}
close IN;
[download]

Comment on tab delimited extraction, formatting the output Select or Download Code

Replies are listed 'Best First'.
Re: tab delimited extraction, formatting the output by kennethk (Abbot) on Feb 09, 2009 at 19:14 UTC
First and foremost, you'll probably make your life easier in the long run if you open your code with `use strict;use warnings` - it'll catch a host of accidental mistakes. If you are working with tab-delimited fields, it's probably easier to use an existing parser like Text::CSV than rolling your own. Since the end-of-record markers are much rarer than tabs and new lines, you could catch that up front. And if you are concerned with formatting, it's probably easier to use sprintf to get things looking nice. Here's some basic code that (mostly) replicates what you've done, to help you with the CSV prototype: #!/usr/bin/perl use strict; use warnings; use Text::CSV; my $file = "fielded.txt"; my $csv = Text::CSV->new({sep_char => "\t"}); # create a new objec +t open my $fh, "<", $file or die "Unable to open $file: $!"; while (my $data_ref = $csv->getline($fh)) { my @data = @{$data_ref}; if ($data[0] eq "'EOU'.") { # End of record code } elsif ($data[2] eq "u") { print "\n$data[3]" } elsif ($data[2] eq "p") { print "$data[3]\n" } else { #die "Unexpected line format encountered, $file, @data"; } } close $fh; [download] Update: I should point out you've made great strides since block extraction - congratulations.	[reply] [d/l] [select]
Re^2: tab delimited extraction, formatting the output by zzgulu (Novice) on Feb 09, 2009 at 21:21 UTC
Thank you kenneth for your great comment and direction. Apprentaly I didn't have CSV package so I learned how to download and install packages too. I am still looking into your script to understand the logic. In the mean time, I added this line at the end but it messed up the output that I was quite happy with that. `elsif ($data[2] eq "mc") { print join("\t",@data[7,8,9,10,11])` [download] how can I make the output of join to appear exctly in front of its related phrase (p). I also looked at sprintf link you sent, couldn't find anything relevant to what I want to do. Thanks again	[reply] [d/l]
Re^3: tab delimited extraction, formatting the output by kennethk (Abbot) on Feb 09, 2009 at 22:04 UTC
Note that in order to get the formatting, you need to cache the previous string in order to determine the indentation. #!/usr/bin/perl use strict; use warnings; use Text::CSV; #my $file = "fielded.txt"; my $csv = Text::CSV->new({sep_char => "\t"}); # create a new objec +t open my $fh, "<", $file or die "Unable to open $file: $!"; my($u_value, $p_value, $mc_value) = (undef) x 3; while (my $data_ref = $csv->getline($fh)) { my @data = @{$data_ref}; if ($data[0] eq "'EOU'.") { ($u_value, $p_value, $mc_value) = (undef) x 3; print "\n"; } elsif ($data[2] eq "u") { $u_value = $data[3]; print "\n$u_value"; undef $p_value; } elsif ($data[2] eq "p") { if ($p_value) { print "\n" . ' ' x length $u_value; } $p_value = $data[3]; print "\t$p_value"; undef $mc_value; } elsif ($data[2] eq "mc") { if ($mc_value) { print "\n" . ' ' x length $p_value; } $mc_value = join("\t",@data[7 .. 11]); print "\t$mc_value"; } else { #die "Unexpected line format encountered, $file, @data"; } } close $fh; [download]	[reply] [d/l]
Re^4: tab delimited extraction, formatting the output by Anonymous Monk on Feb 09, 2009 at 23:20 UTC
Re^5: tab delimited extraction, formatting the output by kennethk (Abbot) on Feb 09, 2009 at 23:28 UTC
Some notes below your chosen depth have not been shown here
Re: tab delimited extraction, formatting the output by hbm (Hermit) on Feb 14, 2009 at 03:32 UTC
zzgulu, I'm very happy to continue helping. Two things, though: 1) This thread has gotten long enough and pushed so far to the right that I'm responding to your original message. 2) It seems your current code is not in sync with your original data example, so it would be helpful to re-post them. To your last question, how to print one value only if you have another value: You should assemble a record as you process the lines; then undef the record when you get to the end-of-record marker; perhaps printing the record first if it has all the elements. I mocked up the following, but again, it doesn't work with your data sample. Either the data or the expressions need to change. use strict; use warnings; my $file = "z.txt"; open my $fh, "<", $file or die "Unable to open $file: $!"; my %record; while (<$fh>) { chomp; if (/EOU/){ if (exists $record{'u_val'} && exists $record{'p_val'} && exists $record{'m_val'} ) { print "$record{'u_val'}\n", "\t$record{'p_val'}", "\t$record{'m_val'}\n"; } %record = (); } elsif (s/\bProcessing\s\d+\.tx\.\d+: //) { $record{'u_val'} = $_; } elsif (s/\bPhrase: //) { s/\"//g; $record{'p_val'} = $_; } elsif (/\s\s/) { $record{'m_val'} = $_; } } close $fh; [download]	[reply] [d/l]
Re^2: tab delimited extraction, formatting the output by zzgulu (Novice) on Feb 16, 2009 at 18:15 UTC
Thank you hbm for your great tips. Here is the code and output that I get for a sample record. My question was how can I eliminate the second phrase ("at the time") from the output since there is no mapping available for that phrase. #!/usr/bin/perl use strict; use warnings; my $file = "regular.txt.out"; open my $fh, "<", $file or die "Unable to open $file: $!"; my($u_value, $p_value, $mc_value) = (undef) x 3; while (my $line=<$fh>) { chomp $line; if ($line=~/\n\n\n/){ ($u_value, $p_value, $mc_value) = (undef) x 3; print "\n"; } elsif ($line=~/\bProcessing\s/) { $line=~s/\bProcessing\s\d+\.tx\.\d+: //; $u_value = $line; print "\n$u_value\n"; undef $p_value; } elsif ($line=~/\bPhrase/) { $line=~s/\bPhrase: //; $line=~s/\"//g; if ($p_value) { print "\n" . ' ' x length $u_value;} $p_value=$line; print "\t$p_value"; undef $mc_value; } elsif ($line=~/\s\s/ ) { if ($mc_value) { print "\n\t" . ' ' x length $p_value;} $mc_value=$line; print "\t$mc_value"; } else { } } close $fh; [download] sample text for process Processing 00000000.tx.3: Pulmonary embolism at the time of hip replacement Phrase: "Pulmonary embolism" Meta Mapping (1000) (\s\s)1000 D0076131:PULMONARY EMBOLISM Disease or Syndrome Phrase: "at the time" Meta Candidates (0): <none> Meta Mappings: <none> Phrase: "of hip replacement" Meta Mapping (1000) (\s\s)1000 D0554893:HIP REPLACEMENT (STATUS POST HIP REPLACEMENT) Finding output of the processd block is: Pulmonary embolism at the time of hip replacement <tab>Pulmonary embolism<tab>1000 D0076131:PULMONARY EMBOLISM Disease or Syndrome <tab>at the time <tab>of hip replacement<tab>1000 D0554893:HIP REPLACEMENT (STATUS POST HIP REPLACEMENT) Finding	[reply] [d/l]
Re^3: tab delimited extraction, formatting the output by hbm (Hermit) on Feb 16, 2009 at 21:34 UTC
Somewhere along the way, your record separator changed from "EOU" to "\n\n\n". The first, presumably being unique in the data, was simple to test for. The latter is more difficult since you are processing the file one line at a time. I suppose you could keep a counter to see if three lines in a row contain nothing but a newline; but, it would be easier to change the record-separator variable (`$/`) to '\n\n\n'. I've done that below. Also, '(\s\s)' is a glaring string to have in your data! You do know that all those characters have special meaning in a regular expression: `/(\s\s)/` matches and keeps two whitespace characters. So to match them literally, you need to escape them. I've done that below too. (In my previous reply, I did think to myself that `/\s\s/` was not a very definitive expression, but I kept it because you had it.) I think the following does what you want. Note two assumptions: 1) That you always want to print the $u_val, even if there isn't a following $p_val/$m_val pair. 2) That $m_val always follows $p_val. use strict; use warnings; my $file = "z.txt"; open my $fh, "<", $file or die "Unable to open $file: $!"; my ($p_val, $m_val); { local $/ = '\n\n\n'; while (<$fh>) { # read until three newlines foreach (split/\n/) { # split it into individual lines if (s/\bProcessing\s\d+\.tx\.\d+: //) { print "$_\n"; # print the u_value immediately } elsif (s/\bPhrase: //) { s/"//g; # note: no need to escape " $p_val = $_; # keep it } elsif (/$\\s\\s$/) { $m_val = $_; # keep it if (defined $p_val) { print "\t$p_val\t$m_val\n"; # print if we have both } ($p_val, $m_val) = (undef)x2; } } } } close $fh; __END__ Pulmonary embolism at the time of hip replacement Pulmonary embolism (\s\s)1000 D0076131:PULMONARY EMBOLISM + Disease or Syndrome of hip replacement (\s\s)1000 D0554893:HIP REPLACEMENT (S +TATUS POST HIP REPLACEMENT) Finding [download]	[reply] [d/l]
Re^4: tab delimited extraction, formatting the output by zzgulu (Novice) on Feb 16, 2009 at 23:15 UTC
Re^5: tab delimited extraction, formatting the output by Anonymous Monk on Feb 17, 2009 at 14:00 UTC
Some notes below your chosen depth have not been shown here
Re^5: tab delimited extraction, formatting the output by hbm (Hermit) on Feb 17, 2009 at 14:20 UTC