
Hmm, I guess that was not the problem. The text file that I am processing is very large: it has over 8 million lines, and the script aborts (without any error message) at the utterance on line 43,100. The only unusual thing I noticed near the last utterance processed was that the column in front of p ($p_value) was empty. The last utterance was "History of present illness:". My NLP software (MMTX) parses it into three phrases: "History", "of present illness", and "", which is actually the punctuation ":". At first I thought ":" was the problem, but apparently it is not, since there are multiple instances before this one, starting from the very first lines. Do you think this is a size/buffer issue?

Re^9: tab delimited extraction, formatting the output
by kennethk (Abbot) on Feb 12, 2009 at 19:20 UTC
    It's possible, though the posted code shouldn't be caching any information. You can make sure it's not an issue with Text::CSV by running the code after removing the contents of the while loop.
      Can someone help me understand why, in the following code, $mc_value is printed on the line after $p_value rather than in front of it, one tab away from $p_value? Removing "\n" didn't help.
      Also, if I want to send the output to a file, should I put OUT in front of every print call? Thank you so much for your hints.

      #!/usr/bin/perl
      use strict;
      use warnings;

      my $file = "c:/ubuntu/regular.txt";
      #open OUT,">C:/output/filded_processed.txt";
      open my $fh, "<", $file or die "Unable to open $file: $!";

      my($u_value, $p_value, $mc_value) = (undef) x 3;

      while (my $line=<$fh>) {
          if ($line=~/\n\n\n/){
              ($u_value, $p_value, $mc_value) = (undef) x 3;
              print "\n";
          } elsif ($line=~/\bProcessing\s/) {
              $line=~s/\bProcessing\s\d+\.tx\.\d+: //;
              $u_value = $line;
              print "\n$u_value\n";
              undef $p_value;
          } elsif ($line=~/\bPhrase/) {
              $line=~s/\bPhrase: //;
              $line=~s/\"//g;
              if ($p_value) { print "\n" . ' ' x length $u_value;}
              $p_value=$line;
              print "\t$p_value";
              undef $mc_value;
          } elsif ($line=~/\s\s/ ) {
              if ($mc_value) { print "\t" . ' ' x length $p_value;}
              $mc_value=$line;
              print "\t$mc_value";
          } else {
              #die "Unexpected line format encountered, $file, @data";
          }
      }
      close $fh;
        why ... $mc_value being printed ... after $p_value and not in front ...

        That's the order you are printing in, isn't it?

        if ($mc_value) { print "\t" . ' ' x length $p_value;}
        $mc_value=$line;
        print "\t$mc_value";
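A separate point that may be contributing here: lines read with <$fh> keep their trailing newline unless they are chomped, so $p_value itself ends in "\n" and whatever is printed next starts on a new line. The sketch below uses made-up sample values (not taken from the poster's data) to show the difference:

```perl
use strict;
use warnings;

# Hypothetical sample values, as they would arrive from <$fh> (newline included).
my $p_value  = "History\n";
my $mc_value = "1000 History [Finding]\n";

# Without chomp, the newline inside $p_value pushes $mc_value onto the next line.
my $unchomped = "\t$p_value" . "\t$mc_value";

# With chomp, both values share one line and the newline is added once, at the end.
chomp(my $p_clean  = $p_value);
chomp(my $mc_clean = $mc_value);
my $chomped = "\t$p_clean\t$mc_clean\n";

print $unchomped;
print $chomped;
```

If this is the cause, adding chomp $line; as the first statement inside the while loop and printing "\n" explicitly where a line break is wanted should put the two values side by side.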

        should I put OUT in front of every print

        Yes, or you can select OUT; before your first print statement. That selects OUT as the default filehandle for output.
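A minimal sketch of the select approach, assuming a lexical filehandle instead of the bareword OUT. The helper name with_default_output is made up for illustration, and the demo writes to an in-memory handle so it can run without touching the disk:

```perl
use strict;
use warnings;

# Hypothetical helper: run $code with $out as the default handle for print,
# then restore whatever handle was selected before.
sub with_default_output {
    my ($out, $code) = @_;
    my $previous = select $out;   # select returns the previously selected handle
    $code->();
    select $previous;
    return;
}

# Demo: an in-memory filehandle stands in for the real output file.
my $buffer = '';
open my $out, '>', \$buffer or die "Unable to open in-memory handle: $!";
with_default_output($out, sub { print "\tphrase\tcandidate\n" });
close $out or die "Unable to close in-memory handle: $!";
```

Restoring the previous handle matters if the script later needs to print to the screen again; a plain select OUT; with no restore is enough if everything after it should go to the file.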

        Another suggestion. You have instances like this:

        } elsif ($line=~/\bProcessing\s/) {
            $line=~s/\bProcessing\s\d+\.tx\.\d+: //;
            $u_value = $line;

        Which could be simplified:

        } elsif ($line=~s/\bProcessing\s\d+\.tx\.\d+: //){
            $u_value = $line;
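This works because s/// returns the number of substitutions it made, which is false when nothing matched. A small standalone sketch; the sub name strip_processing_prefix and the sample lines in the usage note are illustrative only:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the line with the "Processing ...: " prefix
# removed, or undef (in scalar context) when the line has no such prefix.
sub strip_processing_prefix {
    my ($line) = @_;
    # s/// is true only if it replaced something, so it can act as the test.
    if ($line =~ s/\bProcessing\s\d+\.tx\.\d+: //) {
        return $line;
    }
    return;
}
```

For example, strip_processing_prefix("Processing 00000000.tx.1: History of present illness:") gives back "History of present illness:", while a line such as "Phrase: History" yields undef.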

        And you could simplify throughout with $_ instead of $line, but that's a matter of preference.

        I apologize for bugging you again. How can I print $p_value only if there is an $mc_value in front of it? I tried (if $mc_value ne "") before printing $p_value, but obviously it doesn't work, since the value of $mc_value is assigned after I print $p_value...
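One possible approach, sketched under the assumption that each phrase line is followed by zero or more candidate lines: hold the phrase back instead of printing it right away, and emit it only when the first candidate for it shows up. The sub format_records and the [type, text] record layout are invented for this example and would need adapting to the real parsing loop:

```perl
use strict;
use warnings;

# Hypothetical helper: takes [type, text] records (already chomped) and
# returns the formatted output as a string.
sub format_records {
    my @records = @_;
    my $output = '';
    my $pending_phrase;   # phrase seen but not yet printed

    for my $record (@records) {
        my ($type, $text) = @$record;
        if ($type eq 'phrase') {
            # Hold the phrase back; an earlier phrase that never got a
            # candidate is silently dropped here.
            $pending_phrase = $text;
        }
        elsif ($type eq 'candidate') {
            if (defined $pending_phrase) {
                $output .= "\t$pending_phrase";
                undef $pending_phrase;
            }
            $output .= "\t$text\n";
        }
    }
    return $output;
}
```

In the original loop the same idea would mean replacing the print in the Phrase branch with an assignment to a pending variable, and moving the print into the branch that handles $mc_value. Second and later candidates for the same phrase are printed without the alignment padding in this sketch.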