hillard has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a piece of code that is parsing through lines which are each a distinct sentence. I need to search each sentence for various words, and then print out the sentence to an output file if it contains the specified word. Here's what I have so far (looking for the word 'so'):
$string = "@current_sentence";
if( $string =~ /.*\bso\b.*/ ) {
    print $string;
}
#input: <s> so we can do it again yeah yeah </s>
#output: <s> so

Here @current_sentence is an array consisting of the words in the sentence (which were stripped from a database file). The print will eventually be redirected to an output file, but in the meantime I just let it go to STDOUT. The problem is that the print statement only prints the beginning of the line, up to the word 'so', and not the entire line. I'm not sure what the problem is. Thanks for any help.

Replies are listed 'Best First'.
Re: Why doesn't the whole line print?
by blakem (Monsignor) on Aug 18, 2001 at 03:44 UTC
    Looks OK to me. Have you tried printing the string before you do the match, or perhaps the values of @current_sentence? I don't see any immediate error in your code. Try something like:

    print "current_sentence element: $_\n" for (@current_sentence);
    $string = "@current_sentence";
    print "String Before: $string\n";
    if( $string =~ /.*\bso\b.*/ ) {
        print "String After: $string\n";
    }

    -Blake

      Yep, I did that and the whole line prints. That's why I'm stumped, it seems as if the regex is clipping the string...
        You'll have to post a bit more code then, because my demo program doesn't suffer from the same problem:

        #!/usr/bin/perl
        my @current_sentence = qw(so we can do it again yeah yeah);
        print "current_sentence element: $_\n" for (@current_sentence);
        $string = "@current_sentence";
        print "String Before: $string\n";
        if( $string =~ /.*\bso\b.*/ ) {
            print "String After: $string\n";
        }
        OUTPUT

        current_sentence element: so
        current_sentence element: we
        current_sentence element: can
        current_sentence element: do
        current_sentence element: it
        current_sentence element: again
        current_sentence element: yeah
        current_sentence element: yeah
        String Before: so we can do it again yeah yeah
        String After: so we can do it again yeah yeah

        Although I don't really like the $scalar = "@array" construct, I see nothing wrong with it. I would probably use: $scalar = join(' ',@array) instead for clarity.
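
A minimal sketch of the point about "@array" versus join, assuming the default value of the list separator variable $". The $custom variable and the '-' separator are only for illustration and are not from the original code.

```perl
use strict;
use warnings;

my @current_sentence = qw(so we can do it again yeah yeah);

# Interpolating an array joins its elements with $" (a single space by default).
my $interpolated = "@current_sentence";
my $joined       = join(' ', @current_sentence);

print "same\n" if $interpolated eq $joined;

# Changing $" changes what interpolation produces; join() is unaffected.
my $custom;
{
    local $" = '-';
    $custom = "@current_sentence";
}
print "$custom\n";
```

Both forms give the same string here; join just makes the separator explicit, which is the clarity argument made above.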

        -Blake

Re: Why doesn't the whole line print?
by runrig (Abbot) on Aug 18, 2001 at 03:51 UTC
    See Death to Dot Star!. There is no reason for the '.*' in your regex. All you need is:
    print $string if $string =~ /\bso\b/;
    As for your problem, all I can say is, how are you setting '@current_sentence', and is it what you expect when you assign it to '$string'? Print both out to see if they are what you think they are.
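
A small sketch of why the '.*' adds nothing, assuming a test string shaped like the one in the question. A match used as a true/false test never modifies the string, so the regex itself cannot be what truncates the output. The variable names below are made up for the illustration.

```perl
use strict;
use warnings;

my $string = "<s> so we can do it again yeah yeah </s>\n";
my $before = $string;

# Both patterns succeed on exactly the same strings when used as a boolean test.
my $with_dot_star    = $string =~ /.*\bso\b.*/ ? 1 : 0;
my $without_dot_star = $string =~ /\bso\b/     ? 1 : 0;

print "both match\n"       if $with_dot_star && $without_dot_star;
print "string unchanged\n" if $string eq $before;

# \b keeps "so" from matching inside a longer word such as "also".
my $no_false_hit = "we also agree" =~ /\bso\b/ ? 0 : 1;
print "no false hit on 'also'\n" if $no_false_hit;
```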
      Sorry for the '.*'; I put it in hoping that the whole line would be matched and not truncated. As for @current_sentence, I am just pushing words onto the array as I strip them from a database. In the example I gave, 'so' is the first word in the sentence, so I also added a '<s>' to the word I pushed. Here is the code:
      push @current_sentence, "<s> $current_word ";
      Then I push additional words onto the array as I find them with this:
      push @current_sentence, "$current_word ";
      Finally the last word of a sentence gets pushed like so:
      push @current_sentence, "$current_word <\/s>\n";
      Another interesting observation: I just tried printing the string outside the if statement and the whole thing printed! But earlier I tried to get around this by having the if statement set a flag and then printing from another if statement that checked the flag, and it still printed truncated. So I still can't print conditionally.

      Thanks for all the help so far.

      An update: I have looked at the behavior on some other examples, and it seems as though the print statement is printing only the array element that was originally pushed onto the array, even though I put the array into a single string. What I mean by this is that if the word starts a sentence, the <s> tag is ahead of it; if it is in the middle, there are no tags; and if it is at the end, a </s> tag follows it. Baffling to me!

Re: Why doesn't the whole line print?
by hillard (Acolyte) on Aug 18, 2001 at 04:47 UTC
    Ok here's the code...
    #!/g/rcs/sw/bin/perl -w
    #
    # Create clusters of meeting data in a semi-unsupervised fashion
    #
    # parameters:
    #   1) input directory
    #   2) output data
    ####################################################################

    use strict;

    my $indir = $ARGV[0];
    my $outfile = $ARGV[1];

    opendir(INDIR, $indir) || die "directory open failed";
    open(OUTPUT, ">$outfile");

    my @files = readdir(INDIR); #get all the files in the directory
    shift @files; shift @files; #shift off '.' and '..'

    foreach my $file (@files) { #process files until none are left
        my $slash = '/';
        my $infile = $indir.$slash.$file;
        open(INPUT, $infile) || die "error on file open";

        my $within_spurt=0;
        my $previous_interupt = 0;
        my $first = 1;

        LINE: while (<INPUT>) { #inner loop to create initial clustering
            if( $first ) {
                $first = 0; # skip first line which isn't data
                next LINE;
            }

            my @line = split(' ');
            my $current_word = $line[2];
            my $word_in_spurt = $line[3];
            my $spurt_length = $line[4];
            my $primary_speaker = $line[9];
            my $interupting_speakers = $line[11];
            my @current_spurt;
            #@debug = ($line[0], $line[1], $line[3], $line[4], $line[5]);

            # see if number of speakers increases and first_speaker == 0,
            # this means the spurt is interupting, not the primary speaker
            if($primary_speaker == 0 && $interupting_speakers > $previous_interupt
               && $word_in_spurt == 1) {
                push @current_spurt, "<s> $current_word ";# start marker
                $within_spurt = 1;
                if($word_in_spurt == $spurt_length) {#only word in spurt
                    push @current_spurt, "<\/s>\n";
                    $within_spurt = 0;
                    $interupting_speakers = 0; # the interupt is over
                }
            }
            # if we are in a spurt and it is not the last word of that spurt
            elsif($within_spurt && $word_in_spurt != $spurt_length) {
                push @current_spurt, "$current_word "; #add current word
            }
            # if this is the last word of a spurt
            elsif($word_in_spurt == $spurt_length && $within_spurt == 1) {
                push @current_spurt, "$current_word <\/s>\n"; #end marker
                $within_spurt = 0;
                $interupting_speakers = 0; # the interupt is over
            }

            #make sure that at the end of spurts this flag is reduced
            if($word_in_spurt == $spurt_length) {
                $interupting_speakers--; # should not be less than 0
                if($interupting_speakers < 0) {
                    $interupting_speakers = 0;
                }
            }
            $previous_interupt = $interupting_speakers;

            my $yeah = 0;
            my $string = join('',@current_spurt);
            if( $string =~ /\bso\b/ ) {
                $yeah = 1;
            }
            print STDOUT $string if $yeah;
            print $string;
            print OUTPUT @current_spurt;

            undef @current_spurt;
            undef $string;
        }
        close(INPUT);
        last;
    }
    For the line:

    <s> so we can do it again yeah yeah </s>

    The result is:

    <s> so <s> so we can do it again yeah yeah </s>

      Since no single line contains the whole sentence, you shouldn't be doing the test inside the while loop.

      @current_spurt is declared with my inside the loop, so it never holds more than the word from the current line.

      Just have another array called @my_sentance, declared outside the loop, and push @current_spurt onto it. After the while loop where you read and process the file, do your if /\bso\b/ test.
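
A rough sketch of that suggestion, not the poster's actual fix. The @words list is a hypothetical stand-in for the words read line by line from the data file, and testing at the </s> marker (rather than once after the whole loop) is my own assumption, made so that several sentences in one file would be checked separately.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the per-line words built while reading the file.
my @words = ('<s> so ', 'we ', 'can ', 'do ', 'it ', 'again ', 'yeah ', "yeah </s>\n");

my @my_sentance;    # declared OUTSIDE the loop, so it survives between iterations
my @matches;        # complete sentences that contain the word "so"

for my $word (@words) {
    my @current_spurt = ($word);    # fresh on every pass, as in the original code
    push @my_sentance, @current_spurt;

    # Only test once the end marker shows the sentence is complete.
    if ($word =~ m{</s>}) {
        my $string = join('', @my_sentance);
        push @matches, $string if $string =~ /\bso\b/;
        @my_sentance = ();          # start collecting the next sentence
    }
}

print @matches;
```

The only structural change from the original is where the accumulating array lives and when the regex test runs.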

      I modified:

      print STDOUT $string if $yeah;
      print $string;

      # into

      print STDOUT $string if $yeah;
      print $string, "|$.|";
      # $. is the current line input number, see perlvar for more

      # and got

      bash-2.05$ perl test.org.pl ./data/ out
      |2||3||4|<s> so <s> so |5|we |6|can |7|do |8|it |9|again |10|yeah |11|yeah </s>
      |12||13|bash-2.05$
      bash-2.05$
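
For anyone unfamiliar with $., here is a tiny sketch of the same debugging trick, assuming an in-memory filehandle in place of the real data file. The three-line $data string is invented for the example.

```perl
use strict;
use warnings;

# Invented three-line input; the first line plays the role of the header.
my $data = "header\nso\nwe\n";
open(my $in, '<', \$data) or die "cannot open in-memory file: $!";

my @seen;
while (my $line = <$in>) {
    chomp $line;
    # $. holds the line number of the filehandle that was read most recently.
    push @seen, "$line|$.|";
}
close($in);

print "$_\n" for @seen;
```

Tagging each print with the line number makes it obvious that every chunk of output comes from a different pass through the loop.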

       
      ___crazyinsomniac_______________________________________
      Disclaimer: Don't blame. It came from inside the void

      perl -e "$q=$_;map({chr unpack qq;H*;,$_}split(q;;,q*H*));print;$q/$q;"

      As crazyinsomniac says, move the test out of the while loop because you don't have the full sentence there. I went ahead and made some more changes: I got rid of some variables that were not needed, made some stylistic changes, and added two uses of grep. More can be done, but that's my once-over.

      #!/usr/bin/perl -w
      #
      # Create clusters of meeting data in a semi-unsupervised fashion
      #
      # parameters:
      #   1) input directory
      #   2) output data
      ####################################################################

      use strict;

      my $indir = shift @ARGV or die "No indir supplied";
      my $outfile = shift @ARGV or die "No outfile supplied";

      die "$indir not a directory!" unless -d $indir;
      open(OUTPUT, ">$outfile") || die "Could not open $outfile";
      die "Could not chdir $indir" unless chdir($indir);
      opendir(INDIR, ".") || die "directory open of $indir failed";
      # I prefer the chdir instead of using the slash for reading the filenames

      my @files = grep(!/^\.\.?\z/, readdir(INDIR)); #get all the files in the directory
      # I'm not sure if readdir will always give you . and .. first
      # See the readdir function in the docs. That's my preferred method from the camel
      closedir(INDIR); #just being neat

      foreach my $infile (@files) { #process files until none are left
          next if -d $infile;
          open(INPUT, $infile) || die "error on file opening $infile";

          my $within_spurt=0;
          my $previous_interupt = 0;
          my @current_spurt;

          <INPUT>; # skip first line which isn't data
          while (<INPUT>) { #inner loop to create initial clustering
              my (undef, undef, $current_word, $word_in_spurt, $spurt_length,
                  undef, undef, undef, undef, $primary_speaker, undef,
                  $interupting_speakers) = split(' ');

              # see if number of speakers increases and first_speaker == 0,
              # this means the spurt is interupting, not the primary speaker
              if ($primary_speaker == 0 && $interupting_speakers > $previous_interupt
                  && $word_in_spurt == 1) {
                  push @current_spurt, "<s> $current_word ";# start marker
                  $within_spurt = 1;
                  if($word_in_spurt == $spurt_length) {#only word in spurt
                      push @current_spurt, "<\/s>\n";
                      $within_spurt = 0;
                      $interupting_speakers = 0; # the interupt is over
                  }
              }
              # if we are in a spurt and it is not the last word of that spurt
              elsif($within_spurt && $word_in_spurt != $spurt_length) {
                  push @current_spurt, "$current_word "; #add current word
              }
              # if this is the last word of a spurt
              elsif($word_in_spurt == $spurt_length && $within_spurt == 1) {
                  push @current_spurt, "$current_word <\/s>\n"; #end marker
                  $within_spurt = 0;
                  $interupting_speakers = 0; # the interupt is over
              }

              #make sure that at the end of spurts this flag is reduced
              if($word_in_spurt == $spurt_length) {
                  $interupting_speakers--; # should not be less than 0
                  if($interupting_speakers < 0) {
                      $interupting_speakers = 0;
                  }
              }
              $previous_interupt = $interupting_speakers;
          }
          close(INPUT);

          if( grep { /\bso\b/ } @current_spurt) {
              print STDOUT @current_spurt;
              print OUTPUT @current_spurt;
          }
      }
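
A short sketch of the readdir filtering idiom, using a made-up list of names in place of real readdir output (the entries, including 'mr001.txt', are hypothetical). It shows why the dots in the pattern need to be escaped: an unescaped dot matches any character.

```perl
use strict;
use warnings;

# Hypothetical directory listing, standing in for readdir() output.
my @entries = ('.', '..', 'a', 'ab', '.hidden', 'mr001.txt');

# Escaped dots: removes only the literal "." and ".." entries.
my @files = grep { !/^\.\.?\z/ } @entries;

# Unescaped dots match any character, so one- and two-character names vanish too.
my @too_greedy = grep { !/^..?\z/ } @entries;

print "escaped:   @files\n";
print "unescaped: @too_greedy\n";
```

Filtering by name this way also avoids relying on '.' and '..' being the first two entries returned, which the shift-twice approach assumes.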
      Here is the part of the data file that produces the sentence above. I can't technically give out the whole file, since it is not allowed to be public yet (crazy legal stuff).
      mr001 c0 right 1 1 40.352 0.27 r:15_ay:9_t:3 2.26 0 0 1 1 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0
      mr001 c0 so 1 3 76.197 0.21 s:13_ow:8 35.575 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 1 0 0 0 0
      mr001 c0 go 2 3 76.407 0.1 g:3_ow:7 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
      mr001 c0 ahead 3 3 76.507 0.12 ax:3_hh:3_eh:3_d:3 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      mr001 c0 so 1 8 80.301 0.06 s:3_ow:3 3.674 0 0 1 0 1 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0
      mr001 c0 we 2 8 80.361 0.06 w:3_iy:3 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
      mr001 c0 can 3 8 80.421 0.25 k:18_ax:4_n:3 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      mr001 c0 do 4 8 80.671 0.06 d:3_uw:3 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      mr001 c0 it 5 8 80.731 0.13 ax:10_t:3 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      mr001 c0 again 6 8 80.861 0.22 ax:7_g:4_eh:3_n:8 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      mr001 c0 yeah 7 8 81.315 0.15 y:7_ae:8 0.234 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
      mr001 c0 yeah 8 8 81.465 0.29 y:4_ae:25 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0
      mr001 c0 i'm 1 4 100.044 0.16 ay:11_m:5 18.289 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0
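
To make the column layout concrete, here is a small sketch that pulls the fields the script uses out of one row of the sample data. The field positions (2, 3, 4, 9, 11) are taken from the posted code; what the remaining columns mean is not stated in the thread, so they are left alone.

```perl
use strict;
use warnings;

# One row from the sample data (the "so" that starts the eight-word spurt).
my $row = "mr001 c0 so 1 8 80.301 0.06 s:3_ow:3 3.674 0 0 1 0 1 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0";

# split(' ', ...) splits on runs of whitespace and ignores leading whitespace.
my @line = split(' ', $row);

# Array slices pick out the same positions the original script indexes one by one.
my ($current_word, $word_in_spurt, $spurt_length) = @line[2, 3, 4];
my ($primary_speaker, $interupting_speakers)      = @line[9, 11];

print "$current_word is word $word_in_spurt of $spurt_length\n";
print "primary_speaker=$primary_speaker interupting_speakers=$interupting_speakers\n";
```

With these values the first branch of the if chain fires (primary speaker 0, one interrupting speaker, first word of the spurt), which is where the "<s> so " element comes from.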