snowy has asked for the wisdom of the Perl Monks concerning the following question:

Hi

I have two text files, one containing a list of words (one per line) and one containing a bunch of sentences (one per line). For each word in the word file I want to list all the sentences in which it appears.

I've tried regular expressions and these don't seem to work too well. I have a feeling that grep will produce the results I want, but I'm not too sure how to go about using it. Does anyone have a code snippet showing how to solve my problem?

Thanks in advance.

Re: GREP/Regex - Locating Words in Sentences
by FoxtrotUniform (Prior) on May 02, 2002 at 16:05 UTC
    my @words = qw(foo bar);
    my @sentences = (
        "let's say we have a function called foo",
        "I'm going to the bar",
    );

    for my $word (@words) {
        # \b anchors the match at word boundaries, so "foo" does not match "foobar"
        my @occurrences = grep /\b$word\b/, @sentences;
        print "$word occurs in ", join("\n", @occurrences), "\n";
    }

    --
    :wq

Re: GREP/Regex - Locating Words in Sentences
by gmax (Abbot) on May 03, 2002 at 08:59 UTC
    Although the previous answers have explained the technique, I think that there are some pitfalls that a full example will show better.

    The example below works fine, provided that there are not too many sentences (i.e., provided that they all fit in memory).

    #!/usr/bin/perl -w
    use strict;

    # read the words, one per line, without the trailing newline
    open WORDS, "< words" or die "can't open words: $!";
    my (@words, @sentences);
    while (<WORDS>) {
        chomp;
        push @words, $_;
    }
    close WORDS;

    # read the sentences, keeping the trailing newline for printing
    open SENTENCES, "< sentences" or die "can't open sentences: $!";
    push @sentences, $_ while <SENTENCES>;
    close SENTENCES;

    # for each word, list every sentence that contains it as a whole word
    for my $word (@words) {
        my @found = grep /\b$word\b/, @sentences;
        if (@found) {
            print $word, ": \n\t", join "\t", @found;
        }
    }
    __END__
    contents of file "words"
    -------------------------------
    first
    second
    third
    fourth
    -------------------------------
    contents of file "sentences"
    -------------------------------
    I am the first
    I always wanted to be the first
    I never liked to be second
    I second your request
    I will never appear in the output
    Better second than third
    -------------------------------
    program's output
    -------------------------------
    first:
            I am the first
            I always wanted to be the first
    second:
            I never liked to be second
            I second your request
            Better second than third
    third:
            Better second than third

    In this example, I have "slurped" into memory all the words and all the sentences. This is due to the requirements that the matching sentences should be shown for each word, and that each sentence could belong to more than one word.
    I have the feeling that in a real life situation you could not afford the "slurp" luxury. If this is the case, then you need either a database engine or an algorithm that will read the words first, then store the matching lines as file addresses into a hash, and finally for each word retrieve the matching lines using the stored addresses.
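
    The "file addresses" idea above could be sketched roughly as follows, using tell to record where each matching line starts and seek to fetch it again later. The sub names index_offsets and lines_for are made up for this illustration, and the sketch assumes a seekable filehandle and words that should be matched literally (hence \Q...\E).

```perl
use strict;
use warnings;

# Build an index: word => list of byte offsets of the lines containing it.
# $fh must be a seekable filehandle; $words is an array reference.
# (index_offsets and lines_for are hypothetical names for this sketch.)
sub index_offsets {
    my ($fh, $words) = @_;
    my @compiled = map { [ $_, qr/\b\Q$_\E\b/ ] } @$words;
    my %offsets;
    my $pos = tell($fh);                 # offset where the next line starts
    while (my $line = <$fh>) {
        for my $w (@compiled) {
            push @{ $offsets{ $w->[0] } }, $pos if $line =~ $w->[1];
        }
        $pos = tell($fh);
    }
    return \%offsets;
}

# Re-read the lines stored for one word by seeking to each saved offset.
sub lines_for {
    my ($fh, $offsets, $word) = @_;
    my @lines;
    for my $pos (@{ $offsets->{$word} || [] }) {
        seek($fh, $pos, 0) or die "can't seek: $!";
        my $line = <$fh>;
        push @lines, $line;
    }
    return @lines;
}
```

    With the files from the example, one would open "sentences" once, call index_offsets on it with the word list, and then print lines_for for each word. Only the offsets stay in memory, not the sentences themselves.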

    Notice that if you want to show the results the opposite way (for each sentence, which words it matches), then you can read all the words (which presumably fit in memory), do the matching for each sentence as you read it, and print the results immediately.

    #!/usr/bin/perl -w
    use strict;

    # store each word together with its precompiled regex
    open WORDS, "< words" or die "can't open words: $!";
    my @words;
    while (<WORDS>) {
        chomp;
        push @words, [$_, qr/\b$_\b/];
    }
    close WORDS;

    # read one sentence at a time and print it with the words it contains
    open SENTENCES, "< sentences" or die "can't open sentences: $!";
    while (<SENTENCES>) {
        my $printed = 0;
        for my $word (@words) {
            if (/$word->[1]/) {
                print $_ unless $printed++;   # print the sentence only once
                print "\t", $word->[0];
            }
        }
        print "\n" if $printed;
    }
    close SENTENCES;
    __END__
    program's output:
    -------------------------------
    I am the first
            first
    I always wanted to be the first
            first
    I never liked to be second
            second
    I second your request
            second
    Better second than third
            second  third

    In this second script, as an additional measure, I stored the words with the qr operator, which compiles them as regular expressions. The program should run noticeably faster this way, since the regex for each word is compiled only once instead of once for every sentence.
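
    One pitfall worth a small sketch: a word containing regex metacharacters (say "a.b" or "c++") would be interpreted as a pattern by qr. Assuming the words are meant literally, wrapping them in \Q...\E avoids that. The subs compile_words and words_in are hypothetical helpers, not part of the scripts above.

```perl
use strict;
use warnings;

# Compile each word once. \Q...\E escapes metacharacters, so "a.b"
# matches only the literal text "a.b" and not, for example, "axb".
# (compile_words and words_in are hypothetical names for this sketch.)
sub compile_words {
    my @words = @_;
    return map { [ $_, qr/\b\Q$_\E\b/ ] } @words;
}

# Return the words whose precompiled pattern matches the sentence.
sub words_in {
    my ($sentence, @compiled) = @_;
    return map { $_->[0] } grep { $sentence =~ $_->[1] } @compiled;
}
```

    Note that \b only behaves as a word boundary next to letters, digits and underscore, so words that begin or end with punctuation would need a different anchor.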

    Hope these examples give you the elements to solve your problem.
     _  _ _  _  
    (_|| | |(_|><
     _|   
    
Re(Amel): GREP/Regex - Locating Words in Sentences
by dsb (Chaplain) on May 02, 2002 at 16:54 UTC
    You could extend this as necessary to accommodate a longer list of words to be tested.

    my $word = shift;    # the word to look for, taken from the command line
    my %sent = (
        one => "This is sentence one. I wonder what the word is.",
        two => "This would be sentence two.",
        thr => "Is this sentence three. I think it is. I sure hope it is.",
    );

    # keep the keys whose sentence contains $word as a whole word
    my @matches = grep $sent{$_} =~ /\b$word\b/, keys %sent;

    print "The matching sentences are:\n";
    for (@matches) {
        print;
        print " => ", $sent{$_}, "\n";
    }

    Note that grep also makes use of regular expressions (in this case, anyway). So it may not have been your regular expressions that weren't working, but rather the logic of your code.
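
    To illustrate the "in this case anyway" part: grep accepts any Perl condition, not only a regex match. A minimal sketch with made-up sample sentences:

```perl
use strict;
use warnings;

# sample data invented for this illustration
my @sentences = ("I am the first", "I second your request", "Nothing here");

# grep EXPR, LIST: here the expression is a regex match against $_
my @by_regex = grep /\bsecond\b/, @sentences;

# grep BLOCK LIST: any condition works, here a plain substring test
my @by_index = grep { index($_, 'second') >= 0 } @sentences;

# in scalar context grep returns the number of matching elements
my $count = grep /\bfirst\b/, @sentences;

print "regex: @by_regex\n";
print "index: @by_index\n";
print "count: $count\n";
```

    The substring test has no notion of word boundaries, so it would also accept "secondary"; the regex form with \b is the one that matches whole words only.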

    Hope this helps ;0)




    Amel
Re: GREP/Regex - Locating Words in Sentences
by Cyrnus (Monk) on May 02, 2002 at 16:35 UTC
    #!/usr/bin/perl -w
    use strict;

    my @words = qw(foo bar);
    my @lines = (
        "foo before bar\n",
        "the word is foo\n",
        "one goes over the mountain\n",
        "one reaches the bar\n",
        "Here foo is in the middle\n",
        "foo on you\n",
        "this foobar will be rejected\n",
    );
    my $istrue = 0;
    my $line;
    my $word;

    # print every line that contains at least one of the words
    foreach $line (@lines) {
        $istrue = 0;
        foreach $word (@words) {
            if ($line =~ /\b$word\b/) {
                $istrue = 1;
            }
        }
        if ($istrue) {
            print $line;
        }
    }
    John