Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I have a big file from my results with Perl. I want to look for the words around a specific word in this file. For example, if I have a file with the following sentences: i am going home they are going school. sam is going to lunch. How can I find the words before and after "going" and save them in a hash? With a 200-line program, I am now totally confused and need help. :-( Thank you.

Replies are listed 'Best First'.
Re: find words around a word in a file.
by JavaFan (Canon) on Jul 28, 2011 at 09:33 UTC
    The real problem here is to define what a "word" is. /\b\w+\b/ would not capture hyphenated words, and would only capture accented characters if the string is in Perl's internal UTF-8 format, or if you use the /u modifier (you need 5.14 for that). It also captures digits and underscores. /[a-zA-Z]+/ works fine if all your input data is ASCII. /\pL+/ captures strings of "letter" characters, but can capture strings that combine letters from different scripts.

    And none of the above will deal very well with words like "don't". So, before you ponder how to find the next or the previous word, consider how you find a word.
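    To make the trade-offs concrete, here is a minimal sketch that runs the three patterns over one sample string. The sample text and the helper name words_matching are made up for illustration; they are not from the original question.

```perl
use strict;
use warnings;

binmode STDOUT, ':encoding(UTF-8)';    # the sample contains a non-ASCII letter

# Return every substring of $string that matches $pattern.
sub words_matching {
    my ($pattern, $string) = @_;
    my @found = $string =~ /$pattern/g;
    return @found;
}

# Hypothetical sample: an apostrophe, a hyphen, digits and an accented letter.
my $text = "Don't re-enter room 101 of the caf\x{e9}.";

my @candidates = (
    [ '\b\w+\b with /u' => qr/\b\w+\b/u ],
    [ '[a-zA-Z]+'       => qr/[a-zA-Z]+/ ],
    [ '\pL+'            => qr/\pL+/      ],
);

for my $candidate (@candidates) {
    my ($label, $pattern) = @$candidate;
    print "$label: ", join(' | ', words_matching($pattern, $text)), "\n";
}
```

    On this sample all three patterns split "Don't" into "Don" and "t" and "re-enter" into two words. Only \w picks up "101", and only the ASCII class cuts the accented word short.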

Re: find words around a word in a file.
by GrandFather (Saint) on Jul 28, 2011 at 10:17 UTC

    What is the bigger problem you are working on? In other words: Why!

    Many of the other replies are trying to elicit further information to better specify your problem. As presented, there are many vague areas in your problem description. To really help, we need to know more about the context of your problem. If this is a homework exercise then a simple (\w+) capture will suffice to extract words, but if you need to deal with real-world text then you need to look to the Lingua modules to parse out words. On the other hand, if you are looking for specific words then a regular expression may be a good tool for the task.

    True laziness is hard work
Re: find words around a word in a file.
by moritz (Cardinal) on Jul 28, 2011 at 09:36 UTC
    There are multiple ways to do that, and the effort depends mostly on how robust you want it.

    For example, if the word is preceded by 10 empty lines, should the code find the last word before the 10 empty lines as context?

    Does "isn't" count as one word? Or two? Or something else?

    A very simplistic search could use a regex, and only succeed if the word is not the first or last in a line:

    while (<$FILE>) {
        if (/(\S+)\s+going\s+(\S+)/) {
            print "before: '$1'; after: '$2'\n";
        }
    }

    for some very crude definition of "word".

    For a more robust solution you could use App::Ack to give you some lines of context before and after the found word, and then extract the words you need.

    If you need a more robust detection of word boundaries, read up on text segmentation.
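    One way around the first-or-last-in-a-line limitation of the loop above is to read the whole file into one string, so that \s+ can also match line breaks and empty lines. The following is only a sketch under that assumption. The sub name neighbors_across_lines and the in-memory sample "file" are hypothetical, and a "word" is still any run of non-whitespace.

```perl
use strict;
use warnings;

# Find the words before and after $target, letting the context span line breaks.
# Assumption: a "word" is any run of non-whitespace, so punctuation sticks to it.
sub neighbors_across_lines {
    my ($target, $text) = @_;
    my @found;
    while ($text =~ /(\S+)\s+\Q$target\E\s+(\S+)/g) {
        push @found, [$1, $2];
    }
    return @found;
}

# A hypothetical in-memory "file"; a real program would open a path instead.
my $contents = "they are\ngoing\n\n\nhome. sam is going\nto lunch.\n";
open my $fh, '<', \$contents or die "Cannot open in-memory file: $!";
my $text = do { local $/; <$fh> };    # undefined $/ reads everything at once
close $fh;

for my $pair (neighbors_across_lines('going', $text)) {
    print "before: '$pair->[0]'; after: '$pair->[1]'\n";
}
```

    Slurping answers the empty-lines question one way: the last word before the gap is used as context. For files too big to hold in memory, a line-by-line loop that remembers the previous word would be needed instead.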

Re: find words around a word in a file.
by jethro (Monsignor) on Jul 28, 2011 at 09:30 UTC
    Go through the file word by word and remember the previous word. Use a hash of hashes (see perldsc) to count your neighbors, i.e.

    $count{$thisword}{$previousword}++;
    $count{$previousword}{$thisword}++;

    Do this for every word. That's it. You also have two options: handle '.' as a distinct word, so that words across sentence boundaries won't be neighbors, or ignore '.', so that words will be neighbors even if they are in different sentences.
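    A minimal sketch of this approach, taking the first option ('.' counts as a word of its own). The sub name count_neighbors and the tokenizing pattern (runs of ASCII letters and apostrophes, or a single full stop) are assumptions made for illustration. Words are lower-cased before counting.

```perl
use strict;
use warnings;

# Count neighbouring tokens in both directions in a hash of hashes.
# Assumption: a token is a run of ASCII letters/apostrophes or a single '.'.
sub count_neighbors {
    my ($text) = @_;
    my %count;
    my $previous;
    for my $token ($text =~ /[A-Za-z']+|\./g) {
        my $word = lc $token;
        if (defined $previous) {
            $count{$word}{$previous}++;
            $count{$previous}{$word}++;
        }
        $previous = $word;
    }
    return \%count;
}

my $count = count_neighbors(
    "i am going home they are going school. sam is going to lunch."
);

# List every neighbour of "going" with the number of times it was seen.
for my $neighbor (sort keys %{ $count->{going} }) {
    print "$neighbor: $count->{going}{$neighbor}\n";
}
```

    Because the sample has no full stop after "home", "home" and "they" end up as neighbors. Dropping the |\. alternative from the pattern gives the second option, where sentence boundaries are ignored.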

Re: find words around a word in a file.
by happy.barney (Friar) on Jul 28, 2011 at 09:35 UTC
    Next time please use <code> tags and also add the expected result.
    The following code will do something similar to your spec :-)
    my %hash;
    while (<>) {
        $hash{$1} = $2 while m/(\w+)\W+going\W+(\w+)/g;
    }
    __END__
    %hash = (
        'am'  => 'home',
        'are' => 'school',
        'is'  => 'to',
    );
      IMHO using hashes like this is a bad idea, because different $2 for the same $1 will be lost.

      Furthermore, looking for non-whitespace and non-punctuation could help to solve the "what is a word" problem in practice.

      my %hash;
      my $whitespace     = " \n\t";
      my $punctuation    = ".,!?";
      my $non_delimiters = "[^$whitespace$punctuation]";
      while (<DATA>) {
          # keep every "after" word for the same "before" word
          push @{ $hash{$1} }, $2
              while m/($non_delimiters+)\s+going\s+($non_delimiters+)/g;
      }
      use Data::Dumper;
      print Dumper \%hash;
      __DATA__
      I am going home.
      I am going to bed.
      What's going on?
      Output:
      $VAR1 = {
                'What\'s' => [
                               'on'
                             ],
                'am' => [
                          'home',
                          'to'
                        ]
              };
      I'm still not sure if a hash should be used at all. IMHO an array of pairs (two-element arrays) is better.
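      A sketch of what the array-of-pairs variant could look like, reusing the same whitespace and punctuation delimiters. The sub name context_pairs is hypothetical. Unlike a hash, the array keeps duplicates and preserves the order in which the matches were found.

```perl
use strict;
use warnings;

# Collect [before, after] pairs around $target from a list of lines.
# Assumption: words are delimited by whitespace and the punctuation . , ! ?
sub context_pairs {
    my ($target, @lines) = @_;
    my $word = qr/[^\s.,!?]+/;
    my @pairs;
    for my $line (@lines) {
        push @pairs, [$1, $2]
            while $line =~ /($word)\s+\Q$target\E\s+($word)/g;
    }
    return @pairs;
}

my @pairs = context_pairs(
    'going',
    "I am going home. I am going to bed.",
    "What's going on?",
);
print "$_->[0] ... $_->[1]\n" for @pairs;
```

      Note that each match consumes the word after the target, so in text like "going going gone" the second "going" is not matched as a target of its own.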
Re: find words around a word in a file.
by Not_a_Number (Prior) on Jul 28, 2011 at 18:09 UTC

    Following GrandFather's advice:

    use strict;
    use warnings;
    use Lingua::EN::Bigram;

    my $text = q{i ain't going home they are going school. sam isn't going to lunch. "Going going gone!" quoth the auctioneer};

    my $ngrams = Lingua::EN::Bigram->new;
    $ngrams->text( $text );
    my @trigrams = $ngrams->ngram( 3 );

    for ( @trigrams ) {
        print "$_\n" if ( split )[1] eq 'going';
    }

    Output:

    ain't going home
    are going school
    isn't going to
    . going going
    going going gone