find words around a word in a file.

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: find words around a word in a file. by JavaFan (Canon) on Jul 28, 2011 at 09:33 UTC
The real problem here is to define what a "word" is. `/\b\w+\b/` does not capture hyphenated words, and it captures accented characters only if the string is in UTF-8 format or if you use the `/u` modifier (which needs 5.14). It also captures digits and underscores. `/[a-zA-Z]+/` works fine if all your input data is ASCII. `/\pL+/` captures strings of "letter" characters, but can capture strings that combine letters from different scripts. And none of the above will deal with words like `don't` very well. So, before you ponder how to find the next or the previous word, consider how you find a word.
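To make the differences concrete, here is a minimal sketch that runs the three patterns from the post over one sample string. The `letters_ext` pattern, the hash keys, the `words_by` helper, and the sample text are assumptions added for illustration; they are not from the post.

```perl
use strict;
use warnings;
use utf8;                                   # the sample string contains a non-ASCII literal
binmode STDOUT, ':encoding(UTF-8)';

# Candidate definitions of a "word"; the key names are hypothetical labels.
my %pattern = (
    word_chars  => qr/\b\w+\b/,
    ascii_only  => qr/[a-zA-Z]+/,
    letters     => qr/\pL+/,
    letters_ext => qr/\pL+(?:['-]\pL+)*/,    # assumption: allow internal ' and -
);

# Return every match of the named pattern in $text.
sub words_by {
    my ($name, $text) = @_;
    my $re = $pattern{$name};
    return $text =~ /$re/g;
}

my $sample = "Don't re-read café_2 now";
for my $name (sort keys %pattern) {
    print "$name: ", join(' | ', words_by($name, $sample)), "\n";
}
```

Only the extended pattern keeps `Don't` and `re-read` in one piece, and it still makes a choice (for example, it would also accept `a-'b`), so it is one possible definition rather than the right one.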
Re: find words around a word in a file. by GrandFather (Saint) on Jul 28, 2011 at 10:17 UTC
What is the bigger problem you are working on? In other words: Why? Many of the other replies are trying to elicit further information to better specify your problem. As presented, there are many vague areas in your problem description. To really help, we need to know more about the context of your problem. If this is a homework exercise then a simple `(\w+)` capture will suffice to extract words, but if you need to deal with real-world text then you need to look to the Lingua:: modules to parse out words. On the other hand, if you are looking for specific words then a regular expression may be a good tool for the task.
True laziness is hard work
Re: find words around a word in a file. by moritz (Cardinal) on Jul 28, 2011 at 09:36 UTC
There are multiple ways to do that, and the effort depends mostly on how robust you want it to be. For example, if the word is preceded by 10 empty lines, should the code find the last word before the 10 empty lines as context? Does `isn't` count as one word? Or two? Or something else? A very simplistic search could use a regex, and only succeed if the word is not the first or last in a line:

```perl
while (<$FILE>) {
    if (/(\S+)\s+going\s+(\S+)/) {
        print "before: '$1'; after: '$2'\n";
    }
}
```

for some very crude definition of "word". For a more robust solution you could use App::Ack to give you some lines of context before and after the found word, and then extract the words you need. If you need a more robust detection of word boundaries, read up on text segmentation.
Perl 6 - second systems done right
Re: find words around a word in a file. by jethro (Monsignor) on Jul 28, 2011 at 09:30 UTC
Go through the file word by word and remember the previous word. Use a hash of hashes (see perldsc) to count your neighbors, i.e.

```perl
$count{$thisword}{$previousword}++;
$count{$previousword}{$thisword}++;
```

Do this for every word. That's it. You also have two options: handle '.' as a distinct word, so that words across sentence boundaries won't be neighbors, or ignore '.', so that words will be neighbors even if they are in different sentences.
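The approach above might be sketched as follows. The `count_neighbors` name, the `$split_sentences` flag, the `\w+` tokenizer, and the sample text are assumptions for illustration. One deliberate variation: instead of storing '.' as a word of its own, this sketch resets the remembered word at a period, which also keeps words in different sentences from becoming neighbors.

```perl
use strict;
use warnings;

# Count, for every word, how often each other word appears right next to it.
# Returns a hash of hashes: $count->{word}{neighbor} = number of adjacencies.
sub count_neighbors {
    my ($text, $split_sentences) = @_;
    my %count;
    my $previous;
    # Tokens are runs of \w characters or a single '.'; a crude definition.
    for my $token (lc($text) =~ /\w+|\./g) {
        if ($token eq '.') {
            # With the flag set, a period breaks the chain; otherwise it is ignored.
            undef $previous if $split_sentences;
            next;
        }
        if (defined $previous) {
            $count{$token}{$previous}++;
            $count{$previous}{$token}++;
        }
        $previous = $token;
    }
    return \%count;
}

my $count = count_neighbors("I am going home. They are going to school.", 1);
for my $neighbor (sort keys %{ $count->{going} }) {
    print "$neighbor: $count->{going}{$neighbor}\n";
}
```

Because both directions are counted, a lookup such as `$count->{going}` lists the words before and after `going` together; keeping separate "before" and "after" hashes would be needed to tell them apart.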
Re: find words around a word in a file. by happy.barney (Friar) on Jul 28, 2011 at 09:35 UTC
Next time, please use the <code> tag and also add the expected result. The following code will do something similar to your spec :-)

```perl
my %hash;
while (<>) {
    $hash{$1} = $2 while m/(\w+)\W+going\W+(\w+)/g;
}

__END__
%hash = (
    'am'  => 'home',
    'are' => 'school',
    'is'  => 'to',
);
```
Re^2: find words around a word in a file. by The Perlman (Scribe) on Jul 29, 2011 at 17:13 UTC
IMHO using hashes like this is a bad idea, because different $2 for the same $1 will be lost. Furthermore, looking for non-whitespace and non-punctuation could help with practically solving the "what is a word" problem.

```perl
use strict;
use warnings;
use Data::Dumper;

my %hash;
my $whitespace     = " \n\t";
my $punctuation    = ".,!?";
my $non_delimiters = "[^$whitespace$punctuation]";

while (<DATA>) {
    # Collect every word that follows "going", keyed by the word before it.
    push @{ $hash{$1} }, $2
        while m/($non_delimiters+)\s+going\s+($non_delimiters+)/g;
}

print Dumper \%hash;

__DATA__
I am going home.
I am going to bed.
What's going on?
```

Output:

```
$VAR1 = {
          'What\'s' => [
                         'on'
                       ],
          'am' => [
                    'home',
                    'to'
                  ]
        };
```

I'm still not sure if a hash should be used at all; IMHO an array of pairs (two-element arrays) is better.
---The Perlman
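The array-of-pairs idea might look like the sketch below. The `context_pairs` name and the lookahead are assumptions added for illustration; the delimiter set is the one from the post. Each occurrence is kept as its own `[before, after]` pair, in the order found.

```perl
use strict;
use warnings;

# Collect [before, after] pairs for each occurrence of $target in $text.
sub context_pairs {
    my ($text, $target) = @_;
    my $word = qr/[^\s.,!?]+/;    # same delimiter set as the post above
    my @pairs;
    # The lookahead leaves the "after" word unconsumed, so it can serve as
    # the "before" word of an immediately following match.
    while ($text =~ /($word)\s+\Q$target\E\s+(?=($word))/g) {
        push @pairs, [$1, $2];
    }
    return @pairs;
}

my $text = "I am going home. I am going to bed. What's going on?";
print "before: '$_->[0]'; after: '$_->[1]'\n" for context_pairs($text, 'going');
```

Unlike the hash version, repeated "before" words stay as separate entries, and nothing is lost when the same pair occurs twice.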
Re: find words around a word in a file. by Not_a_Number (Prior) on Jul 28, 2011 at 18:09 UTC
Following GrandFather's advice:

```perl
use strict;
use warnings;
use Lingua::EN::Bigram;

my $text = q{i ain't going home they are going school. sam isn't going to lunch. "Going going gone!" quoth the auctioneer};

my $ngrams = Lingua::EN::Bigram->new;
$ngrams->text( $text );
my @trigrams = $ngrams->ngram( 3 );

for ( @trigrams ) {
    print "$_\n" if ( split )[1] eq 'going';
}
```

Output:

```
ain't going home
are going school
isn't going to
. going going
going going gone
```