I believe I am breaking my brain. I'm trying to finish off a fairly basic (so I thought) text analysis script, whereby I search a bunch of company annual reports and spit out the number of hits each one contains on several groups of terms which tap dimensions of organizational culture.
These terms come in five main flavours (with corresponding *rough* regexes):
single words '\bteam\b'
single words as stems '\bteam\w*'
phrases '\bstaff participation\b'
paired words with specified maximum separation distance '\bemployee\W+(\w+\W+){0,6}participation\b' & vice versa
phrases/words with 'exclusion terms' of specified minimum separation distance, such as '\bperform\w*' NOT within 6 words of '\bsafety\b'
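To make the first four flavours concrete, here is a minimal sketch. The helper `count_hits`, the `%flavour` hash and the sample sentence are made up for illustration and are not part of my actual script; the exclusion flavour is left out because it is the troublesome one.

```perl
use strict;
use warnings;

# Hypothetical helper: count non-overlapping matches of a compiled regex.
sub count_hits {
    my ($re, $text) = @_;
    my $hits = 0;
    $hits++ while $text =~ /$re/g;
    return $hits;
}

# One illustrative pattern per flavour (non-capturing groups used here).
my %flavour = (
    word      => qr/\bteam\b/i,
    stem      => qr/\bteam\w*/i,
    phrase    => qr/\bstaff participation\b/i,
    proximity => qr/\bemployee\W+(?:\w+\W+){0,6}participation\b/i,
);

my $sentence =
    'The team said employee and staff participation helps teams and teamwork.';

# Report the hit count for each flavour, in a stable order.
for my $name (sort keys %flavour) {
    printf "%-10s %d\n", $name, count_hits($flavour{$name}, $sentence);
}
```

The stem pattern picks up 'team', 'teams' and 'teamwork', whereas the single-word pattern only picks up 'team'.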
I've whipped up the IO, pre-processing and sentence splitting bits without much drama, and now find myself knee deep in a weird post-apocalyptic wasteland of closure and regex madness. Basically, to count occurrences of particular terms within a sentence, I use this ...
sub regex2 {
    my $pat = shift;
    eval q%sub {
        my $hits = 0;
        while ($_[0] =~ /$pat/iogx) { $hits++; }
        if ($hits) { return $hits; } else { return undef(); }
    }%;
}
... in conjunction with a number of the following ...
my $foobar = regex2 '\bfoo\W+(?:\w+\W+){0,6}bar\b' ;
... and subsequent use of the function provided by &$foobar while iterating over sentences in the report (although this is done via a hash of references to the collection of created subs at present).
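For what it's worth, the hash-of-subs arrangement looks roughly like the sketch below. Note that `make_counter` is a hypothetical stand-in for regex2 that compiles the pattern with qr// instead of string eval plus /o, and the names and sentences are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical variant of regex2: compile once with qr//, return a closure
# that yields the hit count for a sentence, or undef when there are none.
sub make_counter {
    my ($pat) = @_;
    my $re = qr/$pat/ix;
    return sub {
        my $hits = 0;
        $hits++ while $_[0] =~ /$re/g;
        return $hits || undef;
    };
}

# Hash of references to the generated subs, keyed by an invented term name.
my %counter = (
    team    => make_counter('\bteam\w*'),
    foo_bar => make_counter('\bfoo\W+(?:\w+\W+){0,6}bar\b'),
);

my @sentences = (
    'The team met the other teams.',
    'Foo is never far from the bar.',
    'Nothing of interest here.',
);

# Run every counter over every sentence and accumulate totals per term.
my %total;
for my $sentence (@sentences) {
    for my $name (keys %counter) {
        my $hits = $counter{$name}->($sentence);
        $total{$name} += $hits if defined $hits;
    }
}

print "$_: $total{$_}\n" for sort keys %total;
```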
I adopted this method from one of the reference documents which I am stupidly now unable to find, but I remember at the time I was mighty excited about it because it offered speedy speedy searchin' goodness in situations where there are a pantsload of repetitive searches to be done. Hell ain't going to freeze over if I'm badly mistaken, and I welcome being set straight regarding my use of this approach if it's pointless.
Now for the real time of troubles. I cannot neatly create regexes for particular forms of 'exclusion term' phrases. As an example, I want to count all the instances of 'team' within a sentence, except for those instances which occur within 5 words of the word 'management'. I don't want to ditch the sentence outright if the exclusion term is found, I just want to ignore that particular instance of the term, but count the rest. Using rather hideous zero-width negative lookahead assertions I can skip the case where an exclusion term follows the term of interest (tested, but not exhaustively), e.g.:
my $ugly_and_evil = regex2 '\bteam(?!\W+(\w+\W+){0,5}management\b)';
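A small sketch of what that lookahead does and does not catch. The sub `count_team` and both test sentences are invented for the example, and the pattern is the same idea written with qr// and a non-capturing group.

```perl
use strict;
use warnings;

# 'team' not followed, within a handful of words, by 'management'.
my $team_not_before_mgmt = qr/\bteam(?!\W+(?:\w+\W+){0,5}management\b)/i;

# Hypothetical helper: count the instances of 'team' that survive the lookahead.
sub count_team {
    my ($text) = @_;
    my $hits = 0;
    $hits++ while $text =~ /$team_not_before_mgmt/g;
    return $hits;
}

# The first 'team' is followed by 'management' and is skipped;
# the second one is still counted.
my $skipped = count_team('The team management praised the team.');

# Here 'management' comes BEFORE 'team', so the lookahead cannot see it
# and the instance is counted even though I want it ignored.
my $unhandled = count_team('Senior management praised the team.');

print "exclusion term after: $skipped, exclusion term before: $unhandled\n";
```

Both calls come back with one hit, which is right for the first sentence and wrong for the second.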
but I can't do the same where the exclusion term comes before the term of interest because the zero-width negative lookbehind assertions are fixed-width!
my $broken_and_wrong = regex2 '(?<!\bmanagement\W+(\w+\W+){0,5})team';
This just plain can't work, and I need it to, dammit. I managed to kludge my way through a zero-width negative lookahead assertion that would ignore things like 'management thing thing thing team', but unfortunately it would then also fail to match any instances of 'team' following that part of the string.
I could use the following which seems to work in the way I need it to:
my $inherently_evil = regex2 '(?<!management)
    (?<!management.)
    (?<!management..)
    (?<!management...)
    (?<!management....)
    (?<!management.....)
    (?<!management......)
    (?<!management.......)
    (?<!management........)
    (?<!management.........)
    (?<!management..........)
    (?<!management...........)
    (?<!management............)
    (?<!management.............)
    (?<!management..............)
    ## ....etc etc etc.....
    team\b
    (?!\W+(\w+\W+){0,5}management)
    ';
but anything that ugly tells me that everything I've done since 'use strict;' has been entirely wrong. If I didn't need to count the dang occurrences I'd be using text::query::advanced, but it would only flag a matching sentence for me. So, is there a humane way to implement the lookbehinds and have them be variable width, or alternatively have I chosen the most insane way to whip up my text analysis thing? Suggestions, abuse, raucous laughter, brain donors?