ToasterLeavings has asked for the wisdom of the Perl Monks concerning the following question:

I believe I am breaking my brain. I'm trying to finish off a fairly basic (so I thought) text analysis script, whereby I search a bunch of company annual reports and spit out the number of hits each one contains on several groups of terms which tap dimensions of organizational culture.

These terms come in five main flavours (with corresponding *rough* regexes):
single words '\bteam\b'
single words as stems '\bteam\w*'
phrases '\bstaff participation\b'
paired words with specified maximum separation distance '\bemployee\W+(\w+\W+){0,6}participation\b' & vice versa
phrases/words with 'exclusion terms' of specified minimum separation distance, such as '\bperform\w*' NOT within 6 words of '\bsafety\b'
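
For reference, the first four flavours can be counted with something like the sketch below. It is only an illustration: the helper name count_hits, the %flavour hash and the sample sentence are made up for this example, and the fifth flavour (exclusion terms) is left out because it is the subject of the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): count the
# non-overlapping matches of a precompiled pattern in a string.
sub count_hits {
    my ($re, $text) = @_;
    my $n = 0;
    $n++ while $text =~ /$re/g;
    return $n;
}

# One rough pattern per flavour, taken from the list above.
my %flavour = (
    word   => qr/\bteam\b/i,
    stem   => qr/\bteam\w*/i,
    phrase => qr/\bstaff participation\b/i,
    near   => qr/\bemployee\W+(?:\w+\W+){0,6}participation\b/i,
);

# Made-up sample sentence.
my $sentence = 'Teams value staff participation; each team rewards '
             . 'employee and manager participation.';

printf "%-6s %d\n", $_, count_hits($flavour{$_}, $sentence)
    for sort keys %flavour;
```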

I've whipped up the IO, pre-processing and sentence splitting bits without much drama, and now find myself knee deep in a weird post-apocalyptic wasteland of closure and regex madness. Basically, to count occurrences of particular terms within a sentence, I use this ...

sub regex2 {
    my $pat = shift;
    eval q%sub {
        my $hits = '0';
        while ($_[0] =~ /$pat/iogx) { $hits++; }
        if ($hits) { return $hits; } else { return undef(); }
    }%;
}

... in conjunction with a number of the following ...

my $foobar = regex2 '\bfoo\W+(?:\w+\W+){0,6}bar\b' ;

... and subsequent use of the function provided by &$foobar while iterating over sentences in the report (although this is done via a hash of references to the collection of created subs at present).

I adopted this method from one of the reference documents which I am stupidly now unable to find, but I remember at the time I was mighty excited about it because it offered speedy speedy searchin' goodness in situations where there are a pantsload of repetitive searches to be done. Hell ain't going to freeze over if I'm badly mistaken, and I welcome being set straight regarding my use of this approach if it's pointless.

Now for the real time of troubles. I cannot neatly create regexes for particular forms of 'exclusion term' phrases. As an example, I want to count all the instances of 'team' within a sentence, except for those instances which occur within 5 words of the word 'management'. I don't want to ditch the sentence outright if the exclusion term is found, I just want to ignore that particular instance of the term, but count the rest. Using rather hideous zero-width negative lookahead assertions I can skip the case where an exclusion term follows the term of interest (tested, but not exhaustively), e.g.:

my $ugly_and_evil = regex2 '\bteam(?!\W+(\w+\W+){0,5}management\b)' ;

but I can't do the same where the exclusion term comes before the term of interest because the zero-width negative lookbehind assertions are fixed-width!

my $broken_and_wrong = regex2 '(?<!\bmanagement\W+(\w+\W+){0,5})team' ;

This just plain can't work, and I need it to, dammit. I managed to kludge my way through a zero-width negative lookahead assertion that would ignore things like 'management thing thing thing team', but unfortunately it would then also fail to match any instances of 'team' following that part of the string.
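
A small sketch of the asymmetry described above, using a made-up sample sentence. The lookbehind pattern is built from a string so that the compile failure happens at run time, where eval can trap it; the exact error text depends on the perl version, so it is printed rather than assumed.

```perl
use strict;
use warnings;

# Compiling from a string defers the error to run time; a literal
# pattern would abort compilation of the whole script instead.
my $lookbehind = '(?<!\bmanagement\W+(?:\w+\W+){0,5})team';
my $compiled   = eval { qr/$lookbehind/i };
print "lookbehind rejected: $@" unless defined $compiled;

# The lookahead form compiles, and skips only the excluded occurrence.
my $lookahead = qr/\bteam\b(?!\W+(?:\w+\W+){0,5}management\b)/i;
my $sentence  = 'The team reports to management, '
              . 'but every other team is independent.';
my $hits = () = $sentence =~ /$lookahead/g;
print "hits: $hits\n";
```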

I could use the following which seems to work in the way I need it to:

my $inherently_evil = regex2 '(?<!management)
    (?<!management.)
    (?<!management..)
    (?<!management...)
    (?<!management....)
    (?<!management.....)
    (?<!management......)
    (?<!management.......)
    (?<!management........)
    (?<!management.........)
    (?<!management..........)
    (?<!management...........)
    (?<!management............)
    (?<!management.............)
    (?<!management..............)
    ## ....etc etc etc.....
    team\b
    (?!\W+(\w+\W+){0,5}management) ';

but anything that ugly tells me that everything I've done since 'use strict;' has been entirely wrong. If I didn't need to count the dang occurrences I'd be using text::query::advanced, but it would only flag a matching sentence for me. So, is there a humane way to implement the lookbehinds and have them be variable width, or alternatively have I chosen the most insane way to whip up my text analysis thing? Suggestions, abuse, raucous laughter, brain donors?

Re: Implementing variable-width negative lookbehind assertions?
by japhy (Canon) on Jun 22, 2001 at 08:17 UTC
Re (tilly) 1: Implementing variable-width negative lookbehind assertions?
by tilly (Archbishop) on Jun 22, 2001 at 07:29 UTC
    First of all your difficulty in implementing the RE is directly reflected in why variable width look-behinds are not done in the RE engine. (Because they are hard!)

    Anyways I would suggest that you play games with pos and the length of the match. Make this a multi-pass problem. The first pass creates a hash of possible positions of matches with the matched text. The further passes create lists of positions excluded for some reason (within 5 words after management) and are used to delete from the hash. Whatever keys survive are what you are interested in.

    Thinking this through is likely to be a bit messy, but should work and work within a reasonable time. (As long as you don't have too many conditions to worry about.)
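
One way the multi-pass idea might look is sketched below. This is an illustration rather than tilly's own code: the helper name count_unless_near and its arguments are invented here, it stores the end offset of each match instead of the matched text, and it uses the @- and @+ offset arrays rather than pos.

```perl
use strict;
use warnings;

# Hypothetical helper: count matches of $term that do not have $excl
# within $max intervening words on either side.
sub count_unless_near {
    my ($text, $term, $excl, $max) = @_;

    # Pass 1: hash of start offset => end offset for every candidate.
    my %hit;
    $hit{ $-[0] } = $+[0] while $text =~ /$term/g;

    # Pass 2: for each exclusion term, work out the stretch of text it
    # covers on either side and delete the candidates that fall inside.
    while ($text =~ /$excl/g) {
        my ($from, $to) = ($-[0], $+[0]);

        # Furthest offset at which a candidate may start after it.
        my $reach = substr($text, $to) =~ /\A\W+(?:\w+\W+){0,$max}/
                  ? $to + $+[0] : $to;
        # Earliest offset at which a candidate may end before it.
        my $back = substr($text, 0, $from) =~ /\W+(?:\w+\W+){0,$max}\z/
                 ? $-[0] : $from;

        for my $start (keys %hit) {
            my $end = $hit{$start};
            delete $hit{$start}
                if ($start > $to   && $start <= $reach)
                || ($end   < $from && $end   >= $back);
        }
    }

    # Whatever keys survive are the hits of interest.
    return scalar keys %hit;
}

# Made-up sample sentence: only the first 'team' is near 'management'.
my $sentence = 'Management trusts the team, yet the finance team and the '
             . 'audit team work far away from any direct oversight by management.';
my $count = count_unless_near($sentence, qr/\bteam\b/i, qr/\bmanagement\b/i, 5);
print "surviving hits: $count\n";
```

The windows are computed with ordinary anchored matches on the text before and after each exclusion term, so no lookbehind is needed at all.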

Re: Implementing variable-width negative lookbehind assertions?
by Aighearach (Initiate) on Jun 22, 2001 at 17:03 UTC
    Is it possible to move some of your length testing out of the regex?
    while ( /(simple_match)/g ) { $count++ if length($1) >= 6; }
    Regexes are hard. When I have minimized their use, the result has needed less maintenance than the times I built big fancy ones. But sometimes there is no way around it.
    --
    Snazzy tagline here
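
Applied to the 'team' / 'management' case from the question, moving the test out of the regex might look like the sketch below. The helper name and the sample sentence are hypothetical. The idea is to find each 'team' with a simple pattern, then test the text to its left with an ordinary end-anchored match, which may be as variable in width as it likes.

```perl
use strict;
use warnings;

# Hypothetical helper: count 'team' unless 'management' precedes it with
# at most five intervening words.
sub count_team_not_after_management {
    my ($sentence) = @_;
    my $count = 0;
    while ($sentence =~ /\bteam\b/gi) {
        # Everything to the left of the current match.
        my $left = substr($sentence, 0, $-[0]);
        # Skip this occurrence if the exclusion term sits just before it.
        next if $left =~ /\bmanagement\W+(?:\w+\W+){0,5}\z/i;
        $count++;
    }
    return $count;
}

# Made-up sample sentence: only the first 'team' is excluded.
my $n = count_team_not_after_management(
    'Management trusts the team, yet the finance team and the audit team agree.'
);
print "hits: $n\n";
```

The case where 'management' follows the term can stay in the regex as a negative lookahead, since lookahead has no width restriction.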
Re: Implementing variable-width negative lookbehind assertions?
by chip (Curate) on Jun 23, 2001 at 06:50 UTC
    I can offer you some simplification.

    First, the regex2 sub doesn't have to use eval, as the normal closure mechanism combined with //o should give you the semantics you have now. Also, you're going to the trouble of initializing your returned count to zero, then if it stays zero, you're returning undef. Better just to leave it uninitialized and use ++ on it; the results will be as you want. Thus:

    sub regex2 {
        my $pat = shift;
        # Match against the argument, as the original regex2 did.
        sub { my $count; ++$count while $_[0] =~ /$pat/iogx; $count };
    }

        -- Chip Salzenberg, Free-Floating Agent of Chaos
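
A variant that does not rely on //o at all is to compile the pattern with qr// when the closure is created. This is a sketch of a swapped-in technique, not the code above: the name make_counter is invented, and the closure copies its argument so that repeated calls always start matching from the beginning of the string.

```perl
use strict;
use warnings;

# Hypothetical variant of regex2: compile once with qr// and let the
# closure capture the compiled pattern.
sub make_counter {
    my ($pat) = @_;
    my $re = qr/$pat/ix;
    return sub {
        my ($text) = @_;
        my $count;
        ++$count while $text =~ /$re/g;
        return $count;    # undef when there were no hits
    };
}

my $foobar = make_counter('\bfoo\W+(?:\w+\W+){0,6}bar\b');
# Made-up sample sentence with two well-separated foo ... bar pairs.
my $hits = $foobar->('Foo met bar; much later in this rather long '
                   . 'example sentence foo and then bar again.');
print "hits: $hits\n";
```

Because each call to make_counter produces its own compiled pattern, what each closure matches does not depend on how //o behaves inside closures.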

      Interesting. The subroutine only gets compiled once but the regex gets compiled once for each closure created. That was a surprise to me at first but then the regex can't be compiled when the subroutine gets compiled ($pat isn't set then) so it gets compiled when it is first used (or is it actually compiled when the closure is created?)...

      I thought that the regex state was attached to a node in the parse tree and I'd expect the closures to all be sharing the same node for that regex...

      Care to go into more detail on why/how this works? Was this regex/closure interaction specifically implemented or did it fall out of the general design?

              - tye (but my friends call me "Tye")