Beefy Boxes and Bandwidth Generously Provided by pair Networks
There's more than one way to do things
 
PerlMonks  

Pattern matching when there are exception strings

by Moron (Curate)
on Sep 21, 2005 at 12:52 UTC ( #493761=perlquestion: print w/replies, xml ) Need Help??

Moron has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I have a set of search strings e.g. ALPHA, BETA, GAMMA plus a set of exceptions e.g. for ALPHA: "_ALPHA", "#ALPHA", "XL5 ALPHA" with a different set of exceptions for each search string. Note also that the search string may be any substring of one of its exceptions, not just a right substring as in this example.

I have a top directory to a tree containing years of source code and want to scan the lot and produce a report where each line is in the form:

<file>|<line no.>|<text>

where the text at that line in that file has met the above search criteria (i.e. matches bar the exceptions).

Without the exceptions requirement it was easy enough to recurse through the tree opening files and matching patterns against their contents - it's straightforward enough that I have to search each line multiply for the search strings.

But I seem to have a problem arranging the logic when it comes to eliminating the exceptions. For example:

"I AM ALPHA AND #ALPHA"

It contains both an exception and a non-exception so has to pass the criteria for a successful ALPHA search.

Any ideas from pattern-matching gurus would be more than welcome!

Best wishes,

the Moron

Update: here is the code so far:

#!/usr/bin/perl my $start = '/home/moron/ac5'; Init(); Traverse( $start ); close (my $sfh=$_{ FH }{ SUMMARY }); sub Traverse { my $dir = shift; opendir my $dh, $dir or die "$! while opening $dir"; for my $file ( grep !/^\./, readdir $dh ) { my $full = "$dir/$file"; if ( -d $full ) { Traverse( $full ); } else { Process( $full ); } } closedir $dh or die "$! while closing $dir"; } sub Process { my $file = shift; my $itref = $_{ IT }; my $found = 0; open my $fh, "<$file" or die "$! while opening $file"; while( <$fh> ) { KEY: for my $srch ( keys %$itref ) { if ( SmartSearch( $srch ) ) { print join( '|', ( $file, $., $_ ) ) . "\n"; $found++; last KEY; } } } close $fh or die "$! when closing $file"; if ( $found ) { my $sfh = $_{ FH }{ SUMMARY }; print $sfh "$file|$found\n"; } } sub SmartSearch { my $srch = shift; # and this is where I am stuck, except for some horrible ideas of +iterating through the combinations of cases such as exception in fron +t, exception behind, exception alone on the line } sub Init { $_{ IT }{ ALPHA } = 1; $_{ IT }{ BETA } = 1; $_{ IT }{ GAMMA } = 1; open my $sfh, ">SummaryFile.dat"; $_{ FH }{ SUMMARY } = $sfh; $_{ EXC }{ ALPHA }{ '_ALPHA' } = 1; $_{ EXC }{ ALPHA }{ '#ALPHA' } = 1; $_{ EXC }{ ALPHA }{ 'CAW5 ALPHA' } = 1; }
Free your mind

Replies are listed 'Best First'.
Re: Pattern matching when there are exception strings
by Roy Johnson (Monsignor) on Sep 21, 2005 at 13:31 UTC
    split on your exceptions and grep for your matches:
    my $exceptions = qr/_ALPHA|#ALPHA|XL5 ALPHA/; my $keepers = qr/ALPHA|BETA/; print "Matched ($srch)!\n" if grep /$keepers/, split $exceptions, $src +h;

    Caution: Contents may have been coded under pressure.

      Note that this solution, the other one you provided, and my suggestion to use s///g on exceptions all have the flaw that they won't work if good strings can overlap bad ones. (I.e. if _ALPHA is an exception, but _ALPHALPHA should be a good match.)

      -sauoq
      "My two cents aren't worth a dime.";
      
        Strictly speaking, it wasn't specified whether an overlapping good and bad match is good or bad. As it happens alphabetical superstrings are always exceptions, when there is no whitespace breaking them up and I need to add this in - it also illustrates why it's nasty for a regexp:

        #more exceptions to all /MATCH/ strings: /\w+MATCH\W+/ or /\W+MATCH\w+/ or /\w+MATCH\w+/ ...

        Er, except that of course it doesn't like \W so anyway I'll have to borrow from some other solutions here which use the local negation operator.

        thanks for your input,

        the Moron

        Free your mind

Re: Pattern matching when there are exception strings
by Roy Johnson (Monsignor) on Sep 21, 2005 at 13:41 UTC
    Match unless the line is made up entirely of exceptions and non-matches:
    my $exceptions = qr/_ALPHA|#ALPHA|XL5 ALPHA/; my $keepers = qr/ALPHA|BETA/; while (my $srch = <DATA>) { print "Matched ($srch)!\n" unless $srch=~/^(?:$exceptions|(?!$keeper +s).)*$/; } __DATA__ one with #ALPHA and good ALPHA one with just XL5 ALPHA one with just BETA one with _ALPHA and BETA

    Caution: Contents may have been coded under pressure.
Re: Pattern matching when there are exception strings
by Moron (Curate) on Sep 21, 2005 at 14:13 UTC
    Although by comparison lookahead and behind might get better performance because of less passes per line of text, sauoq's idea of simply removing the exceptions, which became just a few lines inserted into the orginal code still managed to performed very well indeed (still only a second or two through the whole tree). Notice that $_ is getting all the exceptions removed from it cumulatively, i.e. including for other search strings, but that doesn't do any damage given that the parameters of the problem already preclude one search string being a substring of another by reductio ad absurdum.
    sub SmartSearch { my $srch = shift; my $eref = $_{ EXC }{ $srch }; for my $exc ( keys %$eref ) { s/$exc//g; } return /$srch/; }
Re: Pattern matching when there are exception strings
by tomazos (Deacon) on Sep 21, 2005 at 14:54 UTC
    This is the combination of japhy's solution with mine reposted here for completeness:

    my (@prefixs, @suffixs); for (@exclusions) { my ($p, $s) = split /ALPHA/, $_, 2; push @prefixs, $p; push @suffixs, $s; } $total_raw = "(?!" . # Negative Lookahead: Any exceptional ALPHA "(" . join ("|", # Form (a|b|c) out of exceptional ALPHAs map ( # Form each exceptional ALPHA i # Lookbehind for prefix i "(?<=\Q$prefixs[$_]\E)" . "ALPHA" . # Lookahead for suffix i "(?=\Q$suffixs[$_]\E)", # where i = 0 to number of exceptions (0..$#exclusions) ) ) . ")" . ")" . . # Then match an ALPHA normally. # (We have already looked ahead and # confirmed that it is not exceptional) "ALPHA"; $total_rx = qr/$total_raw/;

    $total_rx will only match a non-exceptional ALPHA, without modifying the string your searching on, and it doesn't forget any weird cases. It is a thing of beauty. :)

    Enjoy,

    -Andrew.


    Andrew Tomazos  |  andrew@tomazos.com  |  www.tomazos.com
Re: Pattern matching when there are exception strings
by pboin (Deacon) on Sep 21, 2005 at 13:02 UTC

    You can have a negation class before your matching text. In most RegEx diatlects, classes are in square brackets, and the caret symbol negates the whole class.

    So, in order to match 'ALPHA' with anything *but* and underscore or a hash, you'd have something based on: [^#_]ALPHA.

      That wouldn't match the string "ALPHA, I AM" ... also, a character class can't deal with the "XL5 ALPHA" exception..

      .. which leads what to the author is looking for -- the zero-width negative look-behind assertion "(?<!pattern)" (see perlre).
      use strict; use warnings; while(<DATA>){ print /(?<!XL5 )(?<![#_])ALPHA/ ? "OK\n" : "NOT OK\n"; } __DATA__ I AM ALPHA AND GOOD ALPHA I AM ALPHA AND BETA I AM ALPHA AND #ALPHA I AM ALPHA AND _ALPHA I AM ALPHA AND XL5 ALPHA I AM #ALPHA I AM _ALPHA I AM XL5 ALPHA __OUTPUT__ OK OK OK OK OK NOT OK NOT OK NOT OK

        You're right, and I'll take my lumps for being wrong.

        But what I haven't quite figured out is *why* [^_#]ALPHA wouldn't match on 'ALPHA, I AM'. Is it because the character class has to match _some_ character? My train of thought was that the negation would be primary. "Is there an underscore or a hash leading?" "No."

      That wouldn't match "ALPHA" on a line by itself; it demands a non-# non-_ character before the ALPHA. If you do it this way you probably want a zero-width assertion.

      My first thought is to do a global replacement of all the exceptions, to remove them from the data. Then search what's left for the strings you do want to count. Performance-wise I don't know how great this would be, but it's easier to read (and possibly easier to write if it needs to be automated for a large amount of search strings).

      while(<>){ s/(?:_|#|XL5 )ALPHA//g; print if /ALPHA/; }

      edit: davidrw beat me to it.

        Splicing out things as you suggest would leave the possibility (however remote) that a false match would be created by the pasted-together remnants. Like if you removed "XL5 ALPHA" from "ALXL5 ALPHAPHA". See my split-and-grep recommendation for a similar technique that isn't subject to the pasting problem.

        Caution: Contents may have been coded under pressure.
Re: Pattern matching when there are exception strings
by sauoq (Abbot) on Sep 21, 2005 at 13:26 UTC

    Are your search strings really always surrounded by whitespace (and is that really in contrast to your exceptions?) If so, you might get away with something as simple as /\sALPHA\s/. Or maybe /(^|\s)ALPHA\s/ if your string could begin a line.

    -sauoq
    "My two cents aren't worth a dime.";
    
      No, there are no limitations on the search strings that are relevant to finding a solution, other than that they always match the regexp \w+.

      Thanks,

      The Moron

      Free your mind

        You might consider removing your exceptions with s///g and seeing if you have your search string left over. Not an efficient or clean solution to be sure, but it'll work. If you can't find another hook in your spec, it's a messy problem.

        -sauoq
        "My two cents aren't worth a dime.";
        
Re: Pattern matching when there are exception strings
by japhy (Canon) on Sep 21, 2005 at 13:28 UTC
    I think what I would do is build a list of the prefixes and suffixes around the exclusions of 'ALPHA' -- in your case, @prefixes would be ("_", "#", and "CAW5 ") and @suffixes would be (). Then I'd produce a regex from those:
    my (@pre, @suf); for (@exclusions) { my ($p, $s) = split /ALPHA/, $_, 2; push @pre, $p if length $p; push @suf, $s if length $s; } my $pre_rx = join "", map "(?<!\Q$_\E)", @pre; my $suf_rx = join "", map "(?!\Q$_\E)", @suf; my $total_rx = qr/${pre_rx}ALPHA$suf_rx/;
    Now you can match your strings against the pattern in $total_rx.

    Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
    How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
      Excuse me for being thick but suppose you had the exceptions '1ALPHA2' and '3ALPHA4'. Wouldn't $total_rx also exclude '3ALPHA2' and '1ALPHA4'? :) Or am I missing something?

      Having said that I still think that lookahead and lookbehind are the best bet.

      Just join them like this instead:

      $total_raw = "(" . # WRONG join ("|", map ( "(?<!\Q$pre[$_]\E)ALPHA(?!\Q$suf[$_]\E)", (0..$#pre) ) ) . ")"; $total_rx = qr/$total_raw/;

      (untested) Or something like that.

      Update: Dooh. That doesn't work. Need an "and" match not an "or" match. Hmmmm... back to the drawing board.

      Update: Okay, I've got it...

      Can you nest lookaheads and lookbehinds? If so this should work:

      $total_raw = "(?!(" . # CORRECT, MAYBE join ("|", map ( "(?<=\Q$pre[$_]\E)ALPHA(?=\Q$suf[$_]\E)", (0..$#pre) ) ) . "))ALPHA"; $total_rx = qr/$total_raw/;

      This has a zero-width assertion followed by ALPHA.

      The zero-width assertion is a lookahead exclude (ie looking ahead, that this is not true)

      What it is looking ahead to see is one of multiple patterns.

      Each pattern contains an ALPHA with a lookahead and behind for an excluded suffix and prefix pair. If one of these matches, we have an excluded alpha - so we exclude this match (the original zero-width assertion).

      So it says: match an ALPHA that does not have an excluded prefix/suffix.

      -Andrew.


      Andrew Tomazos  |  andrew@tomazos.com  |  www.tomazos.com
        Ah, right. Very good point. I wasn't thinking straight.

        Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
        How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
Re: Pattern matching when there are exception strings
by Moron (Curate) on Sep 22, 2005 at 12:41 UTC
    Finally I managed to reduce the exceptions into generic rules, although that was more a question of analysing the requirement. Although this code uses 5 pattern matches one after another (the program is a one-off but might be needed again) I do recognise that it could be reduced into a single regexp for performance purposes using some of the suggestions made above, but I have good output now and need to move onto the next step of the real work, which is to sort out which programmers get to fix which obsolete identifiers, these being what the progam has now successfully searched and found.

    A hearty thank-you to all those who contributed!

    sub SmartSearch { my $srch = shift; s/\w+\s*${srch}//g; s/\#$srch//g; s/$srch\s*\w+//g; s/$srch\#//g; return /$srch/; }

    -M

    Free your mind

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://493761]
Approved by lidden
Front-paged by ww
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others contemplating the Monastery: (5)
As of 2023-12-01 19:28 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    What's your preferred 'use VERSION' for new CPAN modules in 2023?











    Results (5 votes). Check out past polls.

    Notices?