Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

Self-Shrinking Character Classes

by athomason (Curate)
on Apr 12, 2001 at 06:38 UTC ( [id://71932]=perlquestion: print w/replies, xml ) Need Help??

athomason has asked for the wisdom of the Perl Monks concerning the following question:

This question's title is perhaps a bit deceptive and/or confusing, because I'm not entirely sure how to classify this problem. For background, the purpose of this script is to match a user-specified pattern against a wordlist, but with a twist. The caveat is magic behavior for the any-character-matching-symbol '.'. Any periods in the pattern should substitute to a previously entered character class (e.g. [prl]). Now, if this same class were substituted for each period, the same letter could potentially match in more than one spot. The question is how to make this list act like a Scrabble bag: once a character is used, it shouldn't be used for a subsequent match. Clearly, this isn't a typical regexp feature, but I'm wondering if it's possible to combine other features for similar functionality.

I'll give an example for clarification. Say I receive the user pattern ^.e..$ and the letter bag 'prl' once again. The straight substitution would give /^[prl]e[prl][prl]$/. But I'd like the regexp to match 'perl' and 'lerp' without matching 'lerr' or 'pepp'. That is, matching a letter in any wildcard spot eliminates it from consideration in later wildcard matches.

The clean and simple solution would be the ability to perform set operations on character classes, e.g. /^[prl]e[{prl}-{\1}][{prl}-{\1\2}]$/. Sadly, those subtraction bits are made up, since I don't know of any real way to do this.

Right now I'm faking it by putting negative lookahead assertions for previous matches in front of subsequent instances of the character class. The example would then look like /^([prl])a(?!\1)([prl])(?!\1|\2)([prl])$/. However, this has (at least) two problems. First, the previously matched characters aren't actually removed from later classes, which is inefficient. Second, quantifiers don't work correctly, as /^([prl])e(?!\1)([prl])+$/ will incorrectly match on 'pell'. The second problem is obviously more bothersome than the first, but both are issues I'd like to resolve. An auxilliary problem with using character classes is that duplicates in the letter bag are immediately quashed. That's actually an even bigger problem for me, but the solution seems farther away.

So, is there a better way to do this? The only full solution I can think of would start using (??{}) assertions, and I'm not eager to start down the self-evaluating pattern path. I just re-read Camel 3's regexp chapter looking for the set operations or equivalent hack, but I can't find one. And sadly, I don't have a copy of Mastering Regular Expressions around for its ninja magic. As always, any insights would be appreciated.

Replies are listed 'Best First'.
Re: Self-Shrinking Character Classes
by chipmunk (Parson) on Apr 12, 2001 at 07:35 UTC
    This is similar to the problem merlyn solves in pat - find words by matching pattern (for crypto). You could definitely adapt his approach to creating a regex containing negative lookahead. By the way, I found merlyn's script to be quite efficient.
Re: Self-Shrinking Character Classes
by Trimbach (Curate) on Apr 12, 2001 at 07:54 UTC
    ...or, for a less subtle, more brute-force method:
    @words = qw (perl lerp lepp wombat); $pattern = ".e.."; @pattern_elements = split "", $pattern; word: foreach $word (@words) { my @bag = qw (p r l); # Reset bag my @letters = split "", $word; print "\nTesting $word...\n"; for my $i(0...$#letters) { my $element = $pattern_elements[$i]; # Create a character class with the letter bag my $class = '['.join ("", @bag).']'; if ($letters[$i] =~ m/($class)/) { # If matched, remove the letter from the bag @bag = grep /[^$1]/, @bag; } elsif ($letters[$i] eq $element) { next; } else { print "$word not matched\n"; next word; } } print "$word matched!\n" if @bag == 0; }
    yields:
    Testing perl... perl matched! Testing lerp... lerp matched! Testing lepp... lepp not matched Testing wombat... wombat not matched
    Not sure how efficient it is, or even how robust it is, but heck, it looks like it does what you want. It tracks each letter in your search string against each character in a search pattern, makes a match (if possible) and shrinks the bag by the match. Hope this helps.

    Gary Blackburn
    Trained Killer

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://71932]
Approved by root
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others having a coffee break in the Monastery: (4)
As of 2024-03-28 08:10 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found