SuicideJunkie has asked for the wisdom of the Perl Monks concerning the following question:

To the point: what I am looking for is a relatively clean way to say "any front anchored substring of $word" in regex.
If $word='ello', then

my $magicRegex = munge($word); if (/^h$magicRegex$/) {...}
should match 'h','he','hel', etc.


Background: (aka, XY-problem repellent)
I have a text-based combat game I'm working on, which involves the player sending commands such as

The latter causing the first three tubes to be pointed in the ship named raider's general direction plus 4 5 and 6 degrees for tube 1 2 and 3 respectively.

The backend physics code works fine, but the english parsing is not as flexible as it could be.

I've come up with a decent way to allow the parameters to be listed in any order, provided that the tags "dispersion" and "offset" are present to identify which number is which:

my $number = qr#(-?\d+\.?\d*)#; my $mixNmatch = qr#(?:(?:\s+[d][a-z]*\s+$number)|(?:\s+[oa][a-z]*\s+$n +umber))*#; while ($_ = <>) { no warnings; # dealing with undefs is clutter for this example print /^lock$mixNmatch\s*$/i ? "Yes: disp=$1, offset=$2\n" : "nope\n +"; }

The idea being that "d", "disp" and "dispersion" are all acceptable to mark a dispersion, while "offset", "angle", "ang", etc are all acceptable to mark an offset/angle. The ugly part is that "awoogha" also counts as marking an angle simply because it starts with an 'a'.

And thus I ask, is there a better way, or is this good enough?


The monks can fry your fish, and they can give you some tips and some bait, but you still need to wake up and climb onto the boat.

Replies are listed 'Best First'.
Re: Regex - Matching prefixes of a word
by Corion (Patriarch) on Jul 24, 2009 at 16:09 UTC

    You can require a space or end of string after your word, and you can make parts of your word(s) optional using the (?:...)? construct:

    /^an?$/

    will match any string that starts with a and maybe has a b following.

    /^a(?:n)?$/

    is the same, but using grouping parentheses.

    /^a(?:n(?:g)?)?$/

    will match a, an or ang

    As you can see, as the words get longer, the nesting gets deeper. Most likely, you will want to construct these word-matchers using some code instead of hand-crafting them.

Re: Regex - Matching prefixes of a word
by kennethk (Abbot) on Jul 24, 2009 at 16:19 UTC
    If I'm following you correctly, you want 'h', 'he', 'hel', 'hell' and 'hello' all to match when the test is for 'hello'. You can use \b (Using character classes) to anchor at the start (or end) of the word (it's a zero-width match), and then a series of ? (0 or 1 occurrences, Matching repetitions) in a set of nested Non capturing groupings with | (alternators, Matching this or that). So for hello, you could use:

    $line =~ /\bh(?:\b|e(?:\b|l(?:\b|l(?:\b|o))))/i;

    Where I've added case-insensitivity for good measure. Note that this does not match 'howdy'. Given that you'll want to do this for multiple words, I imagine, you'll probably want to auto generate your regular expressions, using something like:

    use strict; use warnings; while (<DATA>) { my $regex = '\b' . substr $_,0,1; foreach my $letter (split //, substr($_,1)) { $regex .= '(?:\b|' . $letter; } $regex .= ')' x ((length)-1); print $regex; } __DATA__ hello

    Update: As usual, ikegami's code is better than mine. A rewritten autogenerator using his pattern (Update 2: including dependence on Perl version for regex):

    use strict; use warnings; my $five10 = $] > 5.010 ? 1 : 0; while (<DATA>) { my $regex = '\b' . substr $_,0,1; foreach my $letter (split //, substr($_,1)) { $regex .= '(?:' . $letter; } $regex .= (')?' . '+' x $five10) x ((length)-1) . '\b'; print $regex; } __DATA__ hello
      That's overly complicated. Move the \b to the end.
      $line =~ /\bh(?:e(?:l(?:l(?:o)?)?)?)?\b/i;
      In 5.10, you can even disable needless backtracking
      $line =~ /\bh(?:e(?:l(?:l(?:o)?+)?+)?+)?+\b/i;

      I hadn't considered generating the regex algorithmically, and it is a great idea. I was concerned about complexity and readability of the regex itself, but I missed the (now) obvious curtain to hide the man behind :)

      A hash of huge, ugly and efficient pregenerated regex fragments to include will work wonderfully!

Re: Regex - Matching prefixes of a word
by ikegami (Patriarch) on Jul 24, 2009 at 16:27 UTC

    You can easily check if a word starts with another using index "backwards":

    index("hello", $word) == 0

    I don't know if it would be simpler to use this approach or not.

Re: Regex - Matching prefixes of a word
by psini (Deacon) on Jul 24, 2009 at 16:21 UTC

    What about reversing the approach?

    Instead of making a regex that matches any $word that is a prefix of 'angle', why don't use $word as a regex:

    my @keywords = qw / angle offset fire torpedo /; my @match = grep $_ =~ /^$word/ @keywords;

    If you previously ensured that word is a word (i.e. /\w+/), @match will give you all the keywords that partially match $word. If @match == 0, no match; if @match == 1, exact match; if @match > 1, word is ambiguous.

    Rule One: "Do not act incautiously when confronting a little bald wrinkly smiling man."

      I actually did use that pattern in a few limited other places in the code, but as far as I see, it can only be practically used after you've parsed out a parameter and want to know what things that specific parameter matches.

      >replace officer sci with Spock if ($command =~ /^(?:replace\s+)?officer\s+([a-z]+)\s+with ([a-z]+)/i) { $crewStation = $1; $newOfficerName = $2; $officer->{science} = $newOfficerName if ('science' =~ /$crewStation +/); ... }

      I don't see how that trick could be used like a subexpression to help out the first command recognition regex. (Being usable as a subexpression is critical)

      However, I do very much like the idea of grepping over a list of keywords in order to make synonyms trivial to add.

Re: Regex - Matching prefixes of a word
by JavaFan (Canon) on Jul 24, 2009 at 16:45 UTC
    I'd isolate the first "word" the player gives, and then check if it's a prefix. For instance:
    my ($first) = /^\s*(\w+)/ or next; if (index $first, "hello") { ... whatever ... }
    But if there's a limited number of commands a player can give, I would pre-calculate all unique prefixes, store them in a hash (with the real word as value), and then just do the hash access. For instance:
    my %prefixes = calc_all_unique_prefixes; if (/my ($first) = /^\s*(\w+)/) { if (exists $prefixes{$first}) { do_command $prefixes{$first} } else { print "What?\n"; } }

      A slight variation on that would be to have the hash indicate what to do rather than just what the word is. Split the command into words, and then follow the hash references to decide what to do, until you reach a code reference, or a !exists (indicating success or invalid command). It could be combined with the backwards regex trick, or the automatic regex generation to avoid having to list all substrings.

      One challenge with that approach is how to deal with optional fields and variable length lists of parameters. I suppose that line of thought could quickly turn the project into a full LR parser / evaluator.

Re: Regex - Matching prefixes of a word
by CountZero (Bishop) on Jul 25, 2009 at 09:38 UTC
    This approach will only gets you deeper and deeper in more and more complicated regexes and levels within levels of if ... then ... else ... constructs. And if you add another command you have to start all over.

    Better then to write a real and robust parser for your command language.

    First you have to invent the grammar and syntax for your commands. Once you have that structure made, you can start thinking of writing the parser:

    1. Split the command into tokens
    2. Check each token for being a valid command or parameter (here the "partial" command regex can be used!)
    3. (re)assemble the tokens into a canonical format, which your backend engine can directly understand.

    Higher Order Perl has some good examples how to do this.

    CountZero

    A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      I do not actually have any nesting in the command structure. The code in the OP was just a simple way to test suggestions.

      The code actually used in the game is very easy to chain together:

      # Fire! $cmd =~ /^$regexSubstringOf{fire}\s+$regexName$regexIndexes\s*$/i ? se +tCommand($player, 'fire',[] , $1, $2) : # Lock weapons on a target $cmd =~ /^$regexSubstringOf{lock}\s+$regexName$regexIndexes\s+(?:on\s+ +)?$regexName$regexLockParams?\s*$/i ? setCommand($player, 'aim',[$4,$ +5,$3] , $1, $2) : ...
      Adding new ways to say a command is as easy as copy-pasting one of those lines. Adding new commands is slightly more complicated, as it requires adding backend code to the addCommand() function too.

      I had briefly considered a parser at first. I have already got one from another project that can evaluate math formulas, save variables and even do a form of looping. It didn't seem like a good idea at the time, because the token overlap (lock on dock, dock with lock) seemed like trouble, I expected I'd have to put at least as much effort into deciding based on tokens as on the regex captures, and finally, the direct regex way seemed pretty simple and straightforward.

Re: Regex - Matching prefixes of a word
by ig (Vicar) on Jul 24, 2009 at 22:44 UTC

    update: psini has already suggested this.

    If you want to allow command words to be abbreviated to uniqueness, something like the following might appeal:

    use strict; use warnings; my $word = shift; my @words = qw(lock tubes on raider dispersion offset fire all torpedo +es); my $pat = quotemeta($word); my @matches = grep (/^$pat/, @words); if(@matches == 0) { print "Unknown command \"$word\"\n"; } elsif(@matches == 1) { print "Recognized \"$word\" as command $matches[0]\n"; } else { print "Ambiguous command \"$word\" could be any of: @matches\n"; }
Re: Regex - Matching prefixes of a word
by Marshall (Canon) on Jul 25, 2009 at 01:58 UTC
    This appears to be more of a language grammar question than just a regex question.

    for example: lock tubes 1 2 3 on raider dispersion 2 offset 5
    lock tubes 1 2 3 on raider dispersion 2 offset 5
    "lock" and "on" are "noise tokens".
    why wouldn't this mean the same thing?
    di 2 of 5 t 1 2 3 rai
    raider(this target perhaps amongst many), dispersion 2, offset 5 deg, tubes 1,2,3.
    Also, sounds like instead of "rai di 2 of 5 t 1 2 3", I could say : raider d 2 o 5 f 1 2 3 ? raider p(dispersion) 2 o(offeset) 5 fire(tubes) 1 2 3?

    It appears to me that the grammar needs to be worked out for each "command" to the crew.

    Interesting problem.

    I think that it is easy to tell if each token on command line is number or not and if it is a alpha token, whether that is unique or not - I think the hard part is to determine what happens. I'm sure that there must be other commands like "dive and maintain 300 ft" or whatever.

      "Lock" can't really be a noise token, since it is the verb of the command (are you trying to lock, fire, enable, disable, etc those tubes?). "On" is definitely noise, and optional. While numbers must be numbers, the names can have numbers mixed in too. At the moment they could be entirely numbers, although I could restrict that. Still, how do you decide whether 'torpedo' is meant to be a ship component, or an object on the sensors? Lock countermeasure (on) torpedo? Or Lock (S.S.)Countermeasure (with) torpedo ?
      And then what happens when you come up to a port, and try to "dock lock" instead of "lock dock"? :)

      What I have gone with so far is a general layout of {verb} {listenerCategory {all|list of numbers}} {parameters}. Sometimes there are no parameters ("enable all drives"), sometimes there are no listeners ("course 90"), but the order holds.

      The parameters are the most flexible part, but often limited to just one or two values anyways. Sometimes the noisewords might be required to disambiguate the command, such as when the parameters start with a number. I don't have any such situations in the current configuration, and I can't think of a good example. Even "load tube 1 2 3 40" would correctly match the 40 as your ordnance type and not torpedo tube #40, thanks to backtracking.

      Note: I don't actually check the correctness of the names or categories in the regex; provided that it looks like a specific command from a high level view, I capture the fields and check them in the addCommand() function.

      If the parameters are not named, then they have to be in a fixed order. ("helm 180 5" vs "helm course 180 speed 5" or "helm speed 5 course 180)
      The listeners are restricted to an exact match from a list of category synonyms. "Disable all R" does not shut down the reactor and the radar and the railgun in one go; you have to do those separately unless the config has defined a common category for them. By default, the config intends such a set of actions to cost multiple turns unless there is such a common category.
      Object targets, on the other hand, only select a single object from your sensor database hash, and that allows a fuzzy match, with a clarification response if there is not exactly one result.

      As far as layout, I've got a trinary operator chain where I link each regex to a function call (addCommand()) that checks params, munges them into a set of sub refs and then queues the final result up for action when the turn ends. ("Belay that" is available to pop the order queue, just in case) :)

      Each command has its own regex or three to provide for alternate syntax without getting too line-noisy, and adding new commands is just a matter of tacking a new $cmd =~ /regex/i ? addCommand('what', $1, $3, $2) : type line onto the end of it.

        WOW! sounds like a very cool game! I agree with CountZero that this thing doesn't appear to lend itself well to a single regex per command entry.

        I suspect that you will wind up saying doing a first pass that just identifies the type of tokens on command line: alpha numeric alphanumeric. So for example from what I can tell an alphanumeric thing is always wrong "Speed10", "10speed", or maybe not? - I'm guessing "speed 10" and "10 speed" would be ok, but a "run-together" thing combining 'speed' and '10' isn't valid. If you put regex code that says that "10speed" isn't valid into each command "regex", this could get to be a very big mess!

        I think that you are into quite a bit more than regex'es and will need a parser of the command line. But depending upon the grammar, it may not be that complex and is perhaps even easier than the regex type approach.