tlhackque has asked for the wisdom of the Perl Monks concerning the following question:

Before re-inventing the wheel, thought I should see if the wise monks have a solution:

Consider a program with a list of matching rules for files, say:

Now, consider a find-like program that traverses a directory (or otherwise gets a list of candidate files) and applies each candidate file to the ruleset, selecting zero or 1 actions.

Clearly, the user would expect the selected action to be the most specific match - which seems like "the one satisfied by using the fewest/least powerful wildcards from the selected rule"

Algorithimicly, it might be something like:

Alternatively, the apache config file rules would work, except that in this case one wants only the final action to be applied.

I would have expected that this problem has been solved many times before, but haven't seen it in CPAN or elswhere. (I'm not wedded to the exact syntax used above.) Any pointers? Any solutions?

thanks

This communication may not represent my employer's views, if any, on the matters discussed.

Replies are listed 'Best First'.
Re: Glob best match?
by moritz (Cardinal) on Jun 29, 2011 at 13:38 UTC
    You can just sort the globs, and try the most specific first, and execute only the action assoicated with the first match.

    You can define a score like this: eliminate all meta characters from the glob (so fred/*.mac becomes fred/.mac), and use the number of characters in that resulting string as a (crude) measure for specificity.

    If you don't want to assign scores, but just compare pairs of globs for which one is more specific, a topologic sort can help you to get them in order for comparison.

Re: Glob best match?
by Anonymous Monk on Jun 29, 2011 at 13:03 UTC
Re: Glob best match?
by graff (Chancellor) on Jun 30, 2011 at 10:31 UTC
    This last step in your proposed algorithm seems like too much work:
    • Otherwise, assign a score to each matching rule by counting the number of characters consumed by each wildcard. Accept the rule with the lowest score

    It should suffice to assign ranking scores to the rules in advance, based on the relative "wildness" allowed by wildcard operators, then apply the rules in their ranked order (most specific to least specific) and invoke the action for the first one that matches.

    #!/usr/bin/perl use strict; use warnings; use Text::Glob 'match_glob'; # let's make up some dummy subroutines my @actions; for my $act ( 1 .. 7 ) { my $msg = " is handled by action$act\n"; $actions[$act] = sub { print $_[0], $msg }; } my %rule = ( '*' => 1, '*.txt' => 2, '*.tx?' => 7, 'fred/*' => 3, 'fred/*.mac' => 4, 'george.txt' => 5, 'fred/george.txt' => 6, ); my %rankings; for ( keys %rule ) { my $rank = length() * 100 - ( tr/*// ) * 10 - ( tr/?// ); push @{$rankings{$rank}}, $_; } my @rank_order = sort {$b<=>$a} keys %rankings; while (<DATA>) { chomp; my $matched; for my $rank ( @rank_order ) { for my $glob ( @{$rankings{$rank}} ) { if ( match_glob( $glob, $_ )) { $matched = $rule{$glob}; last; } } last if $matched; } if ( $matched ) { $actions[$matched]->( $_ ); } } __DATA__ foo.bar foo.bar.txt foo.bar.txo fred/foo.bar fred/foo.bar.mac george.txt fred/george.txt
    I suspect that someone else could come up with a more clever way to assign ranking scores, but the idea above is: longer patterns have higher scores (are more specific) than shorter ones; a pattern of a given length gets the most points if it has no wildcards; it loses 1 point for each "?" and 10 points for each "*". I think this captures the user's expectation.

    (It would make sense for patterns with no wildcards to be given a uniform maximum score, but I don't expect this would make a big difference in performance.) ((Update: On second thought, depending on how many file names you're checking and the relative quantities/proportions of exact-match vs. wildcard rules, it might be worthwhile to test all the exact-match rules as a group first, using just eq instead of a call to match_glob, and this would be easy if you assign a uniform, constant max score to all the exact-match rules, regardless of their length.))

Re: Glob best match?
by pklausner (Scribe) on Jun 30, 2011 at 10:15 UTC
    Isn't that a standard problem in any filtering system, e.g. firewall rules? The standard solution is: list the more specific patterns first, then compare until you find the first match.

    An automagic score for the specificity = precedence of a rule sounds cute. It would allow you to sort the rules by whatever pleases your eyes. But I suspect it would make debugging quite difficult, if your mental model of "specific" doesn't match exactly the possibly complicated algorithm.

    So the mundane comparison from top to bottom is more tedious, but preferable for its predictability.