pboin has asked for the wisdom of the Perl Monks concerning the following question:

I have an algorithm problem -- one that makes me realize how much I'm missing by not having any formal CS education...

I want to deduce some appropriate groupings for the following situation:

I have three tables:

So, essentially, I have a two-dimensional grid, and if the bit in the grid is 'on', access will be granted. Now, just for discussion, let's say I have 1000 users and 100 reports. That's up to a million different, individual access permissions that may (or may not) have been granted.

I know for sure that there are some patterns in here. Alice, Bob, and Carlos all work in Accounts Receivable, so they might get the same few reports. Likewise, Rob, Sarah and Tom might have some reports in common.

Now, I don't know who works with whom; I only know this 100x1000 grid. How can I mathematically figure out which 'sets' of reports can be assigned to people instead of individual user<->report rules, so that I have some efficient groups and long-term administration becomes a bit easier?

This is my starting theory:

The major problem with that is this:

I might get 4 reports. If you're my boss, you might get my 4 plus 2 more. Logically, the four we have in common could be grouped, and you'd have two individual grants in addition to the group of 4. My plan would miss that entirely, and I have a strong gut feeling that I'm ignorant of some nifty methodology for doing this kind of thing.

The only other thing I've thought of is computing some kind of ratio of how well any two users match. In the prior example, it would be a 4/6, or 67%, match.
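A match ratio like this is the Jaccard index of the two users' report sets: the size of the intersection over the size of the union. A quick sketch (in Python, with made-up report numbers):

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# The boss example: I get reports 1-4, my boss gets my 4 plus 2 more.
mine = {1, 2, 3, 4}
boss = {1, 2, 3, 4, 5, 6}
print(round(jaccard(mine, boss), 2))  # 4 shared out of 6 total -> 0.67
```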

I just can't quite get my head around this. I studied accounting in school, and we pretty much stopped with addition and subtraction. <g> Grouping makes my brain hurt, but since this seems like a common need, I'd think this wheel should be lying around somewhere, just not in a form that I'm recognizing at the moment.

Thanks Monks.

Update:

This question is coming up again. In response to tilly's point about re-designing the system, that's been done. There is a 'group' function implemented, and it works correctly. However, the existing system has the old recipient/report records in place and no groups back-filled.

I'm looking at 17k recipients and 121k reports, so the possibilities are *massive*: around 2 billion 'squares' in the grid. Only 1.7M of them are in the true state, so it's very sparse.

This question is not one I have to solve, it's just one that I'd like to for learning's sake. I've put a lot of thought into it, but I just don't know enough about math theory to figure it out yet...

Re: Deducing Ideal Groups
by tilly (Archbishop) on Apr 13, 2005 at 02:01 UTC
    You are trying to solve a problem which is much harder than the one you need to solve.

    The problem with trying to find automatic groupings is that you have no idea what the business logic is behind them. For instance, all 3 people who see a particular report may also see another; but a month after you group them, you're asked to make someone see the first report but not the second, and your scheme needs modification.

    The problem that you need to solve is much easier: how can you make the permission system more sensible? The answer is to build a system that lets people create groups, assign users to those groups, and assign the reports that each group can see. Find ways to encourage people to handle things this way, and let the people with the business knowledge wind up creating your groupings for you.
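    A bare-bones sketch of such a group-based scheme (Python, all names hypothetical): a user may see a report if any group they belong to has been granted that report.

```python
# Hypothetical group-based permission check: user -> groups, group -> reports.
user_groups = {"alice": {"accounts"}, "bob": {"accounts"}, "dave": {"it"}}
group_reports = {"accounts": {"ar_aging", "cash_flow"}, "it": {"uptime"}}

def can_see(user, report):
    """True if any of the user's groups is granted the report."""
    return any(report in group_reports.get(g, set())
               for g in user_groups.get(user, set()))

print(can_see("alice", "ar_aging"))   # True
print(can_see("dave", "cash_flow"))   # False
```

    The same idea maps directly onto three tables (users-to-groups, groups-to-reports) in an RDBMS, so adding someone to a department becomes one row instead of dozens of individual grants.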

    What they come up with may not be theoretically optimal according to some theory, but they'll understand it (huge win), and they are more likely than your algorithm to come up with groupings that will make sense as more reports get added.

Re: Deducing Ideal Groups
by Joost (Canon) on Apr 12, 2005 at 20:30 UTC
    How can I mathematically figure out which 'sets' of reports can be assigned to people instead of individual user<->report rules so that I have some efficient groups and long-term administration becomes a bit easier?

    It depends. :-) You could think up an algorithm that determines the current grouping according to the table, but chances are that that grouping won't be consistent in the long run. My guess is that you'll be better off if you try to find "logical" groups, possibly creating a few more reports for some people (i.e. put people in groups that have all the reports they need plus maybe one or two that they don't).

    Look at it this way: if you don't group people, you have 100 groups (1 for each report). If you try to minimize the number of groups, you might end up with groups that don't make any sense (and thus need a lot of maintenance).

    I'd probably go for some mechanism that allows me to tweak the groups manually, starting with a few obvious groups (like the Accounts group) and some additional small groups to make up for the reports that don't match well. Depending on the exact data, that might cut down the amount of time spent on managing the reports by 80% or so.

    Just a thought.

Re: Deducing Ideal Groups
by punch_card_don (Curate) on Apr 12, 2005 at 20:44 UTC
    And the day after you've formed these groups, Rob & Tom get promoted, your boss is charged with embezzlement, and the new CIO re-defines who gets which reports. So you have to run your grouping routine all over again.

    This is not a big problem. 1,000 users - create a database called "user_reports" and store the information report-wise, that is, who is allowed to see report 'x'. It could be an rdbms, or just simply a text file with a list of user names. You'll find that searching 1,000 records for a given report takes practically no time at all.
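    The report-wise storage described above amounts to inverting the grid into a report -> users mapping. A minimal sketch (hypothetical names):

```python
# Store access report-wise: for each report, the set of users allowed to
# see it. Looking up one report is then a single dictionary access.
grants = [("alice", "ar_aging"), ("bob", "ar_aging"), ("carlos", "cash_flow")]

report_users = {}
for user, report in grants:
    report_users.setdefault(report, set()).add(user)

print(sorted(report_users["ar_aging"]))  # ['alice', 'bob']
```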

    Forget that fear of gravity,
    Get a little savagery in your life.

Re: Deducing Ideal Groups
by tlm (Prior) on Apr 13, 2005 at 00:25 UTC

    Bioinformaticians deal a lot with a very similar kind of problem. There's a relatively new family of techniques for doing high-throughput experiments based on so-called "microarrays". You can think of them as miniaturized labs; each microarray is a matrix of experiments. For example, the rows can correspond to genes, and the columns can correspond to different patients in a longitudinal study. But there are a zillion variations on what the rows and columns can represent. What they all have in common is that, at the end of the day, one has a whole bunch of matrices of numbers (each cell in the matrix corresponds to the measurement yielded by that cell's micro-experiment), and one needs to somehow make sense of it. The first thing people want to do is exactly what you want to do: come up with some way to reorder the rows and/or columns of the matrix so that "similar" rows and/or columns are close to each other. This reordering process is called "clustering." Here's a picture of a typical microarray, and figure 3 of this paper illustrates the results of a clustering algorithm.

    As you can see, the main difference between your problem and what the biologists are dealing with is that your matrix consists of only 1s and 0s, whereas theirs usually holds continuously varying values. I doubt that the clustering methods would differ significantly between these two variants of the problem. BTW, the paper I linked to above uses a relatively simple clustering algorithm that would not be too hard to implement in Perl (though it'd probably be too slow). Their implementation is in C++, and it's available here; it's called Cluster, and only a Windows version is available. (It also comes with an unintentionally hilarious license agreement, or at least it used to, whose terms included a condition forbidding licensees from commenting on the quality of the code.)
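    For the 1s-and-0s case, a naive bottom-up (agglomerative) merge is easy to sketch: repeatedly merge the two most similar clusters, measuring similarity as the Jaccard index of their combined report sets, until nothing is similar enough. This is just an illustration, not the paper's algorithm, and at roughly O(n^3) it would only be practical on small data:

```python
# Naive agglomerative grouping of users by report-set similarity.
# Hypothetical data; merges the two closest clusters until the best
# similarity falls below a threshold.
def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster(users, threshold=0.5):
    # Each cluster is (set of user names, union of their reports).
    clusters = [({u}, set(r)) for u, r in users.items()]
    while len(clusters) > 1:
        i, j, sim = max(
            ((i, j, jaccard(clusters[i][1], clusters[j][1]))
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda t: t[2])
        if sim < threshold:
            break
        names = clusters[i][0] | clusters[j][0]
        reports = clusters[i][1] | clusters[j][1]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((names, reports))
    return clusters

users = {"alice": {1, 2}, "bob": {1, 2}, "carlos": {1, 2, 3}, "tom": {9}}
for names, reports in cluster(users):
    print(sorted(names), sorted(reports))  # prints each group and its reports
```

    Here alice, bob, and carlos end up in one group and tom in another; the threshold controls how aggressively groups are merged.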

    I looked around for some turnkey implementation of one such clustering algorithm, but I could not find anything that was sufficiently streamlined and accessible that it could be easily adapted to your problem. But this is not my area of expertise; maybe some of the bioinformatics-savvy monks will give you some better pointers.

    The other issue is that all the implementations of this sort of stuff that I've seen in Perl usually require something like PDL, because these algorithms tend to rely heavily on vector and matrix computation, and out-of-the-box Perl is just too slow for it (at least for matrices with hundreds of rows and columns).

    I'll ask around some more. What's your OS?

    the lowliest monk

      Interesting. I've been stirring this problem around for a couple of months, bringing it to the forefront occasionally. In the past, I looked at a package called AutoClass, from NASA. It rang a bell right away when I saw your microarray picture, as that's exactly what's on the front of the AutoClass page. However, I've not been able to make the jump from items with many varying-but-enumerated characteristics to items with two states.

      Thanks for the info, I'll study it more carefully today. BTW, my workstation is Debian, and the answers will ultimately be used by an application that runs on IBM mainframes (z/OS).

Re: Deducing Ideal Groups
by cbrandtbuffalo (Deacon) on Apr 12, 2005 at 21:02 UTC
    In my experience, the rules for groups are driven by business logic rather than computational ease. That is, there are rules somewhere that determine who can see what and these rules can be based on law, position in the company, clearance level, or personal preference of the rule-maker.

    If your concern is performance, I wouldn't worry too much until you see a problem manifest itself. I've worked on systems that used a simple user->access scheme before, and they scaled pretty well. (Caveat: these were on fairly large hardware.) I think you have some headroom before you hit a wall.

    If your concern is maintenance, I think maintenance is easier if the maintainer can hang the group on something tangible -- for example, the Accounting group mentioned above. It would be very difficult to maintain groups based on some optimization paradigm; in fact, you would probably need to re-group over time as people moved around and the landscape changed.

    I'm not aware of any common solutions to the problem. My guess is that it's because the solution is often specific to an organization. So if you hit the performance wall, you may need to work with the people who decide on report access and determine how people can be grouped more efficiently.