pboin has asked for the wisdom of the Perl Monks concerning the following question:

Chances are this is really just a vocabulary question...

I'm in the following situation: I've been burdened with a table keyed on recipient and report. That key basically defines which user has read access to which reports. OK, so far so good.

The problem is that, years later, the table has ~16 million records in it, and administration is a nightmare. I need to programmatically divine what groups might be appropriate by looking at which reports are most commonly assigned to which people (i.e., by looking at the report-recipient relationships, make a SWAG at defining a reasonable 'Accounting' group, or maybe 'Sales', etc.). IOW, which groups should I create to eliminate the greatest number of report/recipient relationships?

Coding up the group mechanism isn't so bad. I just need a way to go back and populate the new groups table. (FWIW, I looked into Debian's AutoClass package, and that doesn't seem to be quite what I need, but I don't exactly have a Masters in Math either.)

So, essentially, I believe this is a vocabulary question, or perhaps I'm asking for a 'group identification' algorithm.

Thank you for your time and kind consideration.

Re: Group Identification
by dragonchild (Archbishop) on Oct 21, 2003 at 17:22 UTC
    A few thoughts:
    • Grouping is a client-driven need. For example, only a human can tell that Report 9 and Report 12 are in the same logical grouping, but Report 10 isn't. You need a human to make those groups.
    • Is there a way of defining that N reports are always assigned to someone in the Foo department? For example, everyone in Accounting will always use the X, Y, and Z reports. That may be a good grouping ...
    • You might want to consider changing your reports. For example, you don't need separate reports for:
      • "Report X by Month"
      • "Report X by Year"
      • "Report X by Category"
      Those should be "Report X". Then, if someone isn't allowed to see the by-month breakdown, you can control that in a separate access table. Since those restrictions are likely to be rare, you will reduce your table size.
    • Have you eliminated any extra data? For example, you can remove all the people that aren't there anymore, or all the reports that aren't there anymore ...
    • Also, you might want to consider coming at it from the other direction - are people allowed to see more reports than not? If the number of restrictions is less than the number of permissions, it might behoove you to flip the table from "allowed" to "denied".
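
Two of the suggestions above (finding sets of reports that always travel together, and checking whether a "denied" table would be smaller) can be roughed out in a few lines. The sketch below is only an illustration in Python, not anything from the original table design: the function names, the in-memory list of (recipient, report) pairs, and the `min_members` cutoff are all assumptions. It only buckets recipients whose report sets are exactly identical, which is a crude stand-in for real clustering; partial overlaps would still need a human, or something like frequent-itemset mining, to sort out.

```python
from collections import defaultdict


def candidate_groups(grants, min_members=2):
    """Bucket recipients that hold exactly the same set of reports.

    grants: list of (recipient, report) pairs, one per row of the
    existing access table (hypothetical in-memory form).
    Returns a list of (reports, members, rows_saved), best saving first.
    """
    reports_by_recipient = defaultdict(set)
    for recipient, report in grants:
        reports_by_recipient[recipient].add(report)

    # Recipients with an identical report set land in the same bucket.
    members_by_reports = defaultdict(list)
    for recipient, reports in reports_by_recipient.items():
        members_by_reports[frozenset(reports)].append(recipient)

    groups = []
    for reports, members in members_by_reports.items():
        if len(members) < min_members:
            continue
        # Today: one row per (recipient, report) pair.
        before = len(reports) * len(members)
        # With a group: one row per report plus one row per member.
        after = len(reports) + len(members)
        groups.append((sorted(reports), sorted(members), before - after))

    groups.sort(key=lambda g: g[2], reverse=True)
    return groups


def cheaper_as_denials(grants, all_recipients, all_reports):
    """True if storing 'denied' pairs would take fewer rows than 'allowed'."""
    allowed = len(set(grants))
    denied = len(all_recipients) * len(all_reports) - allowed
    return denied < allowed
```

For instance, if three recipients each hold exactly reports X, Y, and Z, `candidate_groups` reports one candidate group that replaces nine rows with six. A human still has to decide whether that bucket really is 'Accounting'. At 16 million rows the same bucketing would more likely be done in the database than in memory.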

    ------
    We are the carpenters and bricklayers of the Information Age.

    The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6

    ... strings and arrays will suffice. As they are easily available as native data types in any sane language, ... - blokhead, speaking on evolutionary algorithms

    Please remember that I'm crufty and crotchety. All opinions are purely mine and all code is untested, unless otherwise specified.