alienhuman has asked for the wisdom of the Perl Monks concerning the following question:

This isn't strictly a perl question, it's more a db design question--but I'm going to be using perl eventually to get, insert and manipulate this data. If this is inappropriate for PM, please delete this writeup.

Consider:

I'm storing the information about who ran what report when in a single table (this is a legacy). One of the fields is an ID that is generated using a uid, a timestamp and a varying list of whitespace delimited strings that denote various report options.

For instance, one report id might be something like "uid333 200309091230 H J Krm 3 2003-6-30" while another may be "uid333 200309091231 A L foobar 4". So you can see that while these ids have something in common, they also have a lot that's not in common. The other columns in the table are identical. The id is what is used to determine if the client has run the report previously within a specified time limit, and they are charged less for the subsequent runnings.

For example, if I run the report that corresponds to "uid 200309091230 H J Krm 3 2003-6-30" in the morning, and again after lunch, the report running script compares the ids (generated partially from the report options) and if there is a match, compares timestamps (data in another column) and charges accordingly.

My question is this: does anyone have any thoughts about how to store the report options in such a way that eliminates this cumbersome report id string? How does one deal with data that is loosely similar?

AH

As a p.s., the # options for a report are not predictable. In other words, it's not enough to know that I ran a XYZ report at 2pm. You must know that I ran an XYZ report with the values 1,1,2,3,5,8. Which is entirely different than the XYZ report with values 5,8,13 or just 21,baz.

p.p.s. the report running script currently does some chopping of the report id in various places and some pattern matching to determine a match.

  • Comment on Suggestions on storing loosely similar data

Replies are listed 'Best First'.
Re: Suggestions on storing loosely similar data
by Roger (Parson) on Sep 10, 2003 at 23:50 UTC
    Yep this sounds more like a database design issue, not a perl issue. If I was to design such a feature, I would create a user access log table somewhat like this:

    Table: access_log
    seq_noSequence Number
    uidUser ID
    access_timeTimestamp
    report_idReport ID
    report_optionReport Option

    Then it's just a simple matter of constructing your SQL queries to make use of the access log.

    By the way, you are using a database backend, aren't you?

    (I missed the sequence id in my post earlier, here's the amendment, thanks! And of cause this is to assume the traffic volume is relatively small.)

      This way you would have one record for every Report Option for every report. If you use the timestamp function in the insert clause it would be possible that the options get a different timestamps. This is not a desired situation.

      The root of the problem is that this table is still not enough normalized. I would make it two tables one for uid, access_time, report_id and the key seq_no and another for just report_option and key seq_no.

        Let me see if I follow. This table is an access log (reports run):

        seq_no (auto incremented key)uidaccess_timereport_id
        this recordwhowhenwhich report

        And then there's another that tracks the individual options for each report in the access log:

        seq_no (key)report_options
        record lookupstring of report options

        Is that what you were proposing? It seems good to me, though I want to percolate on that for a bit (since I'm still getting my head around how this access log is going to be used... and of course, you all have no idea :) ).

        So my code would do this to figure out how many reports were run by each uid w/in a date range (excuse my psuedo code in advance):

        (psuedo code) select count(seq_no) where uid like $uid and report_id like $report_id + and $access_range1 > access_time > $access_range2

        And to check to see if a user was re-running a report (ie running a report w/ identical options w/in a certain time period) I'd match report_id, uid and access_time, to get seq_no that matched. Then I'd check the suspected re-run report options with report_options in table2.

        Sound about right? Is there further normalization that could happen? Thoughts on splitting the report_options out like this? Anything else? I really appreciate your thoughts on this.

        AH

      This looks good, except I would give every record a sequential id - otherwise you have a four-column key...
Re: Suggestions on storing loosely similar data
by adrianh (Chancellor) on Sep 10, 2003 at 22:10 UTC

    If all you're interested in is having a unique ID to compare and you don't need to extract the report options from the ID then taking an MD5 of the string would seem to be one solution. Just store and compare the MD5 string.