Suggestions on storing loosely similar data

alienhuman has asked for the wisdom of the Perl Monks concerning the following question:

This isn't strictly a perl question, it's more a db design question--but I'm going to be using perl eventually to get, insert and manipulate this data. If this is inappropriate for PM, please delete this writeup.

Consider:

I have clients who come to my site and run reports on financial data (well, the company I work for does).
They pay for each report they run
They pay less if they re-run a report w/in a certain time period
These reports have widely varying query parameters

I'm storing the information about who ran what report when in a single table (this is a legacy). One of the fields is an ID that is generated using a uid, a timestamp and a varying list of whitespace delimited strings that denote various report options.

For instance, one report id might be something like "uid333 200309091230 H J Krm 3 2003-6-30" while another may be "uid333 200309091231 A L foobar 4". So you can see that while these ids have something in common, they also have a lot that's not in common. The other columns in the table are identical. The id is what is used to determine if the client has run the report previously within a specified time limit, and they are charged less for the subsequent runnings.

For example, if I run the report that corresponds to "uid 200309091230 H J Krm 3 2003-6-30" in the morning, and again after lunch, the report running script compares the ids (generated partially from the report options) and if there is a match, compares timestamps (data in another column) and charges accordingly.

My question is this: does anyone have any thoughts about how to store the report options in such a way that eliminates this cumbersome report id string? How does one deal with data that is loosely similar?

As a p.s., the # options for a report are not predictable. In other words, it's not enough to know that I ran a XYZ report at 2pm. You must know that I ran an XYZ report with the values 1,1,2,3,5,8. Which is entirely different than the XYZ report with values 5,8,13 or just 21,baz.

p.p.s. the report running script currently does some chopping of the report id in various places and some pattern matching to determine a match.

Comment on Suggestions on storing loosely similar data

Replies are listed 'Best First'.

Re: Suggestions on storing loosely similar data
by Roger (Parson) on Sep 10, 2003 at 23:50 UTC

Table: access_log
seq_no	Sequence Number
uid	User ID
access_time	Timestamp
report_id	Report ID
report_option	Report Option

(I missed the sequence id in my post earlier, here's the amendment, thanks! And of cause this is to assume the traffic volume is relatively small.)

[reply]

Re: Re: Suggestions on storing loosely similar data

by zby (Vicar) on Sep 11, 2003 at 07:40 UTC

The root of the problem is that this table is still not enough normalized. I would make it two tables one for uid, access_time, report_id and the key seq_no and another for just report_option and key seq_no.

[reply]

Re: Re: Re: Suggestions on storing loosely similar data

by alienhuman (Pilgrim) on Sep 12, 2003 at 15:31 UTC

Let me see if I follow. This table is an access log (reports run):

seq_no (auto incremented key)	uid	access_time	report_id
this record	who	when	which report

And then there's another that tracks the individual options for each report in the access log:

seq_no (key)	report_options
record lookup	string of report options

Is that what you were proposing? It seems good to me, though I want to percolate on that for a bit (since I'm still getting my head around how this access log is going to be used... and of course, you all have no idea :) ).

So my code would do this to figure out how many reports were run by each uid w/in a date range (excuse my psuedo code in advance):

(psuedo code)
select count(seq_no) where uid like $uid and report_id like $report_id
+ and $access_range1 > access_time > $access_range2
[download]

And to check to see if a user was re-running a report (ie running a report w/ identical options w/in a certain time period) I'd match report_id, uid and access_time, to get seq_no that matched. Then I'd check the suspected re-run report options with report_options in table2.

Sound about right? Is there further normalization that could happen? Thoughts on splitting the report_options out like this? Anything else? I really appreciate your thoughts on this.

[reply]
[d/l]

Re: Re: Re: Re: Suggestions on storing loosely similar data

by zby (Vicar) on Sep 12, 2003 at 16:22 UTC

Re: Re: Suggestions on storing loosely similar data

by bean (Monk) on Sep 11, 2003 at 03:30 UTC

This looks good, except I would give every record a sequential id - otherwise you have a four-column key...

[reply]

Re: Suggestions on storing loosely similar data
by adrianh (Chancellor) on Sep 10, 2003 at 22:10 UTC

If all you're interested in is having a unique ID to compare and you don't need to extract the report options from the ID then taking an MD5 of the string would seem to be one solution. Just store and compare the MD5 string.

[reply]