AndyH has asked for the wisdom of the Perl Monks concerning the following question:

The question about time and date handling reminded me of a problem I have been putting off for too long.

I have a membership database (about 4500 records) - actually it's a pipe delimited Unicode file, but don't worry about that. The records contain start and end dates (just year month day, no times) of when people were members. A complication is that many of the people have joined and left several times, i.e. they have several start and end dates. Current members have just a start date and no end date (although they may have both start and end dates for previous periods of membership).

My problem is that I need to be able to ask questions like show me everyone who was a member on date X in the past or show me everyone who is now a member or who has been a member in the past ten years.

Date::Range sort of looks like it might help. Is that the right approach (especially when there will be 4500 x 2 or 3 date ranges to be considered)?

Thanks
Andy

Replies are listed 'Best First'.
Re: Multiple date ranges
by BrowserUk (Patriarch) on Feb 03, 2003 at 15:38 UTC

    Convert your dates to days since your membership started. Date::Manip is excellent at parsing dates in pretty much any known format (though the docs are little incomprehensible, its worth the effort persevering with them). Its function ParseDate() and Date_Calc() will do all the work for you.

    The basic idea is to build up a string one char per day, '1' when they were a member and '0' when not. Then testing for membership on a particular day involves just calculating the offset for the day in question an using substr to test if there's a '1' there. For a range of days, calculate the offset and length, substr the string and use a simple m// to see if there is a '1' anywhere in the substring. That's is really.

    If the length of the strings times the number of members, eg. 10years * 356 days * 4500 members = approx 16MB is considered too much then you could trade substr in for vec and use bits instead of bytes. A little more complex, but 2MB only.

    I've heard that Date::Calc is almost as good at parsing strnge date formats as Date::Manip and runs much faster because its in C. I haven't had occasion t ry it.

    A bit of sample code to demonstrate the idea. Kinda of works but has bugs. Might get you started though with the date stuff.

    #! perl -slw use strict; use Data::Dumper; use Date::Manip; my $days_since_open = 365*3; # 3 years my $base_date = ParseDate( '1st January 2000' ); my %members; while(<DATA>) { my @stuff = split/\|/; my $member = shift@stuff; $members{$member}= '0' x ($days_since_open); # print $member; while (@stuff > 1) { my $date1 = ParseDate( shift@stuff ); my $start = Delta_Format( DateCalc( $base_date, $date1), 0, '% +dh' ); my $date2 = ParseDate( shift@stuff ); my $length = Delta_Format( DateCalc($date1, $date2), 0, '%dh') +; substr($members{$member}, $start, $length) = '1' x $length; } if (@stuff) { my $last = ParseDate( shift@stuff ); my $last_offset = DateCalc($base_date, $last ); my $current = Delta_Format( $last_offset, 0, '%dh'); substr($members{$member}, $current) = '1' x ($days_since_open +- $current); } } my $start = do{{ print 'Start date?: '; warn "Bad date, try again", redo unless ParseDate(scalar <STDIN>); }}; my $end = do{ print 'End date?: ';ParseDate(scalar <STDIN>); }; #print "$start, $end"; my $start_days = Delta_Format( DateCalc( $base_date, $start ), 0, '%dh +' ); my $end_days = Delta_Format( DateCalc( $base_date, $end ), 0, '%dh' ) +if $end; #print "$start_days, $end_days"; for my $member (keys %members) { print $member, ' was a member ', $end ? 'between ' : 'on ', $start, $end ? ' and '.$end : '', if substr($members{$member}, $start_days, $end_days||1) =~ m[1 +]; } =pod Output C:\test>232212 Start date?: 1st April 2000 End date?: ^Z Use of uninitialized value in string eq at e:/Perl/site/lib/Date/Manip +.pm line 4155, <STDIN> line 1. Fred was a member on 2000040100:00:00 Barney was a member on 2000040100:00:00 Wilma was a member on 2000040100:00:00 C:\test>232212 Start date?: 1st may 2000 End date?: 1st december 2002 Fred was a member between 2000050100:00:00 and 2002120100:00:00 Barney was a member between 2000050100:00:00 and 2002120100:00:00 Betty was a member between 2000050100:00:00 and 2002120100:00:00 Wilma was a member between 2000050100:00:00 and 2002120100:00:00 C:\test>232212 Start date?: 1st may 2000 End date?: 1st december 2001 Fred was a member between 2000050100:00:00 and 2001120100:00:00 Barney was a member between 2000050100:00:00 and 2001120100:00:00 Wilma was a member between 2000050100:00:00 and 2001120100:00:00 C:\test> =cut __DATA__ Fred| 2000-march 31 | 2000/april/7 | 19/jan/2002 | 24th february 2002 +| 16sep2002 Barney| 15 february 2000 | Feb 15th 2002 | 1 dec 2002 Wilma| 01apr2000| 2002/19sep Betty| 01DEC2002

    Examine what is said, not who speaks.

    The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

      Thanks for this. Nice oblique approach. One question: is there anything in your method that is going to be upset by the fact that the value of "BaseDate" is going to be way back in 1949? (I'm thinking Unix epoch stuff here, not just the number of days since then.)

        As far as I can tell, because Date::Manip uses it's own/ISO representation internally for dates, times and delta's, Gregorian -v- Julian -v- other calander systems aside, it should do the right thing for pretty much any set of dates that you want to throw at it. I haven't personnally verified this, but the module's been around long enough that any such fopars would probably have shown up by now.

        As for that code that I posted (besides being bug-ridden), it's just long strings of 1's and 0's, so there shouldn't be any problems as your only representing days. However it may not accurately account for leap days as you can't represent 0.2439 (from memory) days in a bit or byte. Meaning that if you need the kind of accuracy that would account for this, the method may not be useful to you. People joining or leaving on Feb 29th, or counts of active members on ranges that start or finish on Feb 29th could be mis-counted. I can't see an easy fix for this.

        The only other comment is that with a range of 50+ years and 4500+ members, the storage requirements rise to close to 80 MB. Moving to using a bit/day instead of byte/day this would reduce to 10MB which might be worth the effort, but vec isn't a direct substitute for substr unfortunately, so the code gets more complicated.


        Examine what is said, not who speaks.

        The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

Re: Multiple date ranges
by Anonymous Monk on Feb 03, 2003 at 12:59 UTC
    actually it's a pipe delimited Unicode file, but don't worry about that

    That is your main problem. Move to a relational database (SQL) and your problem is simplified to simple SQL syntaxt.

    You'll be better off with a lightweight SQL database like DBD::CSV or DBD::SQLite everytime (if you cannot afford Mysql/Pgsql and the like).

      I knew I should have qualified the statement about the pipe-delimited file. This is one of those "*I* wouldn't have done it this way, but I've inherited this system" problems.

      In the long-term, I indeed do intend to move it to a DBI/DBD based system, but there are all sorts of other details I won't bore you with that are going to make that a timescale of months, not days. Meanwhile, I still need to solve this in a non-SQL way using what I have: hashes made by reading the pipe-delimited file ...

      A.

        What format are the timestamps in?

        Convert the timestamps to seconds (use any means neccessary).

        Create a datastructure resembling

        $foo{'bar'} = [ 1, # first joined 2, # left 3, # joined again 4, # left 5, # joined again ];
        If a user has an odd number of timestamps, he's still a member.

        It's fairly obviouls how to go about querying for other values.

        If the code is clean, switching to DBI needn't take more than a few days (i got a feeling its not though *shudder*).

        You don't need to solve this in a non-SQL way.

        Take this opportunity to port your database to SQL (build it out of the pipe file).

        Then run SQL queries (which is what you're trying to do), so by the time your frontend catches up (the other code), you already have a nice database/search backend all ready.

        Using CGI::Application the re-development of your frontend can take lots and lots and lots less time.

        Using DBD::SQLite can save you the setup of a real database.

        use DB_File; with the $DB_HASH and allow keys to have multiple values.

        Make the the keys the date stamps, and the values user names.

        Now you can do partial matches, and can have a many-to-one relationship (on 2003-02-01 , 10 users joined, 4 left)

        Since you can have multiple values, you can make the values contain if a user left or joined (pipe it to me), or you can have 2 hashes (%joined, %left), or simply prefix the date strings (left-2003-02-01 or joined-2002-09-22) so you can keep a single %hash.

        I'd do it this way, cause I love DB_File.

        You don't need to solve this with SQL if you don't want to. In perl, TMTOWTDI. I would suggest you look at the module Date::Set. You should be able to add to the sets like this:
        $a->event( start => '19971024T120000Z', end => '19971025T120000Z' ); # one period $a->event( start => '20001101T120000Z'); # open-ended
        Simple set-inclusion queries will tell you who is active on a given date. SQL might be better in the long run, but something like this would get you going.
Re: Multiple date ranges
by steves (Curate) on Feb 04, 2003 at 05:33 UTC

    I agree that using a SQL model is best. Have you looked at DBD::AnyData? It sounds like you may be able to use it without porting to a database -- just pointing it to your data. I never used it though, so I'm no expert. I started down that path once but ended up importing the data to an existing relational DB instead.

      I did look at AnyData before, as the data on this project is in both pipe-delimited (i.e. CSV) and XML and because it offered both DBD and Perl flavour access, but I didn't have much luck with it - specifically, it wouldn't (for me, anyway) save back changes to the XML (even though I took note of the caveat about having to specifically do adExport). In fact, I had lots of trouble with adExport. And, searching around, I couldn't find any code using it for XML. So, I sort of gave up on it.

      Maybe it's time to update the XML::Parser and XML::Twig and try again ...