in reply to Data structure advice

A lot depends on how you plan on looking up times. Personally, I'd use a lightweight database, like DBD::DB_File, so that I can offload all my date arithmetic to something that does it, and does it correctly (and fast). Thus, your question could be reworded (using SQL) as:
SELECT pressure FROM baro_over_time WHERE time_measured < ? ORDER BY time_measured DESC
Then, the first value returned is the one you want. (Since the values are in descending order, the last measurement before the time specified is the first one returned.)
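
In DBI terms, that might look something like the following (a sketch only and untested, per my sig; it assumes a connected handle in $dbh and the baro_over_time table from the query above):

use strict;
use warnings;
use DBI;

# Sketch: return the last pressure reading taken before $when.
# Assumes $dbh is already connected and baro_over_time(time_measured, pressure)
# holds one row per reading.
sub pressure_before {
    my ($dbh, $when) = @_;
    my $sth = $dbh->prepare(
        'SELECT pressure FROM baro_over_time
          WHERE time_measured < ?
          ORDER BY time_measured DESC'
    ) or die "Cannot prepare: " . $dbh->errstr;
    $sth->execute($when) or die "Cannot execute: " . $sth->errstr;
    my ($pressure) = $sth->fetchrow_array;   # first row = most recent prior reading
    $sth->finish;
    return $pressure;
}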

------
We are the carpenters and bricklayers of the Information Age.

Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

Replies are listed 'Best First'.
Re: Data structure advice
by b10m (Vicar) on Jan 05, 2004 at 15:09 UTC

    I would suggest SQL too. If this "data gathering" goes on like the sample, you'll have a massive amount of data in the not-too-distant future. You might want to look at a "real database", or, to keep it simple: (*ding* here it is again) DBD::SQLite
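
    A minimal sketch of getting started with it (untested; the file name is a placeholder, and the column names are the ones used later in this thread):

    use strict;
    use warnings;
    use DBI;

    # Create (or open) a single-file SQLite database and a table for the readings.
    my $dbh = DBI->connect('dbi:SQLite:dbname=pressure.db', '', '',
                           { RaiseError => 1, AutoCommit => 1 });
    $dbh->do(q{
        CREATE TABLE pressure (
            weather_time  TEXT,
            baro_pressure REAL
        )
    });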

    --
    b10m
Re: Re: Data structure advice
by robinbowes (Beadle) on Jan 05, 2004 at 15:14 UTC
    Hi,

    Thanks for the replies.

    In this particular case, I define "best" as "quickest lookup" as there are only 48 values per day (I am processing each day's data separately) and I will be doing the lookup approx. 8000 times per day.

    I'd like to do the lookup something like this:

        my $pressure = get_pressure("2003-10-12 16:09:06");

    where $pressure is set to the most recent pressure value prior to "2003-10-12 16:09:06".

    I am perfectly capable of writing a function that searches through a list of values to return the required value. I was hoping that there would be a nice Perl trick to save me the bother!
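
    Something along these lines is what I have in mind (untested, and the sample readings are made-up placeholders just to illustrate):

        # Readings sorted ascending by time; ISO "YYYY-MM-DD HH:MM:SS" stamps
        # sort correctly as plain strings.
        my @readings = (
            [ '2003-10-12 15:50:00', 1012.4 ],   # [ time, pressure ] - placeholder values
            [ '2003-10-12 16:20:00', 1012.1 ],
        );

        sub get_pressure {
            my ($when) = @_;
            my $pressure;
            for my $r (@readings) {
                last if $r->[0] ge $when;        # stop at the first reading not before $when
                $pressure = $r->[1];
            }
            return $pressure;                    # undef if nothing earlier exists
        }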

    R.

    --

    Robin Bowes | http://robinbowes.com

      The nice Perl trick is a SQL trick. And, don't write the function - it's already written! Use the tools you have. Plus, once you have it in a database, you're going to find other capabilities. Trust me.

      ------
      We are the carpenters and bricklayers of the Information Age.

      Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

        Hi,

        I am aware of the capabilities of using SQL, but feel that it is overkill for this application.

        I will have, at most, 48 pressure values per day to look up; these will be read from a .csv format file. I could, I agree, use DBI (I think DBD::CSV would be most appropriate) but I can't help thinking that the performance overhead will be too much.

        I suppose I could always just try it, i.e. profile rather than speculating.
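
        The core Benchmark module makes that sort of comparison easy enough (a rough, untested sketch; get_pressure_csv and get_pressure_scan stand in for whichever two implementations I end up comparing):

            use strict;
            use warnings;
            use Benchmark qw(cmpthese);

            # Run each candidate for about 5 CPU seconds and compare rates.
            cmpthese(-5, {
                dbd_csv   => sub { get_pressure_csv('2003-10-12 16:09:06')  },
                perl_scan => sub { get_pressure_scan('2003-10-12 16:09:06') },
            });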

        I'll wait to see if there are any other suggestions then knock something up tomorrow.

        Cheers,

        R.

        --

        Robin Bowes | http://robinbowes.com

      Hmm. How do you arrive at 48 values per day? The snippet you showed has 3 values per hour, which over a 24-hour day is 72 values per day :-)

        Doh! There is a logical explanation for this...

        The data was extracted from the production system, following data processing. The raw data is generated every 30 mins at 20 and 50 mins past the hour. The values on the hour are averages calculated from the previous and following values.

        R.

        --

        Robin Bowes | http://robinbowes.com

      For a "quickest lookup", one would use a lookup table. Considering your query with second precisions, you'd need to store 62400 entries per day, so if you have to query data from lots of days, you might need to invest in memory. But it gives you the quickest lookup time.

      Abigail

        ...you'd need to store 62400 entries per day...

        I think you mean "86400 entries per day"; otherwise, I have no clue where you pulled that number from.

Re: Re: Data structure advice
by robinbowes (Beadle) on Jan 06, 2004 at 23:03 UTC
    Hi again

    I decided to give the database approach a go and see how fast it is.

    I'm using DBD::AnyData and creating an in-memory table, which should be pretty quick.

    However, I'm having trouble with my SQL. This works (i.e. it returns the time of the latest pressure data before the time of the event):

    SELECT max(weather_time) FROM pressure where (weather_time < '$event_time')
    I'd really like to do this:

    SELECT max(weather_time), baro_pressure FROM pressure where (weather_time < '$event_time')

    But of course this is not valid SQL. Does the DBI interface support nested SQL? I've tried this:

    SELECT weather_time, baro_pressure from pressure where weather_time = (select max(weather_time) FROM pressure where (weather_time < '$radar_time'))

    This gives the following error:

    DBD::AnyData::db prepare failed: SQL ERROR: Bad table or column name '^0^' has chars not alphanumeric or underscore!

    Any suggestions?

    R.

    Update

    As a work around, I'm using two SQL queries to get the information. The following code snippet is from a function:

    my $sth = $dbh->prepare("SELECT max(weather_time) FROM pressure where (weather_time < '$data_time')")
        or die "Cannot prepare: " . $dbh->errstr();
    $sth->execute() or die "Cannot execute: " . $sth->errstr();
    my $row = $sth->fetchrow_hashref;
    my $press_time = $row->{'weather_time'};
    if (!defined $press_time) {
        die "No pressure data earlier than the time of the data point ($data_time)";
    }
    $sth = $dbh->prepare("SELECT baro_pressure from pressure where (weather_time = '$press_time')")
        or die "Cannot prepare: " . $dbh->errstr();
    $sth->execute() or die "Cannot execute: " . $sth->errstr();
    $row = $sth->fetchrow_hashref;
    $sth->finish();
    return $row->{'baro_pressure'};
    Now, this works, but is not exactly speedy.

    The data set I am testing with has data points from 2003-10-12, and I initially created a CSV file containing pressure data for all of October (it has 1331 pressure lines) but this proved to be painfully slow. I improved things by creating a file containing just pressure data from 2003-10-12 (it has only 46 pressure lines).

    This solution is functional, but I'd like to improve it so that I don't have to mess about with subsets of the pressure data.

    Perhaps it would be an idea to create some sort of indexed database file from the CSV file and use a DBI interface to that?
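
    Something like this one-off import is what I have in mind (untested sketch; the file names are placeholders and the CSV layout is assumed to be weather_time,baro_pressure per line):

    use strict;
    use warnings;
    use DBI;

    # One-off import of pressure.csv into an SQLite file with an index
    # on the time column, so later lookups don't scan the whole table.
    my $dbh = DBI->connect('dbi:SQLite:dbname=pressure.db', '', '',
                           { RaiseError => 1 });
    $dbh->do('CREATE TABLE pressure (weather_time TEXT, baro_pressure REAL)');
    $dbh->do('CREATE INDEX idx_weather_time ON pressure (weather_time)');

    open my $fh, '<', 'pressure.csv' or die "pressure.csv: $!";
    my $ins = $dbh->prepare('INSERT INTO pressure VALUES (?, ?)');
    while (my $line = <$fh>) {
        chomp $line;
        my ($time, $press) = split /,/, $line;
        $ins->execute($time, $press);
    }
    close $fh;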

    Thoughts?

    R.

    --

    Robin Bowes | http://robinbowes.com

      Your two-query approach is often acceptable. However, if you want to combine them into one statement, try something like:
      # Assumes the time is coming in as 12:47:02
      sub get_pressure {
          my ($data_time) = @_;

          my $sql = <<__END_SQL__;
      SELECT weather_time
            ,baro_pressure
        FROM pressure
       WHERE weather_time < TO_DATE(?, 'HH:MI:SS')
      ORDER BY weather_time DESC
      __END_SQL__

          my $sth = $dbh->prepare($sql) || die ...
          $sth->execute($data_time) || die ...
          $sth->bind_columns(\my ($press_time, $pressure));

          # You only want the first row returned.
          $sth->fetch;

          return $pressure;
      }
      (Note the use of placeholders ...)

      You're looking for the largest weather_time, so you can just order them descending.

      I'm not fully up on DBD::AnyData's restrictions. Personally, I would have used DBD::SQLite and taken the overhead of the schema definition. *shrugs*
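
      With an SQLite backend, the same lookup can also lean on LIMIT so only one row ever comes back (a sketch, reusing the table and column names from above):

          SELECT weather_time, baro_pressure
            FROM pressure
           WHERE weather_time < ?
           ORDER BY weather_time DESC
           LIMIT 1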

      ------
      We are the carpenters and bricklayers of the Information Age.

      Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.