Re: Data structure advice
by hardburn (Abbot) on Jan 05, 2004 at 14:54 UTC
When coming up with a data structure, ask yourself what you'll need to search by. That piece of data will be the first level of either a hash or an array. If you can easily map your data to something reasonable for an array index, use an array. Otherwise, use a hash.
In your case, I'd guess you'd want something like this:
$struct{$date}{$time}
where $date is the date you're looking for, $time is the time, and the value held at $struct{$date}{$time} is the barometric pressure.
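A minimal sketch of that layout (the dates, times, and pressure values here are made up for illustration):

```perl
use strict;
use warnings;

# First level keyed by date, second level keyed by time.
my %struct;
$struct{'2003-10-12'}{'16:00:00'} = 1013.2;   # sample pressure readings
$struct{'2003-10-12'}{'16:30:00'} = 1012.8;

# Lookup is then a direct two-level hash access.
my $pressure = $struct{'2003-10-12'}{'16:00:00'};
print "$pressure\n";   # 1013.2
```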
---- I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
-- Schemer
: () { :|:& };:
Note: All code is untested, unless otherwise stated
Re: Data structure advice
by dragonchild (Archbishop) on Jan 05, 2004 at 14:58 UTC
A lot depends on how you plan on looking up times. Personally, I'd use a lightweight database, like DBD::DB_File, so that I can offload all my date arithmetic to something that does it, and does it correctly (and fast). Thus, your question could be reworded (using SQL) as:
SELECT pressure
FROM baro_over_time
WHERE time_measured < ?
ORDER BY time_measured DESC
Then, the first value returned is the one you want. (Since the values are in descending order, the last measurement before the time specified is the first one returned.)
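A hedged sketch of running that query via DBI, using an in-memory DBD::SQLite handle and sample rows (the table name comes from the SQL above; the driver choice and data are assumptions):

```perl
use strict;
use warnings;
use DBI;

# Illustrative in-memory database; a real setup would use a file.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1 });

$dbh->do('CREATE TABLE baro_over_time (time_measured TEXT, pressure REAL)');
my $ins = $dbh->prepare('INSERT INTO baro_over_time VALUES (?, ?)');
$ins->execute('2003-10-12 16:00:00', 1013.2);
$ins->execute('2003-10-12 16:30:00', 1012.8);

# Descending order means the first row is the last reading before the cutoff.
# ISO-style timestamps compare correctly as strings.
my ($pressure) = $dbh->selectrow_array(
    'SELECT pressure FROM baro_over_time
      WHERE time_measured < ?
      ORDER BY time_measured DESC',
    undef, '2003-10-12 16:09:06');

print "$pressure\n";   # 1013.2
```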
------
We are the carpenters and bricklayers of the Information Age.
Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.
I would suggest SQL too. If this "data gathering" goes on like the sample, you'll have a massive amount of data in the not-too-distant future. You might want to look at a "real database", or, to keep it simple: (*ding* here it is again) DBD::SQLite.
Hi,
Thanks for the replies.
In this particular case, I define "best" as "quickest lookup", as there are only 48 values per day (I am processing each day separately) and I will be doing the lookup approx. 8000 times per day.
I'd like to do the lookup something like this:
my $pressure = get_pressure("2003-10-12 16:09:06");
where $pressure is set to the most recent pressure value prior to "2003-10-12 16:09:06".
I am perfectly capable of writing a function that searches through a list of values to return the required value. I was hoping that there would be a nice Perl trick to save me the bother!
R.
--
Robin Bowes | http://robinbowes.com
The nice Perl trick is a SQL trick. And, don't write the function - it's already written! Use the tools you have. Plus, once you have it in a database, you're going to find other capabilities. Trust me.
For the quickest lookup, one would use a lookup table. Considering that you query with second precision, you'd need to store 86400 entries per day, so if you have to query data from lots of days, you might need to invest in memory. But it gives you the quickest lookup time.
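A sketch of that trade-off, assuming readings every half hour: precompute one slot per second by carrying the last reading forward (only one hour is filled here, and the values are made up, to keep the sketch small):

```perl
use strict;
use warnings;

# Raw readings, keyed by seconds since midnight (sample values).
my %reading = ( 0 => 1013.2, 1800 => 1012.8 );

# Build the per-second lookup table by copying the previous value forward.
my @table;
my $last;
for my $sec (0 .. 3599) {          # a full day would be 0 .. 86399
    $last = $reading{$sec} if exists $reading{$sec};
    $table[$sec] = $last;
}

# Any query then costs a single array index.
my $sec = 16 * 60 + 9;             # 00:16:09 as seconds since midnight
print "$table[$sec]\n";   # 1013.2
```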
Abigail
Hi again
I decided to give the database approach a go and see how fast it is.
I'm using DBD::AnyData and creating an in-memory table, which should be pretty quick.
However, I'm having trouble with my SQL. This works (i.e. it returns the time of the latest pressure reading before the event):
SELECT max(weather_time) FROM pressure where (weather_time < '$event_time')
I'd really like to do this:
SELECT max(weather_time), baro_pressure FROM pressure where (weather_time < '$event_time')
But of course this is not valid SQL. Does the DBI interface support nested SQL? I've tried this:
SELECT weather_time, baro_pressure from pressure where weather_time = (select max(weather_time) FROM pressure where (weather_time < '$radar_time'))
This gives the following error:
DBD::AnyData::db prepare failed: SQL ERROR: Bad table or column name '^0^' has chars not alphanumeric or underscore!
Any suggestions?
R.
Update
As a work around, I'm using two SQL queries to get the information. The following code snippet is from a function:
my $sth = $dbh->prepare("SELECT max(weather_time) FROM pressure where (weather_time < '$data_time')")
    or die "Cannot prepare: " . $dbh->errstr();
$sth->execute()
    or die "Cannot execute: " . $sth->errstr();
my $row = $sth->fetchrow_hashref;
my $press_time = $row->{'weather_time'};
if (!defined $press_time) {
    die "No pressure data earlier than the time of the data point ($data_time)";
}
$sth = $dbh->prepare("SELECT baro_pressure from pressure where (weather_time = '$press_time')")
    or die "Cannot prepare: " . $dbh->errstr();
$sth->execute()
    or die "Cannot execute: " . $sth->errstr();
$row = $sth->fetchrow_hashref;
$sth->finish();
return $row->{'baro_pressure'};
Now, this works, but is not exactly speedy.
The data set I am testing with has data points from 2003-10-12, and I initially created a CSV file containing pressure data for all of October (it has 1331 pressure lines) but this proved to be painfully slow. I improved things by creating a file containing just pressure data from 2003-10-12 (it has only 46 pressure lines).
This solution is functional, but I'd like to improve it so I don't have to mess about with subsets of the pressure data.
Perhaps it would be an idea to create some sort of indexed database file from the CSV file and use a DBI interface to that?
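That could look something like this with DBD::SQLite (an assumption; the column names match the queries above, but the sample rows and index name are made up). In real use the CSV rows would be inserted in a loop instead of the two hard-coded readings:

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1 });

# An index on weather_time keeps the lookup fast as the table grows.
$dbh->do('CREATE TABLE pressure (weather_time TEXT, baro_pressure REAL)');
$dbh->do('CREATE INDEX idx_weather_time ON pressure (weather_time)');

my $ins = $dbh->prepare('INSERT INTO pressure VALUES (?, ?)');
$ins->execute('2003-10-12 16:00:00', 1013.2);   # stand-ins for the CSV rows
$ins->execute('2003-10-12 16:30:00', 1012.8);

# One query replaces the two-step MAX() lookup.
my ($pressure) = $dbh->selectrow_array(
    'SELECT baro_pressure FROM pressure
      WHERE weather_time < ?
      ORDER BY weather_time DESC LIMIT 1',
    undef, '2003-10-12 16:09:06');
print "$pressure\n";   # 1013.2
```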
Thoughts?
R.
Your two-query approach is often acceptable. However, if you want to combine them into one statement, try something like:
# Assumes the time is coming in as 12:47:02
sub get_pressure
{
    my ($data_time) = @_;

    my $sql = <<__END_SQL__;
SELECT weather_time
      ,baro_pressure
  FROM pressure
 WHERE weather_time < TO_DATE(?, 'HH:MI:SS')
 ORDER BY weather_time DESC
__END_SQL__

    my $sth = $dbh->prepare($sql) || die ...
    $sth->execute($data_time) || die ...

    $sth->bind_columns(\my ($press_time, $pressure));

    # You only want the first row returned.
    $sth->fetch;

    return $pressure;
}
(Note the use of placeholders ...)
You're looking for the largest weather_time, so you can just order them descending.
I'm not fully up on DBD::AnyData's restrictions. Personally, I would have used DBD::SQLite and taken the overhead of the schema definition. *shrugs*
Re: Data structure advice
by Abigail-II (Bishop) on Jan 05, 2004 at 14:52 UTC
What's the best data structure to use to store this data?
That depends on your definition of "best", the amount of data, the number of queries, and the precision of your data and query data.
Two possibilities:
- An array, where the values are kept in order. To query, perform a binary search.
- A hash, with entries for every minute (or second, depending on the precision of your query data). Gaps in the raw data are filled by copying the previous reading forward.
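The binary-search option can be sketched like this, using sortable ISO-style timestamps (the readings are illustrative):

```perl
use strict;
use warnings;

# Readings sorted by time; each entry is [timestamp, pressure].
my @readings = (
    ['2003-10-12 16:00:00', 1013.2],
    ['2003-10-12 16:30:00', 1012.8],
    ['2003-10-12 17:00:00', 1012.5],
);

# Return the pressure of the last reading strictly before $when.
sub pressure_before {
    my ($when) = @_;
    my ($lo, $hi) = (0, $#readings);
    my $best;
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($readings[$mid][0] lt $when) {
            $best = $mid;        # candidate; look for a later one
            $lo = $mid + 1;
        } else {
            $hi = $mid - 1;
        }
    }
    return defined $best ? $readings[$best][1] : undef;
}

print pressure_before('2003-10-12 16:09:06'), "\n";   # 1013.2
```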
Abigail
Re: Data structure advice
by duff (Parson) on Jan 05, 2004 at 15:18 UTC
Others have answered your question much as I would, but I'd like to add a possibility that no one seems to have offered: perhaps you already have the "best data structure" -- a file. I don't know what your requirements are, but assuming your data file is in date order (as it appears to be), the cost of opening and searching through that file may be small enough for you to do it repeatedly. Of course, if "repeatedly" really means "many times per second", that's probably not true :-)
Update: You said in an earlier post that you'll be looking up these values about 8000 times per day. That's about once every 10 seconds, which is certainly enough time to open and search a file, especially if you do something smart like keeping each day's worth of data in a separate file rather than all of the data in one big file.
Also, if efficient lookup is necessary and your data are coming in at fixed times like that, you could use a hash: convert the date/time to a GMT value, normalize each date/time you want the "nearest" barometric pressure for down to one of your fixed times, and do a hash lookup.
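A sketch of that normalize-then-hash idea, using core Time::Local (the reading values and the half-hour interval are assumptions):

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Readings keyed by epoch second, one per half hour (illustrative values).
my %pressure_at = (
    timegm(0, 0, 16, 12, 9, 2003)  => 1013.2,   # 2003-10-12 16:00:00 GMT
    timegm(0, 30, 16, 12, 9, 2003) => 1012.8,   # 2003-10-12 16:30:00 GMT
);

# Normalize an arbitrary time down to the preceding half-hour boundary.
sub get_pressure {
    my ($y, $mo, $d, $h, $mi, $s) = @_;
    my $t = timegm($s, $mi, $h, $d, $mo - 1, $y);
    my $slot = $t - $t % (30 * 60);             # round down to the 30-minute slot
    return $pressure_at{$slot};
}

print get_pressure(2003, 10, 12, 16, 9, 6), "\n";   # 1013.2
```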
Re: Data structure advice
by ysth (Canon) on Jan 05, 2004 at 18:38 UTC
To give a reasonable answer, a few things need to be clearer. Are you going to load the information in a Perl process that will stay alive and process your thousands of lookups? Or does each lookup have to load the data from the .csv file? How many total data points will there be? Will new data be arriving and need to be loaded as you are processing?
And, do you only ever care about a single day at a time?
I'd be tempted to load all the data into a hash, then use a tie interface that computes what the closest preceding data point would be and returns the value corresponding to it.
As with most questions of this sort, the proof is in the benchmarking, and if you really want to optimize for search speed, you'll want to benchmark.
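A sketch of such a tie, built on core Tie::Hash: FETCH falls back to the closest preceding key when the exact one is missing (the class name and data are made up).

```perl
use strict;
use warnings;

package NearestHash;
use Tie::Hash;
our @ISA = ('Tie::StdHash');

# If the exact key is missing, return the value stored under the
# largest key that sorts before the requested one.
sub FETCH {
    my ($self, $key) = @_;
    return $self->{$key} if exists $self->{$key};
    my ($best) = sort { $b cmp $a } grep { $_ lt $key } keys %$self;
    return defined $best ? $self->{$best} : undef;
}

package main;

tie my %pressure, 'NearestHash';
%pressure = (
    '2003-10-12 16:00:00' => 1013.2,
    '2003-10-12 16:30:00' => 1012.8,
);

print $pressure{'2003-10-12 16:09:06'}, "\n";   # 1013.2
```

Note that the fallback scan in FETCH is linear over the keys; for 48 readings per day that hardly matters.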
Hi Zed,
I do only care about a single day at a time and, to answer a question from a previous reply, I will be loading all 48 values before processing the many thousands of data points that require the lookup.
The tie interface looks like what I need; I'll have a scout around for information on that. Can you point me at any suitable examples?
Cheers,
R.
Re: Data structure advice
by bart (Canon) on Jan 05, 2004 at 19:40 UTC
I can think of two approaches:
- Binary search. For example, you could use a binary tree, or a B-tree, or anything similar, as the data structure. What it should do, if it doesn't find a key in the tree, is return the record for the next smaller value. Make sure the dates are easily sortable, for example by using the ISO-8601 combined date/time format: "20031012T132000". This setup allows for variable gaps between measurement times. I'm not sure which Perl modules are available for this purpose... Keywords to search for include tree, trie, heap.
- Since your times are always exactly 30 minutes apart, just convert each to an epoch time, subtract some starting value, divide by 30*60, and use the resulting value as an integer index into a normal array.
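The second approach can be sketched like this (the start time and pressure values are illustrative):

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

my $start = timegm(0, 0, 0, 12, 9, 2003);   # 2003-10-12 00:00:00 GMT
my $step  = 30 * 60;                        # readings every 30 minutes

# 48 slots per day; fill the ones we have (sample values).
my @pressure;
$pressure[32] = 1013.2;                     # slot 32 = 16:00
$pressure[33] = 1012.8;                     # slot 33 = 16:30

# An arbitrary time maps straight to an array index.
sub get_pressure {
    my ($epoch) = @_;
    my $idx = int(($epoch - $start) / $step);
    return $pressure[$idx];
}

my $t = timegm(6, 9, 16, 12, 9, 2003);      # 2003-10-12 16:09:06
print get_pressure($t), "\n";   # 1013.2
```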