Re^2: Invert a hash... not a FAQ (I hope)

Thanks! I'll give the snipped above a go.

Here is an example of what I'm doing. I'm downloading very detailed information on the wholesale price of electricity. In my market, there is an individual price for each of several thousand nodes in the market, and there is a separate price for each hour, and in the "real-time" market, for every five minutes.

The data comes in flat CSV files for each day that look like

node,hour,interval,price
bob,3,4,45.64
...
[download]

I frequently need to generate reports from this data such as:

what is the price of energy at each hour averaged over all nodes (or weighted over nodes more heavily traded)?
what is the price of energy at each node, averaged over all hours (or weighted over the hours more heavily traded)?
what is the highest price observed in hour 16 over the last 5 weekdays?
what day had the greatest hourly variation among the following n nodes in the past three months?
various charts, graphs, histograms, maps, and statistical analyses, all generated across different "axes" of space and time.

In general, what I do is load up the data I'm interested in and plop it into a hash of hashes of hashes ... that is organized most conveniently for the burning management question du jour, but then the next day, a different question would be easier to calculate with a different arrangement. Often, these analyses get accreted into a daily report generated by one script, but I'm trying to minimize the amount of shuffling going on.

Run-time is an issue, but size is usually a bigger problem.

dave

Comment on Re^2: Invert a hash... not a FAQ (I hope) Download Code

Replies are listed 'Best First'.

Re^3: Invert a hash... not a FAQ (I hope)
by tilly (Archbishop) on Jan 22, 2009 at 01:43 UTC

At first that may feel restrictive. But with some experience you'll likely find that the SQL for a report is a lot shorter and less error-prone than attempting to produce that same dataset yourself in code.

[reply]

Re^4: Invert a hash... not a FAQ (I hope)

by djacobow (Initiate) on Jan 22, 2009 at 02:01 UTC

Yeah a database is an obvious solution that I am resisting just for pure stubbornness. (and because I am dreading the learning curve for DBI, SQL, etc)

I've also conjectured, too, that the "db that is the filesystem", indexed by date in my case, will be faster than a "real" database particularly considering that *usually* I can get all the information I need for a given subset of days in memory at once without a problem.

I'll be sad, though if after all the trouble, DBI ends up being slower.

[reply]

Re^5: Invert a hash... not a FAQ (I hope)

by tilly (Archbishop) on Jan 22, 2009 at 02:24 UTC

use DBI;
my $dbh = DBI->connect(
  "dbi:Pg:host=$host;database=$database",
  $user,
  $password,
  {
    AutoCommit => 0,
    RaiseError => 1,
  },
) or die "Can't connect: $DBI::errstr";

my $data = $dbh->selectall_arrayref(qq{
  SELECT MAX(price) as max_price
  FROM data_log
  WHERE price_date > now()::date - 5
    AND to_char(price_date, 'HH24') = '06'
}) or die "Cannot prepare: $DBI::errstr";

print $data->[0][0];
[download]

DBI

manual

[reply]
[d/l]

Re^5: Invert a hash... not a FAQ (I hope)

by Jenda (Abbot) on Jan 22, 2009 at 02:22 UTC

Two not so obvious advices.

1. batch the import. If the database you decide has some batch import tools able to handle your format, use them. Otherwise turn off AutoCommit when creating the DBI object and commit only once every thousand (ten thousands? ... that depends) records. This will speed the import up quite a bit.

2. make sure you define indexes on your tables. Not too few and not too many. If the database you choose allows you to see the "estimated execution plan" of the query generating the report, use that and make sure it doesn't use "table scans" on tables you only need a few rows from etc. Don't be afraid to play with this a bit, create an index, see what it does with the estimated execution plan and estimated price and see how long does it run ...

You may spend a lot of time with this at first, but as the management starts inventing more and more reports that they'd like, it will pay off.

Jenda
Support Denmark!
Defend the free world!

[reply]

Re^6: Invert a hash... not a FAQ (I hope)

by jhourcle (Prior) on Jan 22, 2009 at 15:10 UTC

Re^3: Invert a hash... not a FAQ (I hope)
by kyle (Abbot) on Jan 22, 2009 at 01:59 UTC

When faced with a similar situation, I'd build all the data structures I'd want concurrently as I was reading my input. That was before I knew how to use a database.

Looking at your situation now, I'd say load any given data set into a database (PostgreSQL, SQLite, MySQL) with DBI and then query it for management's burning questions. If you're not familiar with SQL already, you might not see the advantage of this right away, but databases are already designed to do this kind of work on data too big to fit in memory.

[reply]