djacobow has asked for the wisdom of the Perl Monks concerning the following question:

First, this is not about the standard answer about inverting a hash using 'reverse' and how non-unique values clobber each other in the new hash. I'm writing about efficiency. I frequently find myself using code that looks like this:
sub invert {
    my $x = shift;
    my $y = {};
    foreach my $xk1 (keys %$x) {
        foreach my $xk2 (keys %{ $x->{$xk1} }) {
            $y->{$xk2}{$xk1} = $x->{$xk1}{$xk2};
        }
    }
    return $y;
}
Basically, I'm just rearranging a hash: for example, if I have a hash of 'days' of 'hours' of 'something', I may want to rearrange it into 'hours' of 'days' of 'something'. Is there a better/more Perly way to do this? A call to something C-compiled? Something that does not require foreaches to claw through every element in Perl. I'm doing this a lot, on some very large data structures. It's not particularly slow, but still... Also, I'm trying to minimize unnecessary copying. The code above at least only copies references, if that's what exists past the first two levels, but no copying at all would be best.

Replies are listed 'Best First'.
Re: Invert a hash... not a FAQ (I hope)
by tilly (Archbishop) on Jan 22, 2009 at 01:08 UTC
    If you care about performance, the following avoids some hash lookups and so should perform better:
    sub invert {
        my $x = shift;
        my $y = {};
        while (my ($key1, $value1) = each %$x) {
            while (my ($key2, $value2) = each %$value1) {
                $y->{$key2}{$key1} = $value2;
            }
        }
        return $y;
    }
    I don't think you can get much faster than this in pure Perl.
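    Whether the each()-based version actually wins on a given workload is easy to check with the core Benchmark module. Here is a minimal harness comparing the two implementations from this thread; the 100×100 test structure is an arbitrary assumption, so substitute something shaped like your real data:

    ```perl
    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    # Original version: nested loops over keys(), with repeated hash lookups.
    sub invert_keys {
        my $x = shift;
        my $y = {};
        for my $k1 (keys %$x) {
            for my $k2 (keys %{ $x->{$k1} }) {
                $y->{$k2}{$k1} = $x->{$k1}{$k2};
            }
        }
        return $y;
    }

    # each()-based version: fetches key and value in a single pass.
    sub invert_each {
        my $x = shift;
        my $y = {};
        while (my ($k1, $v1) = each %$x) {
            while (my ($k2, $v2) = each %$v1) {
                $y->{$k2}{$k1} = $v2;
            }
        }
        return $y;
    }

    # Arbitrary 100 x 100 nested hash to benchmark against.
    my %data;
    for my $a (1 .. 100) {
        $data{$a}{$_} = $a * $_ for 1 .. 100;
    }

    cmpthese(-1, {
        keys_version => sub { invert_keys(\%data) },
        each_version => sub { invert_each(\%data) },
    });
    ```
    
    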

    If you want to speed this up, save memory, etc, then you have to look into why you find the need to invert multi-level hashes regularly. Since you haven't shown an example of a design problem where you find yourself wanting to do that, I can't comment on whether your design is appropriate or whether there is an alternate way to tackle those problems that would work better.

    In short I can think of some situations where inverting a hash is absolutely the right thing to do. I can think of others where there is no real need to do it. Without more context, there is no point in my guessing which applies to you.

      Thanks! I'll give the snippet above a go.

      Here is an example of what I'm doing. I'm downloading very detailed information on the wholesale price of electricity. In my market, there is an individual price for each of several thousand nodes in the market, and there is a separate price for each hour, and in the "real-time" market, for every five minutes.

      The data comes in flat CSV files for each day that look like

      node,hour,interval,price
      bob,3,4,45.64
      ...

      I frequently need to generate reports from this data such as:

      • what is the price of energy at each hour averaged over all nodes (or weighted over nodes more heavily traded)?
      • what is the price of energy at each node, averaged over all hours (or weighted over the hours more heavily traded)?
      • what is the highest price observed in hour 16 over the last 5 weekdays?
      • what day had the greatest hourly variation among the following n nodes in the past three months?
      • various charts, graphs, histograms, maps, and statistical analyses, all generated across different "axes" of space and time.

      In general, what I do is load up the data I'm interested in and plop it into a hash of hashes of hashes ... that is organized most conveniently for the burning management question du jour, but then the next day, a different question would be easier to calculate with a different arrangement. Often, these analyses get accreted into a daily report generated by one script, but I'm trying to minimize the amount of shuffling going on.
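        For concreteness, the load-and-report approach described above might look like the sketch below: a loader that plops each day file into a node → hour → interval hash, plus one of the example reports (average price per hour across all nodes). This assumes simple unquoted CSV as in the sample; something like Text::CSV would be safer for real-world input:

        ```perl
        use strict;
        use warnings;

        # Load a day file into $data->{$node}{$hour}{$interval} = $price.
        sub load_prices {
            my ($fh) = @_;
            my %data;
            my $header = <$fh>;    # skip "node,hour,interval,price"
            while (my $line = <$fh>) {
                chomp $line;
                my ($node, $hour, $interval, $price) = split /,/, $line;
                $data{$node}{$hour}{$interval} = $price;
            }
            return \%data;
        }

        # One of the reports above: average price per hour over all nodes.
        sub avg_by_hour {
            my ($data) = @_;
            my (%sum, %count);
            for my $node (keys %$data) {
                for my $hour (keys %{ $data->{$node} }) {
                    for my $price (values %{ $data->{$node}{$hour} }) {
                        $sum{$hour} += $price;
                        $count{$hour}++;
                    }
                }
            }
            return { map { $_ => $sum{$_} / $count{$_} } keys %sum };
        }
        ```
        
        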

      Run-time is an issue, but size is usually a bigger problem.

      dave

        The alternate design that I would use is to dump the dataset in a database then query the database multiple times.

        At first that may feel restrictive. But with some experience you'll likely find that the SQL for a report is a lot shorter and less error-prone than attempting to produce that same dataset yourself in code.

        When I was faced with a similar situation, I built all the data structures I wanted concurrently as I read my input. That was before I knew how to use a database.

        Looking at your situation now, I'd say load any given data set into a database (PostgreSQL, SQLite, MySQL) with DBI and then query it for management's burning questions. If you're not familiar with SQL already, you might not see the advantage of this right away, but databases are already designed to do this kind of work on data too big to fit in memory.
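        As a sketch of what that looks like with DBI and SQLite (this requires DBD::SQLite; the table layout here is just an assumption based on the CSV sample above), the "average price per hour over all nodes" report collapses into one short query:

        ```perl
        use strict;
        use warnings;
        use DBI;

        # In-memory database for illustration; use a file path to persist.
        my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "",
                               { RaiseError => 1, AutoCommit => 1 });

        $dbh->do(q{
            CREATE TABLE prices (
                node     TEXT,
                hour     INTEGER,
                interval INTEGER,
                price    REAL
            )
        });

        my $ins = $dbh->prepare("INSERT INTO prices VALUES (?, ?, ?, ?)");
        $ins->execute(@$_) for (
            [ 'bob',   3, 4, 45.64 ],
            [ 'alice', 3, 4, 54.36 ],
        );

        # "What is the price of energy at each hour averaged over all nodes?"
        my $rows = $dbh->selectall_arrayref(
            "SELECT hour, AVG(price) FROM prices GROUP BY hour ORDER BY hour"
        );
        printf "hour %d: %.2f\n", @$_ for @$rows;
        ```

        The point is that each rearrangement of the hash becomes a different GROUP BY, and the database does the clawing for you, on disk if the data outgrows memory.
        
        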

Re: Invert a hash... not a FAQ (I hope)
by repellent (Priest) on Jan 22, 2009 at 03:03 UTC
    If you find yourself "rearranging" your "hash" many times over, you're treading into the realm of relational databases. That is the efficient solution. Many have offered advice.

    You may also want to look into what "Logic Programming" means, as a gentle intro into the fuss that is relational databases. Good read.
Re: Invert a hash... not a FAQ (I hope)
by scorpio17 (Canon) on Jan 22, 2009 at 14:43 UTC

    I also recommend using a database.

    I've done something similar myself. In my case, I have a cron job run once per hour, fetching new data from the source, and loading it into the database (any pre-processing that needs to be done can be handled at this point).

    For a while, I had cgi scripts generate reports on demand. But now I've started caching the "report pages". A cgi script runs, and if the report exists, it slurps it in and dumps it to the browser. If it does NOT exist, it creates it (so it will be there next time), then dumps it to the browser. This helps reduce the number of database hits when lots of people are looking at the same report.
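    The check-cache-or-generate step described above can be sketched like this (names and file layout are hypothetical; a real setup would also want cache invalidation when new data arrives):

    ```perl
    use strict;
    use warnings;

    # Return a report page, building and caching it on first request.
    sub serve_report {
        my ($cache_file, $generate) = @_;    # $generate builds the report text
        my $report;
        if (-e $cache_file) {
            # Cache hit: slurp the stored page.
            open my $fh, '<', $cache_file or die "read $cache_file: $!";
            local $/;                        # slurp mode
            $report = <$fh>;
            close $fh;
        }
        else {
            # Cache miss: build the report once and store it for next time.
            $report = $generate->();
            open my $fh, '>', $cache_file or die "write $cache_file: $!";
            print {$fh} $report;
            close $fh;
        }
        return $report;                      # caller dumps this to the browser
    }
    ```
    
    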

    Note that the output doesn't have to be HTML - I could also generate reports as PDF files, etc.