in reply to How to improve this data structure?

Are the region numbers sequential, or discontiguous?

If sequential:

    my @StatsArray;
    push @{ $StatsArray[$RegionNum] }, {
        AR  => $AR[$RegionNum],
        BCR => $BCR[$RegionNum],
    };

If sparse, this:

    my %Stats;
    push @{ $Stats{$RegionNum} }, { ... };

Then you could do this:

    my @sorted = sort { $a->{AR} <=> $b->{AR} } @{ $StatsArray[$RegionNum] };

...for example.
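
Putting the sparse version together end to end, for instance (the sample rows below are invented, just to make the sketch self-contained and runnable):

    use strict;
    use warnings;

    my %Stats;

    # Invented sample data: [ region, AR, BCR ] triples.
    my @rows = ( [ 7, 0.42, 11 ], [ 3, 0.10, 52 ], [ 7, 0.05, 23 ] );

    for my $row (@rows) {
        my ( $region, $ar, $bcr ) = @$row;
        push @{ $Stats{$region} }, { AR => $ar, BCR => $bcr };
    }

    for my $region ( sort { $a <=> $b } keys %Stats ) {
        my @sorted = sort { $a->{AR} <=> $b->{AR} } @{ $Stats{$region} };
        print "region $region: ", join( ', ', map { $_->{AR} } @sorted ), "\n";
    }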

It might be that you're getting to the point where a database would scale better, though. Another approach would be to build a binary tree, maintaining nodes in sorted order, which allows for relatively inexpensive inserts and searches (a rough sketch of that idea is below). But it does sound like you might need the scalability of a DB approach.
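
Here's a minimal sketch of that sorted-insert idea, using a sorted array plus binary search rather than an actual tree (Perl has no built-in tree type, and insert_sorted is a name I made up for this example):

    # Sketch only: keeps each region's list ordered by AR as records arrive.
    # insert_sorted is a hypothetical helper, not from the code above.
    sub insert_sorted {
        my ( $list, $record ) = @_;
        my ( $lo, $hi ) = ( 0, scalar @$list );
        while ( $lo < $hi ) {
            my $mid = int( ( $lo + $hi ) / 2 );
            if ( $list->[$mid]{AR} < $record->{AR} ) { $lo = $mid + 1 }
            else                                     { $hi = $mid }
        }
        splice @$list, $lo, 0, $record;    # O(log n) search, O(n) shift
    }

    insert_sorted( $Stats{$RegionNum} //= [],
        { AR => $AR[$RegionNum], BCR => $BCR[$RegionNum] } );

The splice still shifts memory on insert, so a real tree (or the DB) wins once the lists get big; this just avoids re-sorting on every query.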


Dave

Re^2: How to improve this data structure?
by jhourcle (Prior) on May 21, 2013 at 14:56 UTC
    Agreed on the database -- you also then have the advantage that you can ingest the numbers once, and then run whatever processing needs to be done on it, iterating as you go.
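
    One possible shape for that, as a sketch only -- this assumes DBD::SQLite is available, the table and column names are invented, and @rows and $region stand in for whatever your parse loop produces:

        use DBI;

        my $dbh = DBI->connect( 'dbi:SQLite:dbname=stats.db', '', '',
            { RaiseError => 1, AutoCommit => 0 } );

        $dbh->do('CREATE TABLE IF NOT EXISTS stats (region INTEGER, ar REAL, bcr REAL)');

        # Ingest once; @rows holds [ region, ar, bcr ] triples parsed from the file.
        my $ins = $dbh->prepare('INSERT INTO stats (region, ar, bcr) VALUES (?, ?, ?)');
        $ins->execute(@$_) for @rows;
        $dbh->commit;

        # Then let the database do the per-region sorting:
        my $sorted = $dbh->selectall_arrayref(
            'SELECT ar, bcr FROM stats WHERE region = ? ORDER BY ar',
            {}, $region,
        );
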
Re^2: How to improve this data structure?
by fiddler42 (Beadle) on May 21, 2013 at 21:11 UTC
    Most excellent: runtime reduction of 70%! BTW, there was a typo in the original post. Need...

    @{$StatsArray{$RegionNum}}

    ...no square brackets. Region numbers are sequential. Definitely need to think about building a DB, though...thanks for the help.

      If they're sequential, it should be @StatsArray, in which case @{$StatsArray[$RegionNumber]} would be appropriate, and probably even a little faster, since an array index lookup has a smaller constant-time cost than a hash lookup.
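
      A quick, unscientific way to check that difference on your own machine (Benchmark is a core module; the sizes here are arbitrary):

          use Benchmark qw(cmpthese);

          my @a = ( 0 .. 9999 );
          my %h = map { $_ => $_ } 0 .. 9999;

          cmpthese( -2, {
              array => sub { my $x; $x = $a[5000] for 1 .. 1000 },
              hash  => sub { my $x; $x = $h{5000} for 1 .. 1000 },
          } );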

      Here's a quick-and-dirty explanation of why you get such a good speedup here. Let's assume that your original @StatsArray had 1_000_000 entries, and that there are ten regions, each of which has 100_000 entries.

      Your original approach was sorting 1_000_000 entries. Sort is an O(n log n) operation, so we can say that there were approximately 1M * log(1M) units of work going on.

      The grep approach helps because grep is an O(n) operation. So you walk through the million item list one time, and pull out 100_000 entries. Then you sort the 100_000 entries. So you have 1M + ( 100K * log(100K) ) units of work, approximately.

      My approach eliminates the need for the grep. So you do away with the "1M" units of work, and are left with 100K * log(100K) units of work.

      This is really a rough approximation of what's going on, but fits fairly well, and I think should help to explain why you see such an improvement.
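
      Plugging rough numbers in (log base 2, so log(1M) is about 20 and log(100K) is about 17):

          my $lg = sub { log( $_[0] ) / log(2) };
          my ( $n, $k ) = ( 1_000_000, 100_000 );

          printf "sort everything: %.1fM units\n", $n * $lg->($n) / 1e6;            # ~19.9M
          printf "grep, then sort: %.1fM units\n", ( $n + $k * $lg->($k) ) / 1e6;   # ~2.7M
          printf "bucket, no grep: %.1fM units\n", $k * $lg->($k) / 1e6;            # ~1.7M

      The per-unit constants differ between grep and sort, so these don't map directly onto wall-clock percentages, but the ordering is the point.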

      The database approach would still scale better, so that you don't have to rewrite the code when 1_000_000 entries becomes 100_000_000. ;)


      Dave

        My apologies: when the data file is parsed and the @StatsArray is populated, region numbers will be totally random. So the region numbers are sparse, hence @{$StatsArray{$RegionNumber}} works. (Each region is ultimately *processed* sequentially after I am done with the data file.)

        Thanks for the explanation, too. I have found hashes of arrays of hashes a little confusing in the past, but I finally have a good, functional example to leverage for future efforts.