Re: How to improve this data structure?
by davido (Cardinal) on May 21, 2013 at 14:50 UTC
Are the region numbers sequential, or discontiguous?
If sequential:
my @StatsArray;
push @{ $StatsArray[$RegionNum] }, {
    AR  => $AR[$RegionNum],
    BCR => $BCR[$RegionNum],
};
If sparse, use a hash:
my %Stats;
push @{ $Stats{$RegionNum} }, {
    ...
};
Then you could do this:
my @sorted = sort { $a->{AR} <=> $b->{AR} } @{ $StatsArray[$RegionNum] };
...for example.
It might be that you're getting to the point where a database would scale better, though. Another approach would be to build a binary tree, maintaining nodes in sorted order; that allows for relatively inexpensive inserts and searches. But it does sound like you may need the scalability of a DB approach.
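If you want to experiment with the keep-it-sorted idea without building a full tree, here is a minimal sketch that binary-searches a flat array for the insertion point; the insert_sorted name and the AR sort key are just illustrative assumptions:
use strict;
use warnings;

# Hypothetical helper: keeps @$aref sorted on the AR field by locating
# the insertion point with a binary search, then splicing the record in.
sub insert_sorted {
    my ( $aref, $record ) = @_;
    my ( $lo, $hi ) = ( 0, scalar @$aref );
    while ( $lo < $hi ) {
        my $mid = int( ( $lo + $hi ) / 2 );
        if ( $aref->[$mid]{AR} < $record->{AR} ) { $lo = $mid + 1 }
        else                                     { $hi = $mid }
    }
    splice @$aref, $lo, 0, $record;   # search is O(log n); the splice shift is O(n)
}
A real binary (or balanced) tree avoids the O(n) shift on insert, but for modest per-region sizes the flat-array version is often fast enough.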
Agreed on the database -- you also then have the advantage that you can ingest the numbers once, and then run whatever processing needs to be done on it, iterating as you go.
Most excellent: runtime reduction of 70%! BTW, there was a typo in the original post: it needs to be @{$StatsArray{$RegionNum}} (no square brackets). Region numbers are sequential. Definitely need to think about building a DB, though... thanks for the help.
If they're sequential, it should be @StatsArray, in which case @{$StatsArray[$RegionNum]} would be appropriate, and probably even a little faster, since array index lookups carry less constant-time overhead than hash lookups.
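If you're curious about the size of that constant-factor difference, here is a quick sketch using the core Benchmark module (the 1000-element sizes are arbitrary):
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Same data reachable by array index and by hash key.
my @array = ( 0 .. 999 );
my %hash  = map { $_ => $_ } 0 .. 999;

cmpthese( -1, {
    array_index => sub { my $x = $array[500] },
    hash_lookup => sub { my $x = $hash{500} },
} );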
Here's a sort of loose and dirty explanation of why you get such a good speedup here. Let's assume that your original @StatsArray had 1_000_000 entries, and that there are ten regions, each of which has 100_000 entries.
Your original approach was sorting 1_000_000 entries. Sort is an O(n log n) operation, so we can say that there were approximately 1M * log(1M) units of work going on.
The grep approach helps because grep is an O(n) operation. So you walk through the million item list one time, and pull out 100_000 entries. Then you sort the 100_000 entries. So you have 1M + ( 100K * log(100K) ) units of work, approximately.
My approach eliminates the need for the grep. So you do away with the "1M" units of work, and are left with 100K * log(100K) units of work.
This is really a rough approximation of what's going on, but fits fairly well, and I think should help to explain why you see such an improvement.
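To put rough numbers on it (taking log base 2 purely for illustration): the original sort is about 1M * 20 = 20M units of work, the grep approach about 1M + (100K * 17) ≈ 2.7M units, and the per-region structure about 100K * 17 ≈ 1.7M units. The real constants differ, but the ordering matches the speedups reported above.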
The database approach would still scale better, so that you don't have to rewrite the code when 1_000_000 entries becomes 100_000_000. ;)
Re: How to improve this data structure?
by choroba (Cardinal) on May 21, 2013 at 14:41 UTC
A simple fix which might help a bit is to grep the array before running sort on it:
for ( sort {
          $a->{RegionNum} <=> $b->{RegionNum}
              or
          $a->{AR} <=> $b->{AR}
      } grep { $_->{RegionNum} == $RegionNum } @StatsArray ) {
    $ARKey = $_->{AR};
    # do stuff, no need to check the RegionKey.
}
Untested.
Thanks a lot...this simple grep reduced runtimes by about 35%.
Re: How to improve this data structure?
by sundialsvc4 (Abbot) on May 21, 2013 at 15:38 UTC
++ on using a database
SQLite files are wonderful for this sort of thing ... there is no “server.” There are useful utilities for importing text files and so forth. The only gotcha, when you finally get down to programming that updates things, is a rather important one: use transactions. SQLite is specifically designed to commit data to disk, and to re-read the data to verify it, every single time, unless a transaction is in progress. (If one is, it buffers the data much more sensibly.) But, with that being said, it is an excellent and robust tool ... and free. (It is actually in the public domain.)
You’ve got millions of records to deal with, and you can’t be writing a Perl program every single time . . . You might find you need to get at this data in all sorts of ways – reports, spreadsheets, who-knows. SQLite can take you there.
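To illustrate the transactions caveat, here is a minimal sketch using DBI with DBD::SQLite; the stats.db filename, the table layout, and the @records variable are all assumptions for illustration:
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( 'dbi:SQLite:dbname=stats.db', '', '',
    { RaiseError => 1, AutoCommit => 1 } );

$dbh->do('CREATE TABLE IF NOT EXISTS stats (region INTEGER, ar REAL, bcr REAL)');

my $sth = $dbh->prepare('INSERT INTO stats (region, ar, bcr) VALUES (?, ?, ?)');

# The crucial part: wrap the bulk insert in one transaction so SQLite
# buffers the writes instead of committing each row to disk separately.
$dbh->begin_work;
$sth->execute( $_->{RegionNum}, $_->{AR}, $_->{BCR} ) for @records;   # @records assumed
$dbh->commit;
Without the begin_work/commit pair, each execute becomes its own transaction, and insert throughput drops dramatically.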
Yeah, more or less.
“A million records” is a volume that is “vaguely interesting” to SQLite ... it will take a few minutes’ one-time cost to import the data (and might not require a program), and a couple minutes more to add some indexes. From that point on, you can use any tool or combination of tools that has a DB interface (including Perl, of course ...) to get the job done, and now the query engine is the one doing all the heavy lifting.
So long as the “transactions” caveat is carefully adhered to, especially when doing updates, it’s really quite a remarkable piece of software engineering. (It’s rare when a piece of software genuinely blows me away. Perl/CPAN did it. So did this.)
I suspect that most of the things the O.P. is right now “writing programs to do” can probably be reduced to a query (and perhaps a now-trivial program to digest the results). Furthermore, a huge bonus is that you can put results into a different table, which of course is a self-describing data structure. (A single SQLite file can contain any number of tables.)
Re: How to improve this data structure?
by BrowserUk (Patriarch) on May 21, 2013 at 19:11 UTC
push @{ $statHash{ $RegionNum } }, { AR => $AR[$RegionNum], BCR => $BCR[$RegionNum], ... };
You only need to sort the subset of hashes in each region (and avoid the grep).
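For example, a minimal sketch of sorting one region's subset, assuming the %statHash populated above:
# Sort only the records for the region of interest; no grep over the
# full list is needed, because %statHash already partitions by region.
my @sorted = sort { $a->{AR} <=> $b->{AR} } @{ $statHash{$RegionNum} };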
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: How to improve this data structure?
by hdb (Monsignor) on May 21, 2013 at 14:47 UTC
From your code snippet it looks to me as if the AR (and BCR) entry is always the same given a $RegionNum. That can't be correct. Can you clarify?
Some pre-processing happens before @StatsArray is populated, so the AR and BCR values (and others) are always unique.
Re: How to improve this data structure?
by space_monk (Chaplain) on May 21, 2013 at 23:14 UTC
Everyone else has offered good advice on the array processing, but it may also be worth asking how you are reading the data from the file. At one million lines of data, improvements in transferring the data from file to memory may also cut your runtime.
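For instance, a minimal sketch of a lean read loop; the stats.txt filename and the whitespace-delimited three-field layout are guesses for illustration:
use strict;
use warnings;

my @StatsArray;
open my $fh, '<', 'stats.txt' or die "stats.txt: $!";
while ( my $line = <$fh> ) {
    chomp $line;
    # split on whitespace is cheaper than a capturing regex per line
    my ( $region, $ar, $bcr ) = split ' ', $line;
    push @{ $StatsArray[$region] }, { AR => $ar, BCR => $bcr };
}
close $fh;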
If you spot any bugs in my solutions, it's because I've deliberately left them in as an exercise for the reader! :-)
Re: How to improve this data structure?
by TomDLux (Vicar) on May 22, 2013 at 20:13 UTC
If you want to preserve the data, a database is definitely an interesting solution. I haven't tried using SQLite and similar tools when I have a transient need for data --- traditional extract / transform / load situations.
If you know the number of region numbers is significantly less than the number of records, you could use multiple StatsArrays, one for each region number. Since you don't know how many there will be, the obvious solution is to use an array of arrays ... the outer array for the region, the inner array for records in that region. I only wish your regions were numbered 0, 1, 2, 3, ... but that's stretching things a bit, so let's just use a hash that maps region number to array element.
I haven't run it, so there may be bugs. When a new RegionNum comes along, you detect it isn't in the hash, so you add a new entry with the next array slot, and provide an anonymous array for the records to go in. Then, when you need the records sorted, it is far easier to sort each subset with the same RegionNum. Sorting N subsets of M records each is about N * (M log M) work, which beats sorting all N*M records at once, (N*M) log(N*M).
package RegionRecords;

my %regionNums;     # maps region number -> slot in @all_regions
my @all_regions;    # each slot is an arrayref of records for one region
my $N = 0;

sub add {
    my ($record) = @_;
    if ( !exists $regionNums{ $record->{RegionNum} } ) {
        $regionNums{ $record->{RegionNum} } = $N;
        $all_regions[$N] = [];
        $N++;
    }
    push @{ $all_regions[ $regionNums{ $record->{RegionNum} } ] }, $record;
}

sub allRecords {
    # flatten the per-region arrays back into one list
    return map { @$_ } @all_regions;
}

1;

package main;

RegionRecords::add( {
    RegionNum => $RegionNum,
    AR        => $AR[$RegionNum],
    BCR       => $BCR[$RegionNum],
} );
print RegionRecords::allRecords();
As Occam said: Entia non sunt multiplicanda praeter necessitatem.