The_Dj has asked for the wisdom of the Perl Monks concerning the following question:
Hi, Monks.
For reasons probably best described as DBI, I have data like this:
my %dat_by_id = (
    1 => { id => 1, sn => 'a', more => 'foo' },
    2 => { id => 2, sn => 'b', more => 'bar' },
);
You get the idea.
Now, for reasons best described as argh! I also do:
my %dat_by_sn = map { $_->{sn} => $_ } values %dat_by_id;
Sadly, I do need to reference entries by either field throughout the rest of the code.
Is there a better way? (My dataset runs to a few million records. Big but not too big.)
I don't (currently) need a 3rd, 4th, ..., nth field, but it's only a matter of time until someone adds a UID field, or something else to complicate my life.
Re: Too Many IDs
by kcott (Archbishop) on Jan 09, 2020 at 08:21 UTC
G'day The_Dj,
Although not stated, I'm assuming all id, sn, etc. values are unique.
If that's not the case, neither your current solution nor my alternative suggestion will work properly.
Instead of recreating the entire hash multiple times, consider just having a single hash with all the data
and then simple mappings of sn to id
(extending for future requirements).
Here's a quick example:
#!/usr/bin/env perl
use strict;
use warnings;
my %dat_by_id = (
1 => {id=> 1, sn => 'a', more => 'foo'},
2 => {id=> 2, sn => 'b', more => 'bar'},
);
my %map_sn_to_id = map +($_->{sn} => $_->{id}), values %dat_by_id;
print "SN for ID[1]: $dat_by_id{1}{sn}\n";
print "ID for SN[b]: $dat_by_id{$map_sn_to_id{b}}{id}\n";
print "MORE for SN[a]: $dat_by_id{$map_sn_to_id{a}}{more}\n";
# Subsequent requirements, e.g.
my %map_more_to_id = map +($_->{more} => $_->{id}), values %dat_by_id;
print "ID for MORE[foo]: $dat_by_id{$map_more_to_id{foo}}{id}\n";
print "SN for MORE[bar]: $dat_by_id{$map_more_to_id{bar}}{sn}\n";
Output:
SN for ID[1]: a
ID for SN[b]: 2
MORE for SN[a]: foo
ID for MORE[foo]: 1
SN for MORE[bar]: b
Having a single data source will reduce the chances of errors
and should make maintenance and debugging (if necessary) easier.
I see you've used "map BLOCK LIST" and I'm aware that's considered a Best Practice;
however, "map EXPR, LIST" is faster and may make a difference,
especially when you're dealing with millions of data elements.
Use Benchmark to test.
See map for more on these two forms
as well as an explanation of the unary plus, "map +(...", I used
(if you're unfamiliar with that syntax).
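A quick sketch of such a benchmark (the data shape and iteration count here are made up for illustration; real timings on your millions of records will differ):

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sample data shaped like the OP's %dat_by_id.
my %dat_by_id = map {
    $_ => { id => $_, sn => "sn$_", more => "more$_" }
} 1 .. 100;

# Both forms build the identical sn-to-id map; only the map syntax differs.
cmpthese(10_000, {
    'map BLOCK' => sub {
        my %m = map { $_->{sn} => $_->{id} } values %dat_by_id;
    },
    'map EXPR'  => sub {
        my %m = map +($_->{sn} => $_->{id}), values %dat_by_id;
    },
});
```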
I've only shown a barebones technique.
For production usage, I'd suggest setting up a series of functions,
e.g. get_id_for_sn($sn),
instead of having to continually hard-code an equivalent $dat_by_id{$map_sn_to_id{$sn}}{id}.
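A minimal sketch of what such accessors could look like, reusing the example data above (the sub names are illustrative, not a fixed API):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my %dat_by_id = (
    1 => { id => 1, sn => 'a', more => 'foo' },
    2 => { id => 2, sn => 'b', more => 'bar' },
);
my %map_sn_to_id = map +($_->{sn} => $_->{id}), values %dat_by_id;

# Thin accessors hide the double hash lookup from the rest of the code.
sub get_id_for_sn   { my ($sn) = @_; $map_sn_to_id{$sn} }
sub get_more_for_sn { my ($sn) = @_; $dat_by_id{ $map_sn_to_id{$sn} }{more} }

print get_id_for_sn('b'),   "\n";   # 2
print get_more_for_sn('a'), "\n";   # foo
```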
Hi, Ken.
Yeah, I should have mentioned that both sn and id are* unique keys.
And thanks for the benchmark pointers. I'll keep this in mind when I get to optimizing.
Sadly, I tend to lose most of my runtime to DB calls and some 3rd-party .exe's, but I will give map close scrutiny.
* Except one isn't really, but is treated as such for other reasons. My job can be 'fun'
If parts of your application (DB, EXEs, etc.) are taking seconds or milliseconds to run,
spending time optimising map to save a few micro- or nanoseconds is unlikely to be worth the effort.
Another consideration is whether this is a short-lived application that's run multiple times
or a long-lived application that's run with multiple iterations.
There are some JIT (just-in-time) possibilities that may be worth consideration.
If you end up with a lot of get_X_for_Y() subs,
and some are only called infrequently,
you can create mappings just when they're needed; something like this using
state (which requires v5.10):
{
    my %dat_by_id = ...;

    sub get_X_for_Y {
        my ($Y) = @_;
        state $map_Y_to_id = {
            map +($_->{Y} => $_->{id}), values %dat_by_id
        };
        return $dat_by_id{ $map_Y_to_id->{$Y} }{X};
    }

    sub get_X_for_Z { ... }
    sub get_V_for_W { ... }
    ...
}
Note that the anonymous block gives %dat_by_id lexical scope such that it is only visible to the subs;
while the subs themselves are visible to, and accessible from, the entire script.
This prevents inadvertent changes to %dat_by_id which could introduce bugs which are hard to track down.
Also note that the placement of the above code within your script would be important.
The assignment to %dat_by_id should occur before any calls to get_?_for_?() are made.
An alternative to this would be to use some combination of BEGIN, INIT, etc. blocks;
I don't know enough about your code to comment further on this; see
perlmod: BEGIN, UNITCHECK, CHECK, INIT and END
for more about these.
There would, no doubt, be other ways to handle this but, as previously stated,
I don't know enough about your code to offer further advice.
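As a concrete instance of the X/Y template above (field names filled in for illustration only), the state-cached map is built on the first call and reused afterwards:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use feature 'state';   # state requires v5.10

{
    my %dat_by_id = (
        1 => { id => 1, sn => 'a', more => 'foo' },
        2 => { id => 2, sn => 'b', more => 'bar' },
    );

    sub get_more_for_sn {
        my ($sn) = @_;
        # Built on the first call only, then cached for the program's lifetime.
        state $map_sn_to_id = {
            map +($_->{sn} => $_->{id}), values %dat_by_id
        };
        return $dat_by_id{ $map_sn_to_id->{$sn} }{more};
    }
}

print get_more_for_sn('b'), "\n";   # bar
```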
Re: Too Many IDs
by choroba (Cardinal) on Jan 09, 2020 at 08:08 UTC
Update: This answer isn't really helpful, sorry. Your solution stores the same references under the sn's, so it should occupy memory comparable to my solution.
If your concern is memory, you can store just the "sn to id" map instead of storing all the data twice.
my %sn_to_id = map { $dat_by_id{$_}{sn} => $_ } keys %dat_by_id;
You then need to nest the references to get the value:
my $sn = 'b';
my $more = $dat_by_id{ $sn_to_id{$sn} }{more};
Where is the benefit over the OP's code, where the values are just the identical hash refs and not flat copies?
Re: Too Many IDs
by LanX (Sage) on Jan 09, 2020 at 11:34 UTC
> Is there a better way?
Depends; there are certainly cases where it's better to keep the raw table in the DB and query by ID or SN individually via SQL.
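A sketch of that approach with DBI (the table name "records", its columns, and the in-memory SQLite stand-in for the real server are all assumed here):

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use DBI;

# In-memory SQLite stands in for the OP's remote database.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1 });
$dbh->do('CREATE TABLE records (id INTEGER PRIMARY KEY, sn TEXT UNIQUE, more TEXT)');
$dbh->do(q{INSERT INTO records VALUES (1, 'a', 'foo'), (2, 'b', 'bar')});

# Query by either key directly instead of keeping two hashes in memory.
my $by_id = $dbh->selectrow_hashref(
    'SELECT id, sn, more FROM records WHERE id = ?', undef, 2);
my $by_sn = $dbh->selectrow_hashref(
    'SELECT id, sn, more FROM records WHERE sn = ?', undef, 'a');

print "$by_id->{sn} $by_sn->{more}\n";   # b foo
```

With unique indexes on both id and sn, either lookup stays cheap and the database remains the single source of truth.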
"Depends" is about as good an answer as I could think of, too. (Whenever someone uses a comparative like "better", I usually insist on "better, for what?" if I'm going to spend time answering, and start thinking "is this an XY question?", etc.) I would say that I don't usually use selectall_hashref like that. Often I'd go with selectall_arrayref($sql, {Slice => {}}, @bind) so you iterate over the hashrefs directly to build the aggregations (hashes). And end up, like you say, splitting things up and doing subqueries. Depends.
Yep.
I will traverse all the data
The data lives on another server
Pulling it all at once is just faster*
I actually have both selectall_hashref and selectall_arrayref($sql, {Slice => {}}) in my code, each where it is best (I believe)
* I should probably benchmark that