Re^2: Best way to store/sum multiple-field records?
by bobdabuilda (Beadle) on Dec 23, 2014 at 00:35 UTC
Thanks GrandFather and toolic (I did reply to yours also, but no idea where the reply went... actually, now that I am about to post this, I suspect I may have only previewed my response to you and missed the "create" step).
Both of those look good, and I should be able to work my way through them to sort out where I went wrong and help him get what he's after. Very much appreciated.
Thanks also for the tips, GrandFather. One of the things I actually had in my vapourised response to toolic, was a query about the declaration of the variables, and the fact he'd moved them into the loop. I would have thought that declaring variables over and over would be less efficient than declaring them once at the start - that was my reasoning for declaring them before the loop, anyways...
Obviously, from your comments, that's not the case. Will remember that for future, thanks.
| [reply] |
Never code for "efficiency". Instead code for clarity and maintainability. Compared with almost anything else your code does, declaring variables takes no time at all. Even if it took a huge amount of time like 1/1000 of a second, that is still tiny compared to the time it takes to read a line from disk and process it. And even if it represented a large portion of the time for each loop iteration, unless you are processing thousands of lines that overhead just isn't noticeable. In practice the overhead is likely to be much less than 1 millionth of a second and nothing to worry about ever.
Just remember: premature optimization is the root of all evil.
Perl is the programming world's equivalent of English
| [reply] |
"I would have thought that declaring variables over and over would be less efficient than declaring them once at the start - that was my reasoning for declaring them before the loop, anyways..."
Not only is 'premature optimization the root of all evil', and not only is such a micro-optimization completely meaningless, but it's actually the other way around: in the benchmark below, declaring the variables inside the loop is quite a bit faster. I guess that is due to Perl's own optimizations...
use strict;
use warnings;
use Benchmark qw( cmpthese );
my @strings = qw(
    USERID1|2215|Jones|
    USERID1|1000|Jones|
    USERID3|1495|Dole|
    USERID2|2500|Francis|
    USERID2|1500|Francis|
);
cmpthese(
    1_000_000,
    {
        outside => sub {
            my ( $x, $y, $z );
            for (@strings) {
                ( $x, $y, $z ) = split /\|/;
            }
        },
        inside => sub {
            for (@strings) {
                my ( $x, $y, $z ) = split /\|/;
            }
        }
    }
);
result:
            Rate outside  inside
outside 109890/s      --    -38%
inside  176678/s     61%      --
| [reply] [d/l] [select] |
To speed up split, specify the number of elements:
($x, $y, $z) = split /\|/, $_, 3;
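A caveat worth checking before adopting this: per the split documentation, when the result is assigned to a list of variables and no limit is given, perl already behaves as if the limit were one more than the number of variables. So an explicit limit mostly changes what ends up in the last field. A minimal sketch, assuming lines with a trailing pipe like the sample data in this thread (the variable names are only illustrative):

```perl
use strict;
use warnings;

my $line = 'USERID1|2215|Jones|';

# No limit: trailing empty fields are dropped.
my @no_limit = split /\|/, $line;

# Limit 3: the third field keeps the rest of the string,
# including the trailing pipe.
my @limit_3 = split /\|/, $line, 3;

# Limit 4: the empty field after the final pipe is kept.
my @limit_4 = split /\|/, $line, 4;

for my $case ( [ 'no limit', \@no_limit ], [ 'limit 3', \@limit_3 ], [ 'limit 4', \@limit_4 ] ) {
    my ( $label, $fields ) = @$case;
    printf "%-8s %d fields, last is '%s'\n", $label, scalar @$fields, $fields->[-1];
}
```

With this kind of data, a limit of 3 leaves the trailing pipe attached to the name, so a limit of 4 (or no limit at all) is probably what you want.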
| [reply] [d/l] |
use strict;
use warnings;
use Benchmark qw( cmpthese );
my @strings = qw(
    USERID1|2215|Jones|
    USERID1|1000|Jones|
    USERID3|1495|Dole|
    USERID2|2500|Francis|
    USERID2|1500|Francis|
);
cmpthese(
    -1,
    {
        outside => sub {
            my ( $x, $y, $z );
            for (@strings) {
                ( $x, $y, $z ) = split /\|/;
            }
        },
        outside2 => sub {
            my ( $x, $y, $z );
            for (@strings) {
                ( $x, $y, $z ) = split /\|/, 3;
            }
        },
        inside => sub {
            for (@strings) {
                my ( $x, $y, $z ) = split /\|/;
            }
        },
        inside2 => sub {
            for (@strings) {
                my ( $x, $y, $z ) = split /\|/, 3;
            }
        },
    }
);
__END__
C:\test>junk
             Rate  outside   inside  inside2 outside2
outside   58201/s       --     -38%     -71%     -73%
inside    93659/s      61%       --     -53%     -57%
inside2  197610/s     240%     111%       --     -10%
outside2 218802/s     276%     134%      11%       --
When you can explain that; then you may pontificate on the subject.
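One thing to check when reading these numbers: split takes its arguments in the order PATTERN, EXPR, LIMIT, so `split /\|/, 3` treats 3 as the string to split rather than as a limit applied to `$_`. A small sketch of the difference (the array names are only illustrative):

```perl
use strict;
use warnings;

my ( @got_literal, @got_limited );

for ('USERID1|2215|Jones|') {
    # Two arguments: the second one is the string being split,
    # so this splits the one-character string "3".
    @got_literal = split /\|/, 3;

    # Three arguments: split $_ into at most three fields.
    @got_limited = split /\|/, $_, 3;
}

print "literal: @got_literal\n";
print "limited: @got_limited\n";
```

If that is what the "2" variants are doing, their rates measure splitting a string with no pipes in it at all, which would go a long way towards explaining why they come out so far ahead.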
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] |
Re^2: Best way to store/sum multiple-field records?
by bobdabuilda (Beadle) on Dec 23, 2014 at 00:48 UTC
Actually, I just noticed the "double up" of the names in the output - would that best be counteracted by another "defined" check along the lines of:
if (! defined $sum{$key}{reason})
{
    push @{$sum{$key}{reason}}, $reason;
}
| [reply] [d/l] |
Did you try it? Did it work?
How do you expect "Jones,Tom/Jones, Tom" to be handled?
Aside from the formatting issue, I'd be inclined to use a nested hash instead of an array if you want to suppress duplicates.
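For what it's worth, a rough sketch of the nested-hash idea, assuming the same pipe-delimited input as the rest of the thread (the `@lines` and `@report` names are made up for the example). Each distinct reason becomes a hash key, so repeats collapse on their own:

```perl
use strict;
use warnings;

my @lines = (
    'USERID1|2215|Jones,Tom|',
    'USERID1|1000|Jones, Tom|',
    'USERID2|2500|Francis, Pope|',
    'USERID2|1500|Francis, Pope|',
);

my %sum;
for my $line (@lines) {
    my ( $key, $value, $reason ) = split /\|/, $line;
    $sum{$key}{value} += $value;

    # Hash keys are unique, so a repeated reason is stored only once.
    $sum{$key}{reason}{$reason}++;
}

my @report;
for my $key ( sort keys %sum ) {
    my $reasons = join '/', sort keys %{ $sum{$key}{reason} };
    push @report, "$key|$sum{$key}{value}|$reasons";
}

print "$_\n" for @report;
```

Note that "Jones,Tom" and "Jones, Tom" still count as two different reasons here; only exact duplicates are suppressed.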
Perl is the programming world's equivalent of English
| [reply] |
I've had a look over both your and toolic's solutions, and decided to stick with yours, as I found dealing with the output of your version a lot easier - not too sure how to get the output from toolic's version into a nice pipe-delimited format. Had a poke around on Google and here, and couldn't find anything that my mind attached itself to as a reasonable solution.
So, after having a bit more of a play, this is what I've ended up with. Could you take a look, and see if you can see anything wrong with it, please?
#!/usr/bin/perl
use strict;
use warnings;
my %sum;
while (<DATA>) {
    chomp;
    my ($key, $value, $reason) = split(/\|/);
    if (! defined $reason || $value !~ /^\d+$/) {
        warn qq<dropped line: "$_"\n>;
        next;
    }
    $sum{$key}{value} += $value;
    if (! defined $sum{$key}{reason})
    {
        $sum{$key}{reason} = $reason;
    }
}
local $" = '/';
for my $key (keys %sum) {
    print "$key|$sum{$key}{value}|$sum{$key}{reason}\n";
}
__DATA__
USERID1|2215|Jones,Tom|
USERID1|1000|Jones, Tom|
USERID3|1495|Dole, Bob|
USERID2|2500|Francis, Pope|
USERID2|1500|Francis, Pope|
USERID4|0045|Doe, John|
USERID5|1225|Doe, Jane|
USERID4|4995|Doe, John|
USERID4|9995|Doe, John|
USERID6|1095|Darwin, Anita|
USERID7|1495|Dawson, Gary|
USERID6|1250|Darwin, Anita|
Prints:
USERID5|1225|Doe, Jane
USERID3|1495|Dole, Bob
USERID7|1495|Dawson, Gary
USERID1|3215|Jones,Tom
USERID4|15035|Doe, John
USERID2|4000|Francis, Pope
USERID6|2345|Darwin, Anita
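The order of the lines above comes from `keys %sum`, which is not a fixed order and can differ from run to run. If a stable order matters, one possible variation is sketched below. It assumes perl 5.10 or later for the `//=` operator, and the `@lines` and `@report` names are made up for the example; it keeps the first reason seen for each key, as the script above does, and sorts the keys before printing:

```perl
use strict;
use warnings;

my @lines = (
    'USERID4|0045|Doe, John|',
    'USERID1|2215|Jones,Tom|',
    'USERID4|4995|Doe, John|',
    'USERID1|1000|Jones, Tom|',
    'USERID4|9995|Doe, John|',
);

my %sum;
for my $line (@lines) {
    my ( $key, $value, $reason ) = split /\|/, $line;
    next if !defined $reason || $value !~ /^\d+$/;

    $sum{$key}{value} += $value;

    # Assign only if no reason has been stored yet (first one wins).
    $sum{$key}{reason} //= $reason;
}

# Sorting the keys gives the same output order on every run.
my @report = map { "$_|$sum{$_}{value}|$sum{$_}{reason}" } sort keys %sum;

print "$_\n" for @report;
```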
| [reply] [d/l] [select] |
I did, and it did. The main reason I was asking was in case there was something wrong with doing it that way - just because it seems to work doesn't mean it's "right", so I wanted to make sure I wasn't missing a problem...
As for your question about the handling of the different formatting of the names - that shouldn't be an issue, as the source of these data he's using is a database where the data will be consistent for each user ID.
| [reply] |