anomilie has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to make a hash unique with the following code, is there a way I can optimise this subroutine? The hash has a record_entry identifier as key and the data items are stored in an array, it looks like this:

%comboHashRef = { record_entry <= [0,store variable 1,store variable 2 +, ... product ] record_entry <= [0,product variable1 ,product variab +le2 , ... product ] record_entry <= [1,product variable1 ,product variab +le2 , ... product ] ... }

Using the 2 store variables to specify a unique location, I want to get the unique products to write the whole array for that record_entry to an outfile and count the number of unique, duplicated and total products This needs to be done separately for products that are purchased (1) or not (0). Any thoughts? hidden in-build functions I'm not aware of?

sub uniqify { my $outoutfile = $_[0]; # index of where product is stored in array (can vary) my $pid = $_[1]; my %comboHashRef = %{$_[2]}; my %precords; my @dedup_purchase=(); my @dedup_notpurchase=(); my $product_key; ## loop through hash foreach my $key (keys %comboHashRef){ ## if purchased if ($comboHashRef{$key}->[0] == '1'){ ## create key based on store identifiables my $product_key = $comboHashRef{$key}->[1]."_".$comboHashRef +{$key}->[2]."0"; ## if produce key exists if ($product_key ~~ @dedup_purchase) { ## if product already seen if ($comboHashRef{$key}->[$pid] ~~ $precords{$product_ke +y}{"dedup"}){ $precords{$product_key}{"dup_count"}++; $precords{$product_key}{"total_count"}++; next; ## if product not already seen }else{ push(@{$precords{$product_key}{"dedup"}}, $comboHash +Ref{$key}->[$pid]); $precords{$product_key}{"dedup_count"}++; $precords{$product_key}{"total_count"}++; print $outfile $key ."\t" , (join("\t", @$_), "\n") +for $comboHashRef{$key}; next; } ## if product does not exists }else { push(@dedup_purchase, $product_key); ## create hash for tracking product push(@{$precords{$product_key}{"dedup"}}, $comboHashRef{ +$key}->[$pid]); $precords{$product_key}{"dedup_count"}++; $precords{$product_key}{"total_count"}++; print $outfile $key ."\t" , (join("\t", @$_), "\n") for +$comboHashRef{$key}; next; } ## if not purchased }if ($comboHashRef{$key}->[0] == '0'){ ## create key based on store identifiables my $product_key = $comboHashRef{$key}->[1]."_".$comboHashRef +{$key}->[2].'_-'; ## if key exists if ($product_key ~~ @dedup_notpurchase) { ## if product key aleady seen if ($comboHashRef{$key}->[$pid] ~~ $precords{$product_ke +y}{"dedup"}){ $precords{$product_key}{"dup_count"}++; $precords{$product_key}{"total_count"}++; next; }else{ ## append to hash for tracking product push(@{$precords{$product_key}{"dedup"}}, $comboHash +Ref{$key}->[$pid]); $precords{$product_key}{"dedup_count"}++; $precords{$product_key}{"total_count"}++; print $outfile $key ."\t" , (join("\t", @$_), "\n") +for $comboHashRef{$key}; next; } ## if product does not exists }else{ push(@dedup_notpurchase, $product_key); ## create hash for tracking product push(@{$precords{$product_key}{"dedup"}}, $comboHashRef{ +$key}->[$pid]); $precords{$product_key}{"dedup_count"}++; $precords{$product_key}{"total_count"}++; print $outfile $key ."\t" , (join("\t", @$_), "\n") for +$comboHashRef{$key}; next; } } } return %precords; }

Replies are listed 'Best First'.
Re: Optimise Sub, make hash unique based on values in array
by BrowserUk (Patriarch) on Aug 21, 2015 at 05:49 UTC
    is there a way I can optimise this subroutine?

    You'll need to describe what your subroutine does a whole lot more clearly before anyone will be able to offer any major algorithmic efficiency changes.

    For example: "Using the 2 store variables to specify a unique location, I want to get the unique products to write the whole array for that record_entry to an outfile and count the number of unique, duplicated and total products"; how does having "a unique location" allow you to "get the unique products"?

    In the mean time, there are some very obvious implementation efficiencies that you could make.

    1. You pass in a hash reference, but then duplicate the entire hash it refers to:
      my %comboHashRef = %{$_[2]};

      You should avoid doing that by accessing the referenced hash indirectly. Ie. Instead of:

      my %comboHashRef = %{$_[2]}; ... foreach my $key (keys %comboHashRef){

      Do:

      my $comboHashRef = $_[2]; ... foreach my $key (keys %{ $comboHashRef } ){

      Or even better, avoid the generation of a large list and do:

      my $comboHashRef = $_[2]; ... while( my( $key, $value ) = each %{ $comboHashRef } ){
    2. Ditto, once you've built up %precords;, you return it thus:
      return %precords;

      And what that does is:

      1. Takes the hash you've so carefully constructed.
      2. flattens it to a big list on the stack;
      3. then garbage collects the entire hash;
      4. returns to the caller, where the big list on the stack is then reconstructed back to a new hash;
      5. And then the big list you constructed on the stack has to be garbage collected.

      Whereas, if your simply returned a reference to the hash you built internally, you avoid all that deconstruction and reconstruction.

      By way of example of the difference this can make, here is a deceptively simple benchmark:

      sub yourWay{ my %h = %{ $_[0] }; return %h; };; sub myWay{ my $ref = shift; return $ref };; %hash = ( 1 .. 1e6 );; cmpthese -5,{ a=>q[ my %h = yourWay( \%hash ); ], b=>q[ my $h = myWay( \%hash ); ] };; s/iter a b a 1.12 -- -100% b 1.03e-006 108368213% --

      And yes, those two subroutines are functionally equivalent. And yes; that figure of 1 million times more efficient is real!

    No, those changes won't speed up your subroutine a million times; but they are changes worth making.

    Now, can you clarify your description of the processing your sub does with a concrete example or two? If you can, you might get some good responses that will make a real dent in your problem.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.
    I'm with torvalds on this Agile (and TDD) debunked I told'em LLVM was the way to go. But did they listen!
Re: Optimise Sub, make hash unique based on values in array
by roboticus (Chancellor) on Aug 21, 2015 at 11:56 UTC

    anomilie:

    You're definitely making things far too difficult for yourself. This isn't a comprehensive review, since I have to go to work soon, but

    The first thing I noticed is that your code is difficult to read. Several factors contribute to this:

    • Your indentation isn't consistent. Proper indentation makes it easier to read throughout your code and get a feel for how it's structured. I suggest you use an editor to help you keep your code formatted, or at least use perltidy to clean up the indentation whenever you start having trouble reading it.
    • You've placed 'next' statements in too many places. Normally 'next' is a special case, rather than the typical case, so an experienced programmer will have to read and reread your code multiple times to find what's happening. (I'll revisit this point in a moment.)
    • You're duplicating code too often, especially in both branches of if/then statements. (I'll revisit this one, too.)

    It turns out that the last two points are related. Any time you have code like this:

    if (some condition) { ... some code ... $precords{$product_key}{"total_count"}++; # BOTH BRANCHES ... more code ... next; # BOTH BRANCHES } else { ... some code ... $precords{$product_key}{"total_count"}++; # BOTH BRANCHES ... more code ... next; # BOTH BRANCHES }

    See those lines labelled 'BOTH BRANCHES'? Since it's the same code, and nothing else in the 'more code' chunks affect those lines in both branches, you can move those statements to a more convenient location, like this:

    $precords{$product_key}{"total_count"}++; if (some condition) { ... some code ... ... more code ... } else { ... some code ... ... more code ... } next;

    I moved the 'next' statement until after the if/then statement, since we need to be sure that we execute the code. I put the other statement before the if/then statement because nothing in the if/then statement used that value.

    You want to avoid duplicating code, because it makes the code harder to read:

    • You use branching because you want to do different actions. Putting identical bits of code in both branches is confusing.
    • It makes your code longer.
    So when maintaining the code, you're making the programmer have to read more code, try to figure out why the same statements are being used in multiple places, all for no real benefit. Another problem with duplicated code is that if you have a "quick fix" you want to make and you make the fix, it may have only been a partial fix, as you have the *same* code duplicated in other places that need the change. By avoiding duplication, you avoid the chance of making a partial fix.

    Now let's take a look at your 'next' statements. Generally, they're used for special cases, but in your code, they're used so often that anyone making a change to your code will have to read and re-read the code frequently to see what it's doing. It can make the code pretty difficult to read.

    By repeatedly moving code duplicated in both branches of your if/then statements, your code boils down to something like this:

    for { if (value == '1') { ... do stuff ... next; } if (value == '0') { ... do stuff ... next; } ... do nothing ... }

    And since the value isn't changing in any of the code in your loop, the next statements don't do anything at all, so you could simplify it a little more to:

    for { if (value == '1') { ... do stuff ... } elsif (value == '0') { ... do stuff ... } }

    Note: I changed the if to elsif to let the programmer know that you can't execute both conditional blocks. In your original code, if you changed something in the first block to turn value to '0', then suddenly your program would start always executing the code in the second if statement, too!

    Finally, using == with strings is a recipe for bugs. If you want to compare numbers use ==, for strings use 'eq'. So statements like:

    if ($comboHashRef{$key}->[0] == '0')

    should be changed to either:

    if ($comboHashRef{$key}->[0] == 0)

    if you want to compare the value as a number, or:

    if ($comboHashRef{$key}->[0] eq '0')

    if you want to compare it as a string. If you're reading the records as strings from a file and parsing it, I'd suggest comparing it as a string.

    I wanted to go over some more stuff, like comments and variable names, but I ran out of time. Before I started this post, I noticed that others were already offering some comments, so I'll leave further comments to others...

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: Optimise Sub, make hash unique based on values in array
by poj (Abbot) on Aug 21, 2015 at 07:19 UTC

    My best guess at what you are doing

    #!perl use strict; use Data::Dump 'pp'; my %combo = ( 'ID1' => [0,'sv1','sv2','pv1','pv2'], 'ID2' => [1,'sv1','sv2','pv1','pv2'], 'ID3' => [0,'sv1','sv3','pv1','pv2'], 'ID4' => [1,'sv1','sv2','pv1','pv2'], 'ID5' => [1,'sv1','sv2','pv1','pv2'], ); open my $fh,'>','output.txt' or die "$!"; my $count = uniqify($fh,3,\%combo); pp $count; sub uniqify { my ($fh,$pid_ix,$href) = @_; my %count=(); foreach my $key (keys %$href){ my ($purch,$sv1,$sv2,$pid) = @{$href->{$key}}[0..2,$pid_ix]; my $prodID = join '_',$sv1,$sv2; if ($purch){ $prodID .= '0'; } else { $prodID .= '_-'; } # counts if (exists $count{$prodID}{'dedup'}{$pid}){ ++$count{$prodID}{'dup_count'}; } else { ++$count{$prodID}{'dedup_count'}; $count{$prodID}{'dedup'}{$pid} = 1; print $fh (join "\t",$key,@{$href->{$key}})."\n"; } ++$count{$prodID}{'total_count'}; } return \%count; }
    poj