Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I've inherited a set of scripts and, not being an expert in Perl, there's a particular chunk of code which I'm struggling to understand:

while (<IN>) {
    our(@F) = split(/\s+/, $_, 0);
    push @{$r{join ' ' x 8, @F[0..3]};}, [@F[4, 6]];
    sub END {
        foreach $k (keys %r) {
            my($x, $y);
            map {$x += $$_[0]; $y += $$_[1];} @{$r{$k};};
            my @g = split(/\s+/,$k);
            print OUT "$g[0]\t@g[1]\t@g[2]\t@g[3]\t", $x / scalar(@{$r{$k};}), "\t$y\n";
        }
    }
}

This takes an input file in this format:

1 111 C T 1 0 6
1 136 G A 1 0 6
1 136 G A 1 0 9
1 244 C CT 1 0 2
1 262 A G 1 0 2
1 268 A C 1 0 2
1 268 A C 1 0 4
1 270 C T 1 0 2

It finds all unique entries (based on the first four columns), averages the 5th column and sums the 7th (indices 4 and 6 of @F).

There are a few functions and pieces of syntax I'm unfamiliar with. Specifically, the use of "$$_" (a double $), exactly why 'map' is used here, and how this line:

map {$x += $$_[0]; $y += $$_[1];} @{$r{$k};};

is producing the desired result in general. If anybody could explain or clarify how this chunk of code works, that would be great, thanks!

Replies are listed 'Best First'.
Re: Code clarification - use of map and $$_
by Corion (Patriarch) on Aug 09, 2016 at 13:47 UTC

    In your case, it's not $$_, but $$_[...]. The $$... is dereferencing a reference.

    $$_[...] can be rewritten as $_->[ ... ], which might make the indexing of an array more obvious to you. See also References Quick Reference.
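
    A minimal sketch of that equivalence, using a made-up @pairs list (not from the original script) that mimics the array references stored in %r. All three spellings read the same element:

```perl
use strict;
use warnings;

# Hypothetical data: a list of array references, like the pairs stored in %r.
my @pairs = ( [1, 6], [1, 9] );

for (@pairs) {
    # All three expressions read element 0 of the array that $_ refers to.
    my @same = ( $$_[0], ${$_}[0], $_->[0] );
    print "@same | second element: $_->[1]\n";
}
```

    The arrow form is usually the easiest to read, because it makes clear that $_ holds a reference and [0] indexes the array behind it.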

      Collective huge thank you to everybody who contributed to this thread! Much to learn and read. Again: thanks!
      Thanks for that. Why exactly would dereferencing be used here? Why not direct access to the variable?

        Because the list returned by @{$r{$k}} is a list of array references. On each iteration of the map loop one element is passed in from the array, @{$r{$k}}, to $_. That element is a reference to an array. Thus, to act upon its contents, you dereference it.


        Dave

        > Why exactly would dereferencing be used here? Why not direct access to the variable?

        without digging too deep into this code ...

        map can only iterate over scalars.

        I.e. like a list of $array_refs, if you want to address different arrays ...

        > Why not direct access to the variable?

        if you mean something like @array as the "direct" variable, you CAN'T do something like

        map { $_[0]++ } (@a,@b,@c)

        to increment the first element of each array.
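
        A small sketch of that point, with made-up arrays @a, @b and @c (names taken from the line above, values invented): a plain list flattens the arrays into their scalars, whereas a list of references keeps each array addressable. A for loop is used here instead of map, since nothing is returned.

```perl
use strict;
use warnings;

# Hypothetical arrays, only for illustration.
my @a = (1, 2);
my @b = (10, 20);
my @c = (100, 200);

# (@a, @b, @c) flattens to six scalars; the array boundaries are gone.
my $flat_count = () = (@a, @b, @c);

# A list of references keeps each array addressable through $_.
$_->[0]++ for (\@a, \@b, \@c);

print "flattened elements: $flat_count\n";
print "first elements now: $a[0] $b[0] $c[0]\n";
```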

        The real problem with that code is that its author was too lazy to use a clear style.

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Je suis Charlie!

        I don't know. Maybe ask the original author of the script.

Re: Code clarification - use of map and $$_
by AnomalousMonk (Archbishop) on Aug 09, 2016 at 15:01 UTC
    while (<IN>) {
        our(@F) = split(/\s+/, $_, 0);
        push @{$r{join ' ' x 8, @F[0..3]};}, [@F[4, 6]];
        sub END {
            foreach $k (keys %r) {
                my($x, $y);
                map {$x += $$_[0]; $y += $$_[1];} @{$r{$k};};
                my @g = split(/\s+/,$k);
                print OUT "$g[0]\t@g[1]\t@g[2]\t@g[3]\t", $x / scalar(@{$r{$k};}), "\t$y\n";
            }
        }
    }

    Another odd thing to note about this code is the END block planted in the middle of it, written in the discouraged form of a sub definition (sub END). Please see the "BEGIN, UNITCHECK, CHECK, INIT and END" section in perlmod. Because all END blocks run at the end (!) of all other code, I think this chunk of code could more clearly and conventionally be written as:

    while (<IN>) {
        our(@F) = split(/\s+/, $_, 0);
        push @{$r{join ' ' x 8, @F[0..3]};}, [@F[4, 6]];
    }

    ... all other code ...

    END {
        foreach $k (keys %r) {
            my($x, $y);
            map {$x += $$_[0]; $y += $$_[1];} @{$r{$k};};
            my @g = split(/\s+/,$k);
            print OUT "$g[0]\t@g[1]\t@g[2]\t@g[3]\t", $x / scalar(@{$r{$k};}), "\t$y\n";
        }
    }
    Good luck.


    Give a man a fish:  <%-{-{-{-<

      ++AnomalousMonk: that helped (me, anyway; don't know about the OP). That was my first guess, but my initial experiments with the OP code didn't mesh. But I later saw that the print went to OUT, and forgot to redo a test to STDOUT. Thus I started thinking that it was a sub named END in a bout of horrible style. But I think you're right, it's an END block.

      However, with this simplified code, I still only see the FIRST instance of the END block doing anything:

      use strict;
      use warnings;
      $, = ",";
      $\ = "\n";
      $" = ";";

      my %r = ( 0 => 0, 1 => 0, 2 => 0, 3 => 0 );

      foreach (1..10) {
          my $x = $_ % 4;
          ++$r{$x};

          sub END { print __LINE__, "sub END block", $x, $r{$x} };
      }

      foreach(1 .. 10) { my $x = $_ % 4; print "$_ => $x => $r{$x}"};
      print __LINE__, "END OF SCRIPT";
      __END__
      __OUTPUT__
      1 => 1 => 3
      2 => 2 => 3
      3 => 3 => 2
      4 => 0 => 2
      5 => 1 => 3
      6 => 2 => 3
      7 => 3 => 2
      8 => 0 => 2
      9 => 1 => 3
      10 => 2 => 3
      17,END OF SCRIPT
      13,sub END block,1,3
        ... I still only see the FIRST instance of the END block doing anything ...

        From this I think you've already gotten the point, but anyway... There is only one instance of any given  END block in a program:

        c:\@Work\Perl\monks>perl -wMstrict -le
        "for my $str (qw(one two three)) {
             print qq{in for loop: '$str'};
             END { print 'END block ONE'; }
             END { print 'END block TWO'; }
             END { print 'END block THREE'; }
         }
        "
        in for loop: 'one'
        in for loop: 'two'
        in for loop: 'three'
        END block THREE
        END block TWO
        END block ONE


        Give a man a fish:  <%-{-{-{-<

        Oops, re-reading AnomalousMonk's post and the OP, I was reminded that there was a loop inside the END block, and that does what was intended.

        use strict;
        use warnings;
        $, = ",";
        $\ = "\n";
        $" = ";";

        my %r = ( 0 => 0, 1 => 0, 2 => 0, 3 => 0 );

        foreach (1..10) {
            my $x = $_ % 4;
            ++$r{$x};

            sub END {
                $, = " => ";
                foreach my $k ( keys %r ) {
                    print __LINE__, "sub END block with k", $k, $r{$k}
                }
            };
        }

        foreach(1 .. 10) { my $x = $_ % 4; print "$_ => $x => $r{$x}"};
        print __LINE__, "END OF SCRIPT";
        __END__
        __OUTPUT__
        1 => 1 => 3
        2 => 2 => 3
        3 => 3 => 2
        4 => 0 => 2
        5 => 1 => 3
        6 => 2 => 3
        7 => 3 => 2
        8 => 0 => 2
        9 => 1 => 3
        10 => 2 => 3
        22,END OF SCRIPT
        16 => sub END block with k => 0 => 2
        16 => sub END block with k => 1 => 3
        16 => sub END block with k => 3 => 2
        16 => sub END block with k => 2 => 3
Re: Code clarification - use of map and $$_
by pryrt (Abbot) on Aug 09, 2016 at 15:00 UTC

    That's got some strange notation -- lots of semicolons where they are technically allowed, but I've never seen anybody use them there. And trying to make lots of subs called END, one per line of the input file, is just unfathomable. (I tried a quick test where I tried to make multiple named subs inside a loop like that, and call them both inside and outside the loop; it didn't do anything that makes sense to me.) Also, that sub END is never called, at least in your snippet.

    However, when I just removed sub END from before that block, so that it would just execute the block, and set *OUT = *STDOUT, I was able to better see what was going on. To help figure things out, I also added some print statements before and inside map's block:

    push @{$r{join ' ' x 8, @F[0..3]};}, [@F[4, 6]];
    {
        foreach my $k (keys %r) {
            my($x, $y);
            print "\$r{\$k};", $r{$k};
            print "\@{\$r{\$k};}", @{$r{$k};};
            map {
                $x += $$_[0]; $y += $$_[1];
                print "\$_='$_'", "SS_[0]=$$_[0]", "SS_[1]=$$_[1]", "x=$x", "y=$y";
            } @{$r{$k};};
            my @g = split(/\s+/,$k);
            print OUT "$g[0]\t@g[1]\t@g[2]\t@g[3]\t", $x / scalar(@{$r{$k};}), "\t$y\n";
        }
    }

    From what I can tell, you've got a hash %r, with keys made by joining the first four columns. Each element of that hash is an array ref; the array contained within holds array-refs to the col4,col6 pairs. Thus, the map line says: For a given key $k, get the array behind the array ref for that element (@{ $r{$k} }). The map says, for each element in that array (so, for each array ref that points to the col4,col6 pairs), which map's block will refer to as $_, run the block. The block says to add the col4 value (which is the first element of the array referenced by $_) to $x and add the col6 value (which is the second element of the array referenced by $_) to $y.

    That's as confusing as mud, I'm sure. When $k refers to the second '1 136 G A' line:

    $k;          # == "1 136 G A" with more spaces
    $r{$k};      # == [ [1,6], [1,9] ] == reference to an array of array-refs
    @{$r{$k}};   # == ( [1,6], [1,9] ) == array of array-refs
    map {} @     # for each element in the @ array, run the {} block
    # First element of @ is the first ref to a pair-array:
    $_;          # = [1,6] == array ref
    @$_;         # = (1,6) == array
    $$_[0];      # = 1 == first element of array (1,6) or arrayref [1,6]
    $$_[1];      # = 6
    # Second element of @ is the next ref to a pair-array
    $_;          # = [1,9]
    $$_[0];      # 1
    $$_[1];      # 9

    But there are lots of other oddities. I believe the END sub will only get the first definition, so I believe it can only ever print out the '1 111 C T' results, which isn't overly helpful. And I never see it called. And using @g[1] is pointless, and should be $g[1], because it's a single element of an array, so you don't need it to be an array slice.

    As Corion said, if possible, ask the original author. Otherwise, I hope these hints have helped.

Re: Code clarification - use of map and $$_
by pryrt (Abbot) on Aug 09, 2016 at 16:28 UTC

    There is much good advice and learning throughout the thread... to sum up what my recommendations would be

    • add comments as you learn things, so when you go to support it a year down the road, and have forgotten everything you learned in this thread, you'll at least have comments to guide you. (also, add a comment to link to this thread. :-) )
    • As suggested in another reply (Re: Code clarification - use of map and $$_), switch from map to a for or foreach loop; additionally, I'd recommend using a meaningful loop variable name (I'd call it $pair_ref or similar, to indicate it's a reference to a pair of something) rather than the default $_
    • instead of using the $$_[0] notation, or even the more obviously meaningful $_->[0], for accessing the elements of the pair, I'd probably change to assigning the individual elements to meaningful variables:
      foreach my $pair_ref ( @{ $r{$k} } ) {   # for each [col4,col6] pair that matched on the four-column key
          my ($col4, $col6) = @$pair_ref;      # get the (col4,col6) values
          $x += $col4;                         # x is the summation of all the matching col4 values
          $y += $col6;                         # similar for y
      }
    • As suggested in another reply (Re: Code clarification - use of map and $$_), move the END block out of the while(<IN>) loop
    • Fix the print statement in the END block to not use @g[1], since that's a 1-element slice, and is better written as $g[1]. You might want to further look into the perlvar $, and $" variables for automatically joining within your print statement (or use a manual join function if you want to make the joining explicit [1]) rather than manually placing tabs between each element of the @g array, $x/..., and $y values.

    [1] I know for many non-expert Perl coders, the explicit join is more natural and possibly easier to remember in the future than the magic variables; personally, despite having hacked Perl for ... eek, two decades now! -- it wasn't until I started really frequenting PerlMonks a few months back that I actually understood what the $, and $" variables do, and started using them.
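
    Putting those recommendations together, here is one possible sketch of the whole chunk. It is not the original author's code: the sub names group_rows and summarize, the in-memory input, the single-space key separator (the original joins with eight spaces), the skipping of short lines, and the sorted key order are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Sample rows copied from the question; a real script would open files instead.
my $input = <<'ROWS';
1 111 C T 1 0 6
1 136 G A 1 0 6
1 136 G A 1 0 9
1 268 A C 1 0 2
1 268 A C 1 0 4
ROWS

# Group [col4, col6] pairs by the first four columns.
sub group_rows {
    my ($fh) = @_;
    my %pairs_for;
    while ( my $line = <$fh> ) {
        my @F = split ' ', $line;
        next unless @F >= 7;    # skip blank or short lines (an added assumption)
        push @{ $pairs_for{ join ' ', @F[0..3] } }, [ @F[4, 6] ];
    }
    return \%pairs_for;
}

# Return (mean of col4, sum of col6) for one group.
sub summarize {
    my ($pairs) = @_;
    my ( $x, $y ) = ( 0, 0 );
    for my $pair_ref (@$pairs) {
        my ( $col4, $col6 ) = @$pair_ref;
        $x += $col4;
        $y += $col6;
    }
    return ( $x / @$pairs, $y );
}

open my $in, '<', \$input or die "Cannot open in-memory input: $!";
my $pairs_for = group_rows($in);
close $in;

# Sorting the keys gives a stable output order; the original used hash order.
for my $key ( sort keys %$pairs_for ) {
    my ( $mean, $sum ) = summarize( $pairs_for->{$key} );
    print join( "\t", split( ' ', $key ), $mean, $sum ), "\n";
}
```

    With these sample rows, the group "1 136 G A" has the pairs [1,6] and [1,9], so its mean is 1 and its sum is 15. No END block is needed here, because the summary loop simply runs after the input has been read.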

Re: Code clarification - use of map and $$_
by Anonymous Monk on Aug 09, 2016 at 14:57 UTC

    The code

    map {$x += $$_[0]; $y += $$_[1];} @{$r{$k};};
    is better written as:
    for (@{ $r{$k} })   { $x += $_->[0]; $y += $_->[1]; }
    (Using map statement in void context in lieu of for is confusing.)

    The dereferences are used because the structure is populated with array references. In other words, you have a hash of arrays (HoA). See perldata, perldsc.

    Specifically, look at the matching statements:

    push @{ $r{ ... } }, [ ... ];
    ...
    for (@{ $r{ ... } }) { ... $_->[...] }
    The first populates the HoA with necessary data. The second one uses this data further down. Why HoA, why the need for array [ constructors ] and dereferences? Because one scalar was not enough. The original programmer needed to track two values and used the small arrays as tuples.
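
    A minimal sketch of that populate-then-read pattern, using two sample rows from the question. The single-space key and the variable names are illustrative assumptions; the original joins the key with eight spaces.

```perl
use strict;
use warnings;

my %r;

# Two sample rows from the question; columns are whitespace-separated.
for my $line ("1 136 G A 1 0 6", "1 136 G A 1 0 9") {
    my @F   = split ' ', $line;
    my $key = join ' ', @F[0..3];
    # Autovivification: $r{$key} becomes an array ref on the first push.
    push @{ $r{$key} }, [ @F[4, 6] ];
}

for my $key (sort keys %r) {
    # map in list context: each tuple becomes one formatted string.
    my @tuples = map { "[$_->[0],$_->[1]]" } @{ $r{$key} };
    print "$key => @tuples\n";
}
```

    Here map is used for its return value, which is the situation it is meant for, in contrast to the void-context map in the original.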

Re: Code clarification - use of map and $$_
by perldigious (Priest) on Aug 09, 2016 at 16:48 UTC

    Inherited code... Hmm, I believe I also see a bareword filehandle there in the print line.

    print OUT "$g[0]\t@g[1]\t@g[2]\t@g[3]\t", $x / scalar(@{$r{$k};}), "\t$y\n";

    *hisses like a vampire who just suddenly had sunlight cast on him and scurries away rapidly*

    UPDATE:

    With <IN> as well. Bareword vs. Indirect Filehandle
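
    For comparison, a sketch of the same kind of read and print using lexical filehandles and three-argument open. The in-memory strings stand in for real files and are an assumption made to keep the example self-contained; $in and $out are illustrative names.

```perl
use strict;
use warnings;

# In-memory "files" keep the sketch self-contained; real code would pass file names.
my $input  = "1 111 C T 1 0 6\n";
my $output = '';

open my $in,  '<', \$input  or die "Cannot open input: $!";
open my $out, '>', \$output or die "Cannot open output: $!";

while ( my $line = <$in> ) {
    print {$out} $line;    # braces make the filehandle argument explicit
}

close $in;
close $out or die "Cannot close output: $!";

print $output;
```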

    I love it when things get difficult; after all, difficult pays the mortgage. - Dr. Keith Whites
    I hate it when things get difficult, so I'll just sell my house and rent cheap instead. - perldigious