in reply to Merging partially duplicate lines

..or with a one-liner that is a bit complicated in the END block, but not so hard to read once deparsed:

perl -F"\s+" -ane "push @{$r{join (' 'x8,@F[0..3]) }}, [@F[4,5]]; END +{foreach $k(keys %r){my($x,$y);map {$x+=$$_[0];$y+=$$_[1]} @{$r{$k}}; +print qq($k\t),($x/scalar @{$r{$k}}),qq(\t$y\n)}}" uno.txt due.txt I 33 C C 0.75 4 I 21 B A 1 12 I 40 D D 1 7 I 56 A E 1 2 I 9 A B 0.275 14

which deparsed becomes

perl -MO=Deparse -F"\s+" -ane "push @{$r{join (' 'x8,@F[0..3]) }}, [@F[4,5]]; END{foreach $k(keys %r){my($x,$y);map {$x+=$$_[0];$y+=$$_[1]} @{$r{$k}};print qq($k\t),($x/scalar @{$r{$k}}),qq(\t$y\n)}}" uno.txt due.txt
LINE: while (defined($_ = <ARGV>)) {
    our(@F) = split(/\s+/, $_, 0);
    push @{$r{join ' ' x 8, @F[0..3]};}, [@F[4, 5]];
    sub END {
        foreach $k (keys %r) {
            my($x, $y);
            map {$x += $$_[0]; $y += $$_[1];} @{$r{$k};};
            print "$k\t", $x / scalar(@{$r{$k};}), "\t$y\n";
        }
    }
    ;
}
-e syntax OK

L*

PS: removed the unused Data::Dump.

There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.

Re^2: Merging partially duplicate lines -- oneliner deparsed
by K_Edw (Beadle) on Jan 30, 2016 at 21:38 UTC
    Thanks. This works well, although it is quite difficult for me to read. For example, how and where in the script does the averaging of column 4 take place?
      You are welcome K_Edw, and.. sorry, I was in a hurry before dinner..

      The heart of the code is the creation of the needed data structure with

      push @{$r{join (' 'x8,@F[0..3]) }}, [@F[4,5]];
      We create the key of the hash %r as the stringified join of fields 0..3 of the autosplit @F array (see -F"\s+" and -a in perlrun). This gives us the uniqueness of the first four fields, used as a key. The value of that key is treated as an array, and into this array is pushed another, anonymous array, [@F[4, 5]], containing the last two fields. One such array is pushed every time the key is found again across the files read.
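      As a minimal standalone sketch of the same construction (the two sample lines below are made up for illustration):

      use strict;
      use warnings;

      my %r;
      # two hypothetical input lines, shaped like the rows of uno.txt / due.txt
      my @lines = ("I 33 C C 0.5 2", "I 33 C C 1 2");

      for (@lines) {
          my @F = split /\s+/;              # what -F"\s+" with -a does per input line
          my $key = join ' ' x 8, @F[0..3]; # first four fields, joined, become the hash key
          push @{ $r{$key} }, [ @F[4, 5] ]; # push an anonymous array holding the last two fields
      }
      # %r now maps the "I 33 C C" key to [[0.5, 2], [1, 2]]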

      Using Data::Dump's dd function as the first thing in the END block, you'll see the data structure:

      ( "I 33 C C", [[0.5, 2], [1, 2]], "I 21 B A", [[1, 6], [1, 6]], "I 40 D D", [[1, 2], [1, 5]], "I 56 A E", [[1, 2]], "I 9 A B", [[0.25, 6], ["0.30", 8]], )

      When all files are processed, the END block comes into play: for each key of the %r hash we use map to process all the arrays contained as values of that key: every first value is added to $x (these come from all the $F[4] values!) and every second value is added to $y (coming from all the $F[5] values). Vars $x and $y are declared with my, so they are reset for every key of the %r hash processed.

      Now that all is ready, and while we are still processing each key of the %r hash, we print the key, a tab, then $x divided by how many values we used (scalar @{$r{$k}}, i.e. the scalar value of the array contained in $r{$k}), which is the average you asked for. Then the total of $y, and we are done.
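      To answer your question directly: written out as a plain loop, the END block is equivalent to something like this sketch (same variable names as in the one-liner, with the sums initialized explicitly):

      foreach my $k (keys %r) {
          my ($x, $y) = (0, 0);              # fresh sums for each key
          for my $pair (@{ $r{$k} }) {
              $x += $pair->[0];              # accumulate all former $F[4] values
              $y += $pair->[1];              # accumulate all former $F[5] values
          }
          my $avg = $x / scalar @{ $r{$k} }; # sum divided by the number of pairs: the column 4 average
          print "$k\t$avg\t$y\n";            # key, average of column 4, total of column 5
      }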

      L*

      There are no rules, there are no thumbs..
      Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
        Thank you very much for the explanation! That makes a lot more sense now.