Merging partially duplicate lines

by K_Edw (Beadle)
on Jan 30, 2016 at 15:22 UTC

K_Edw has asked for the wisdom of the Perl Monks concerning the following question:

Given a set (2 or more) of tab-delimited .txt files in the format of:

File A:
I 9 A B 0.25 6
I 21 B A 1 6
I 33 C C 0.5 2
I 40 D D 1 2
I 56 A E 1 2
File B:
I 9 A B 0.30 8
I 21 B A 1 6
I 33 C C 1 2
I 40 D D 1 5

Is it possible to identify partially duplicate lines based solely on the contents of columns 1-4 and then merge the numerical contents of columns 5-6 into a single line? For example, given the two .txt files above, the desired output would be:

I 9 A B 0.275 14
I 21 B A 1 12
I 33 C C 0.75 4
I 40 D D 1 7
I 56 A E 1 2

In other words, if the contents of columns 1-4 match those of one or more other lines, average the numerical value in column 5 across all matched lines (for example, on the first line: (0.25 + 0.3)/2 = 0.275) and sum the contents of column 6 across all matched lines (for example, on the first line: 6 + 8 = 14). If the contents of columns 1-4 are unique, print the line without modification (for example, the final line of File A, which has no cognate partner within File B).

How could this be done? My knowledge of Perl is rather limited; I presume I would need to read all of the lines into a single array and then perhaps use a %seen hash or the uniq function from List::MoreUtils to begin looking for partially duplicate lines, but I am rather unsure. Thank you in advance for any help.

Replies are listed 'Best First'.
Re: Merging partially duplicate lines
by Corion (Patriarch) on Jan 30, 2016 at 15:57 UTC

    The best approach is to identify the duplicates using the technique found in perlfaq4. Merging the values of two duplicates then sounds like mostly a matter of arithmetic.

    Where in the task do you have problems?
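
    For reference, a minimal adaptation of the perlfaq4 %seen idiom, keyed on the first four columns of tab-separated input (a sketch only, not code from the reply above):

    my %seen;
    while (<>) {
        my @f   = split /\t/;
        my $key = join "\t", @f[0 .. 3];          # columns 1-4 form the key
        print "duplicate key: $key\n" if $seen{$key}++;
    }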

Re: Merging partially duplicate lines
by duelafn (Parson) on Jan 30, 2016 at 19:54 UTC

    Your samples appear to be sorted. If this is true of your actual data, you wouldn't need to keep much state and could just use a merging algorithm, comparing only the first line of each file.

    Comment: This looks like you might be collating statistical responses (column 5 is an average, column 6 a response count?). If so, would you want a weighted average rather than a plain average? (e.g., column 5 from the first lines would be (6 * 0.25 + 0.30 * 8) / (6 + 8) = 0.27857). Of course, I have no idea what you are actually trying to do, so feel free to ignore this if I'm misinterpreting.
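
    For instance, the weighted merge of those first lines could be computed like this (the variable names are purely illustrative):

    my ($avg_a, $n_a) = (0.25, 6);    # columns 5 and 6 from File A's first line
    my ($avg_b, $n_b) = (0.30, 8);    # columns 5 and 6 from File B's first line
    my $n   = $n_a + $n_b;                            # 14
    my $avg = ($avg_a * $n_a + $avg_b * $n_b) / $n;   # 0.27857...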

    Good Day,
        Dean

      That is remarkably similar to the actual purpose of this - I would indeed require a weighted average, but I thought it best to figure out the basics first. My data is sorted, but some lines will simply be missing from some files.

        Here's a database solution; it shows the flexibility of the approach by also computing the weighted average:

        #!perl
        use strict;
        use DBI;

        # create table
        my $dbh = create_db('database.sqlite');

        # load data
        my @files = qw(fileA.txt fileB.txt);
        for my $file (@files){
          load_db($dbh,$file);
        }

        # report
        my $query = 'SELECT A,B,C,D,AVG(E),SUM(F),
          MIN(E),MAX(E),COUNT(*),SUM(E*F)/SUM(F)
          FROM test
          GROUP BY A,B,C,D
          ORDER BY A,B,C,D';
        report($dbh,$query);
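
        The helper subs create_db, load_db and report are not shown in the snippet above; a rough sketch of how they might look, assuming DBD::SQLite is installed and tab-separated input files (the table layout below is only a guess made to match the query):

        sub create_db {
          my ($dbfile) = @_;
          my $dbh = DBI->connect("dbi:SQLite:dbname=$dbfile", '', '',
            { RaiseError => 1, AutoCommit => 1 });
          $dbh->do('DROP TABLE IF EXISTS test');
          $dbh->do('CREATE TABLE test (A TEXT, B INTEGER, C TEXT, D TEXT, E REAL, F INTEGER)');
          return $dbh;
        }

        sub load_db {
          my ($dbh, $file) = @_;
          open my $fh, '<', $file or die "Cannot open $file: $!";
          my $sth = $dbh->prepare('INSERT INTO test (A,B,C,D,E,F) VALUES (?,?,?,?,?,?)');
          while (my $line = <$fh>) {
            chomp $line;
            $sth->execute(split /\t/, $line);
          }
          close $fh;
        }

        sub report {
          my ($dbh, $query) = @_;
          my $rows = $dbh->selectall_arrayref($query);
          print join("\t", @$_), "\n" for @$rows;
        }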
        poj

        I'm not sure I agree with the others about using a database. Generating a string key is generally easy enough, and if your inputs are already sorted, you can process huge files without consuming unreasonable memory. You would need to be certain that they are in fact sorted and that their sort order matches the ordering you create in the parse_line function. A merge which keeps all keys in memory is a bit safer in that respect, but can blow up your RAM if the files are large.

        #!/usr/bin/perl
        use strict;
        use warnings;
        use 5.014;

        open my $A, "<", "A" or die;
        open my $B, "<", "B" or die;

        sorted_merge($A, $B);
        # memory_merge($A, $B);

        sub sorted_merge {
            my @handle = @_;
            my @info;
            for my $fh (@handle) {
                my %h;
                @h{qw/key avg n/} = parse_line(scalar readline($fh));
                push @info, \%h;
            }

            while (1) {
                # smallest key
                my ($next) = sort(grep defined($_), map $$_{key}, @info);
                last unless $next;

                my $sum = 0;
                my $n   = 0;
                for my $i (0..$#handle) {
                    next unless $info[$i]{key} and $info[$i]{key} eq $next;
                    $sum += $info[$i]{avg} * $info[$i]{n};
                    $n   += $info[$i]{n};
                    @{$info[$i]}{qw/key avg n/} = parse_line(scalar readline($handle[$i]));
                }

                next unless $n;
                print_line($next, $sum/$n, $n);
            }
        }

        sub memory_merge {
            my @handle = @_;
            my %data;
            for my $fh (@handle) {
                while (defined(my $line = <$fh>)) {
                    my ($key, $avg, $n) = parse_line($line);
                    if ($data{$key}) {
                        $data{$key}{sum} += $avg * $n;
                        $data{$key}{n}   += $n;
                    } else {
                        $data{$key} = { sum => $avg * $n, n => $n, };
                    }
                }
            }

            for my $key (sort keys(%data)) {
                print_line($key, $data{$key}{sum}/$data{$key}{n}, $data{$key}{n});
            }
        }

        sub print_line {
            my ($key, $avg, $n) = @_;
            my @cols = split /\s+/, $key;
            push @cols, $avg, $n;
            say join "\t", @cols;
        }

        sub parse_line {
            my $line = shift;
            return unless $line;
            my @col = split /\s+/, $line;
            # Format the key so that they sort correctly as strings.
            # Choose padding sizes carefully.
            my $key = sprintf "%-5s %4d %-10s %-10s", @col[0..3];
            my $avg = $col[4];
            my $n   = $col[5];
            return ($key, $avg, $n);
        }

        Good Day,
            Dean

Re: Merging partially duplicate lines -- oneliner deparsed
by Discipulus (Canon) on Jan 30, 2016 at 19:58 UTC
    ..or with a one-liner; it is a bit involved in the END block, but not so hard to read once deparsed:

    perl -F"\s+" -ane "push @{$r{join (' 'x8,@F[0..3]) }}, [@F[4,5]]; END{foreach $k(keys %r){my($x,$y);map {$x+=$$_[0];$y+=$$_[1]} @{$r{$k}};print qq($k\t),($x/scalar @{$r{$k}}),qq(\t$y\n)}}" uno.txt due.txt
    I 33 C C 0.75 4
    I 21 B A 1 12
    I 40 D D 1 7
    I 56 A E 1 2
    I 9 A B 0.275 14

    which deparsed becomes

    perl -MO=Deparse -F"\s+" -ane "push @{$r{join (' 'x8,@F[0..3]) }}, [@F[4,5]]; END{foreach $k(keys %r){my($x,$y);map {$x+=$$_[0];$y+=$$_[1]} @{$r{$k}};print qq($k\t),($x/scalar @{$r{$k}}),qq(\t$y\n)}}" uno.txt due.txt
    LINE: while (defined($_ = <ARGV>)) {
        our(@F) = split(/\s+/, $_, 0);
        push @{$r{join ' ' x 8, @F[0..3]};}, [@F[4, 5]];
        sub END {
            foreach $k (keys %r) {
                my($x, $y);
                map {$x += $$_[0]; $y += $$_[1];} @{$r{$k};};
                print "$k\t", $x / scalar(@{$r{$k};}), "\t$y\n";
            }
        }
        ;
    }
    -e syntax OK

    L*

    PS removed the unused Data::Dump

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
      Thanks. This works well although it is quite difficult for me to read. For example - how and where in the script is the averaging of column 4 taking place?
        You are welcome, K_Edw, and.. sorry, I was in a hurry before dinner..

        The heart of the code is the creation of the needed data structure with

        push @{$r{join (' 'x8,@F[0..3]) }}, [@F[4,5]];

        we create the key of the hash %r as the stringified join of fields 0..3 of the autosplit @F array (see -F"\s+" and -a in perlrun). This gives us the uniqueness of the first four fields, used as a key. The value for that key is treated as an array, and into this array we push another, anonymous array, [@F[4,5]], containing the last two fields. One such array is pushed every time the key is found again across the files being read.

        Using Data::Dump's dd function as the first thing in the END block, you'll see the data structure:

        ( "I 33 C C", [[0.5, 2], [1, 2]], "I 21 B A", [[1, 6], [1, 6]], "I 40 D D", [[1, 2], [1, 5]], "I 56 A E", [[1, 2]], "I 9 A B", [[0.25, 6], ["0.30", 8]], )

        When all files are processed, the END block comes into play: for each key of the %r hash we use map to process all the arrays stored as that key's value: every first element is added to $x (these come from all the $F[4] values!) and every second element is added to $y (coming from all the $F[5] values). The variables $x and $y are declared with my, so they are reset for every key of the %r hash that is processed.

        Now that everything is ready, and while we are still processing the current key of the %r hash, we print the key, a tab, then $x divided by how many values we used (scalar @{$r{$k}}, i.e. the number of elements in the array held in $r{$k}), which is the average you asked for. Then the total value of $y, and we are done.
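
        Spaced out and annotated (the very same print statement from the END block, nothing changed), the averaging happens right here:

        print qq($k\t),                # the key: columns 1-4, then a tab
              ($x/scalar @{$r{$k}}),   # sum of the column-5 values divided by the number of merged lines = the average
              qq(\t$y\n);              # a tab, then the summed column-6 values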

        L*

        There are no rules, there are no thumbs..
        Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Re: Merging partially duplicate lines
by poj (Abbot) on Jan 30, 2016 at 16:08 UTC
    How could this be done?

    Creating a simple database with SQLite would be one way. Failing that, use a hash to aggregate records with the same key fields, keeping a sum and a count for each column.
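
    A minimal sketch of that hash approach (a sketch only, not from this reply; it assumes tab-separated input and the plain average asked for in the question):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # key = columns 1-4 joined; value = [ sum of column 5, sum of column 6, number of lines ]
    my %agg;
    my @order;    # keys in first-seen order

    while (my $line = <>) {
        chomp $line;
        my @f   = split /\t/, $line;
        my $key = join "\t", @f[0 .. 3];
        push @order, $key unless exists $agg{$key};
        $agg{$key}[0] += $f[4];
        $agg{$key}[1] += $f[5];
        $agg{$key}[2]++;
    }

    for my $key (@order) {
        my ($sum5, $sum6, $count) = @{ $agg{$key} };
        print join("\t", $key, $sum5 / $count, $sum6), "\n";
    }

    Run it as perl aggregate.pl fileA.txt fileB.txt (the script name here is arbitrary).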

    Note : I get 7 for the sum here

    I       40      D       D       1       7
    
    poj
Re: Merging partially duplicate lines
by mr_ron (Chaplain) on Jan 30, 2016 at 19:44 UTC

    As poj suggested, these days importing into a relational database like Microsoft Access or SQLite might provide an easy SQL solution to your problem. If you want to treat your files as a text database, then the problem looks like a fit for an old text-database processing tool: Awk. Perl has an awkish/autosplit '-a' mode, suggesting:

    #!/usr/bin/perl -a
    use strict;
    use warnings;

    our %recs;

    my $k = join "\t", @F[0 .. 3];
    if ($recs{$k}) {
        $recs{$k}->{key_count}++;
        $recs{$k}->{rec}->[$_] += $F[$_] for 4, 5;
    } else {
        # careful to copy with [ @F ] not \@F here
        $recs{$k} = {key_count => 1, rec => [ @F ]};
    }

    END {
        $recs{$_}->{rec}->[4] /= $recs{$_}->{key_count} foreach keys %recs;
        print join("\t", @{$recs{$_}->{rec}}), $/ foreach sort {
               $recs{$a}->{rec}->[ 0 ] cmp $recs{$b}->{rec}->[ 0 ]
            || $recs{$a}->{rec}->[ 1 ] <=> $recs{$b}->{rec}->[ 1 ]
            || $recs{$a}->{rec}->[ 2 ] cmp $recs{$b}->{rec}->[ 2 ]
            || $recs{$a}->{rec}->[ 3 ] cmp $recs{$b}->{rec}->[ 3 ]
        } keys %recs;
    }
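
    Saved as, say, merge.pl (a file name chosen only for illustration), it can be run as perl merge.pl fileA.txt fileB.txt. Note that -a on the shebang line implies the -n input loop only on Perl 5.20 and later; on older perls, add -n explicitly (perl -n merge.pl ...).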
    Ron
Re: Merging partially duplicate lines
by marioroy (Prior) on Feb 04, 2016 at 01:35 UTC

    The following solution is based on the version by mr_ron. The @keys array preserves key order. This is posted mainly to demonstrate a version that is light on memory consumption.

    #!/usr/bin/perl -an
    use strict;
    use warnings;

    # usage: perl -an script.pl file.a file.b

    our ( $db, @keys );

    my $key = join "\t", @F[0 .. 3];

    push @keys, $key unless exists $db->{ $key };

    $db->{ $key }[0]++;
    $db->{ $key }[1] += $F[4];
    $db->{ $key }[2] += $F[5];

    END {
        my ( $key, $aref, $cnt );
        while ( @keys ) {
            $key  = shift @keys;
            $aref = delete $db->{ $key };
            ( ( $cnt = $aref->[0] ) > 1 )
                ? print $key."\t".( $aref->[1] / $cnt )."\t".$aref->[2]."\n"
                : print $key."\t".( $aref->[1] )."\t".$aref->[2]."\n";
        }
    }
