Merging partially duplicate lines

by K_Edw (Beadle)
on Jan 30, 2016 at 15:22 UTC

K_Edw has asked for the wisdom of the Perl Monks concerning the following question:

Given a set (2 or more) of tab-delimited .txt files in the format of:

File A:
I 9 A B 0.25 6
I 21 B A 1 6
I 33 C C 0.5 2
I 40 D D 1 2
I 56 A E 1 2
File B:
I 9 A B 0.30 8
I 21 B A 1 6
I 33 C C 1 2
I 40 D D 1 5

Is it possible to identify partially duplicate lines based solely on the contents of columns 1-4 and then merge the numerical contents of columns 5-6 into a single line? For example, given the two .txt files above, the desired output would be:

I 9 A B 0.275 14
I 21 B A 1 12
I 33 C C 0.75 4
I 40 D D 1 7
I 56 A E 1 2

In other words, if the contents of columns 1-4 match those of one or more other lines, average the numerical value in column 5 across all matched lines (for example, on the first line: (0.25 + 0.3)/2 = 0.275) and sum the contents of column 6 across all matched lines (for example, on the first line: 6 + 8 = 14). If the contents of columns 1-4 are unique, print the line without modification (for example, the final line of File A, which has no cognate partner within File B).

How could this be done? My knowledge of Perl is rather limited; I presume I would need to read all of the lines into a single array and then perhaps use a %seen hash or the uniq function from List::MoreUtils to begin looking for partially duplicate lines, but I am rather unsure. Thank you in advance for any help.

Replies are listed 'Best First'.
Re: Merging partially duplicate lines
by Corion (Patriarch) on Jan 30, 2016 at 15:57 UTC

    The best approach is to identify the duplicates using the technique found in perlfaq4. Merging the values of two duplicates then sounds like mostly a matter of arithmetic.

    Where in the task do you have problems?
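
    For reference, a minimal adaptation of the perlfaq4 %seen idiom, keyed on the first four columns of tab-separated input (a sketch only, not code from the reply above):

    my %seen;
    while (<>) {
        my @f   = split /\t/;
        my $key = join "\t", @f[0 .. 3];          # columns 1-4 form the key
        print "duplicate key: $key\n" if $seen{$key}++;
    }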

Re: Merging partially duplicate lines
by duelafn (Parson) on Jan 30, 2016 at 19:54 UTC

    Your samples appear to be sorted. If this is true of your actual data, you wouldn't need to keep much state and could just use a merging algorithm, comparing only the first line of each file.

    Comment: This looks like you might be collating statistical responses (column 5 is an average, column 6 a response count?). If so, would you want a weighted average rather than a plain average? (e.g., column 5 from the first lines would be (6 * 0.25 + 0.30 * 8) / (6 + 8) = 0.27857). Of course, I have no idea what you are actually trying to do, so feel free to ignore this if I'm misinterpreting.
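
    For instance, the weighted merge of those first lines could be computed like this (the variable names are purely illustrative):

    my ($avg_a, $n_a) = (0.25, 6);    # columns 5 and 6 from File A's first line
    my ($avg_b, $n_b) = (0.30, 8);    # columns 5 and 6 from File B's first line
    my $n   = $n_a + $n_b;                            # 14
    my $avg = ($avg_a * $n_a + $avg_b * $n_b) / $n;   # 0.27857...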

    Good Day,
        Dean

      That is remarkably similar to the actual purpose of this - I would indeed require a weighted average, but I thought it best to figure out the basics first. My data is sorted, but some lines will simply be missing from some files.

        Here's a database solution; it shows the flexibility of the approach by also computing the weighted average:

        #!perl
        use strict;
        use DBI;

        # create table
        my $dbh = create_db('database.sqlite');

        # load data
        my @files = qw(fileA.txt fileB.txt);
        for my $file (@files){
          load_db($dbh,$file);
        }

        # report
        my $query = 'SELECT A,B,C,D,AVG(E),SUM(F),
          MIN(E),MAX(E),COUNT(*),SUM(E*F)/SUM(F)
          FROM test
          GROUP BY A,B,C,D
          ORDER BY A,B,C,D';
        report($dbh,$query);
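
        The helper subs create_db, load_db and report are not shown in the snippet above; a rough sketch of how they might look, assuming DBD::SQLite is installed and tab-separated input files (the table layout below is only a guess made to match the query):

        sub create_db {
          my ($dbfile) = @_;
          my $dbh = DBI->connect("dbi:SQLite:dbname=$dbfile", '', '',
            { RaiseError => 1, AutoCommit => 1 });
          $dbh->do('DROP TABLE IF EXISTS test');
          $dbh->do('CREATE TABLE test (A TEXT, B INTEGER, C TEXT, D TEXT, E REAL, F INTEGER)');
          return $dbh;
        }

        sub load_db {
          my ($dbh, $file) = @_;
          open my $fh, '<', $file or die "Cannot open $file: $!";
          my $sth = $dbh->prepare('INSERT INTO test (A,B,C,D,E,F) VALUES (?,?,?,?,?,?)');
          while (my $line = <$fh>) {
            chomp $line;
            $sth->execute(split /\t/, $line);
          }
          close $fh;
        }

        sub report {
          my ($dbh, $query) = @_;
          my $rows = $dbh->selectall_arrayref($query);
          print join("\t", @$_), "\n" for @$rows;
        }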
        poj

        I'm not sure I agree with the others about using a database. Generating a string key is generally easy enough, and if your inputs are already sorted, you can process huge files without consuming unreasonable memory. You would need to be certain that they are in fact sorted and that their sort order matches the ordering you create in the parse_line function. A merge which keeps all keys in memory is a bit safer in that respect, but can blow up your RAM if the files are large.

        #!/usr/bin/perl
        use strict;
        use warnings;
        use 5.014;

        open my $A, "<", "A" or die;
        open my $B, "<", "B" or die;

        sorted_merge($A, $B);
        # memory_merge($A, $B);

        sub sorted_merge {
            my @handle = @_;
            my @info;
            for my $fh (@handle) {
                my %h;
                @h{qw/key avg n/} = parse_line(scalar readline($fh));
                push @info, \%h;
            }

            while (1) {
                # smallest key
                my ($next) = sort(grep defined($_), map $$_{key}, @info);
                last unless $next;

                my $sum = 0;
                my $n   = 0;
                for my $i (0..$#handle) {
                    next unless $info[$i]{key} and $info[$i]{key} eq $next;
                    $sum += $info[$i]{avg} * $info[$i]{n};
                    $n   += $info[$i]{n};
                    @{$info[$i]}{qw/key avg n/} = parse_line(scalar readline($handle[$i]));
                }

                next unless $n;
                print_line($next, $sum/$n, $n);
            }
        }

        sub memory_merge {
            my @handle = @_;
            my %data;
            for my $fh (@handle) {
                while (defined(my $line = <$fh>)) {
                    my ($key, $avg, $n) = parse_line($line);
                    if ($data{$key}) {
                        $data{$key}{sum} += $avg * $n;
                        $data{$key}{n}   += $n;
                    } else {
                        $data{$key} = { sum => $avg * $n, n => $n, };
                    }
                }
            }

            for my $key (sort keys(%data)) {
                print_line($key, $data{$key}{sum}/$data{$key}{n}, $data{$key}{n});
            }
        }

        sub print_line {
            my ($key, $avg, $n) = @_;
            my @cols = split /\s+/, $key;
            push @cols, $avg, $n;
            say join "\t", @cols;
        }

        sub parse_line {
            my $line = shift;
            return unless $line;
            my @col = split /\s+/, $line;
            # Format the key so that they sort correctly as strings.
            # Choose padding sizes carefully.
            my $key = sprintf "%-5s %4d %-10s %-10s", @col[0..3];
            my $avg = $col[4];
            my $n   = $col[5];
            return ($key, $avg, $n);
        }

        Good Day,
            Dean

Re: Merging partially duplicate lines -- oneliner deparsed
by Discipulus (Canon) on Jan 30, 2016 at 19:58 UTC
    ..or with a one-liner; it is a bit involved in the END block, but not so hard to read once deparsed:

    perl -F"\s+" -ane "push @{$r{join (' 'x8,@F[0..3]) }}, [@F[4,5]]; END{foreach $k(keys %r){my($x,$y);map {$x+=$$_[0];$y+=$$_[1]} @{$r{$k}};print qq($k\t),($x/scalar @{$r{$k}}),qq(\t$y\n)}}" uno.txt due.txt
    I 33 C C 0.75 4
    I 21 B A 1 12
    I 40 D D 1 7
    I 56 A E 1 2
    I 9 A B 0.275 14

    which deparsed becomes

    perl -MO=Deparse -F"\s+" -ane "push @{$r{join (' 'x8,@F[0..3]) }}, [@F[4,5]]; END{foreach $k(keys %r){my($x,$y);map {$x+=$$_[0];$y+=$$_[1]} @{$r{$k}};print qq($k\t),($x/scalar @{$r{$k}}),qq(\t$y\n)}}" uno.txt due.txt
    LINE: while (defined($_ = <ARGV>)) {
        our(@F) = split(/\s+/, $_, 0);
        push @{$r{join ' ' x 8, @F[0..3]};}, [@F[4, 5]];
        sub END {
            foreach $k (keys %r) {
                my($x, $y);
                map {$x += $$_[0]; $y += $$_[1];} @{$r{$k};};
                print "$k\t", $x / scalar(@{$r{$k};}), "\t$y\n";
            }
        }
        ;
    }
    -e syntax OK

    L*

    PS removed the unused Data::Dump

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
      Thanks. This works well although it is quite difficult for me to read. For example - how and where in the script is the averaging of column 4 taking place?
        You are welcome, K_Edw, and.. sorry, I was in a hurry before dinner..

        The heart of the code is the creation of the needed data structure with

        push @{$r{join (' 'x8,@F[0..3]) }}, [@F[4,5]];

        we create the key of the hash %r as the stringified join of fields 0..3 of the autosplit @F array (see -F"\s+" and -a in perlrun). This gives us the uniqueness of the first four fields, used as a key. The value for that key is treated as an array, and into this array we push another, anonymous array, [@F[4,5]], containing the last two fields. One such array is pushed every time the key is found again across the files being read.

        Using Data::Dump's dd function as the first thing in the END block, you'll see the data structure:

        ( "I 33 C C", [[0.5, 2], [1, 2]], "I 21 B A", [[1, 6], [1, 6]], "I 40 D D", [[1, 2], [1, 5]], "I 56 A E", [[1, 2]], "I 9 A B", [[0.25, 6], ["0.30", 8]], )

        When all files are processed, the END block comes into play: for each key of the %r hash we use map to process all the arrays stored as that key's value: every first element is added to $x (these come from all the $F[4] values!) and every second element is added to $y (coming from all the $F[5] values). The variables $x and $y are declared with my, so they are reset for every key of the %r hash that is processed.

        Now that everything is ready, and while we are still processing the current key of the %r hash, we print the key, a tab, then $x divided by how many values we used (scalar @{$r{$k}}, i.e. the number of elements in the array held in $r{$k}), which is the average you asked for. Then the total value of $y, and we are done.
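
        Spaced out and annotated (the very same print statement from the END block, nothing changed), the averaging happens right here:

        print qq($k\t),                # the key: columns 1-4, then a tab
              ($x/scalar @{$r{$k}}),   # sum of the column-5 values divided by the number of merged lines = the average
              qq(\t$y\n);              # a tab, then the summed column-6 values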

        L*

        There are no rules, there are no thumbs..
        Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Re: Merging partially duplicate lines
by poj (Abbot) on Jan 30, 2016 at 16:08 UTC
    How could this be done?

    Creating a simple database with SQLite would be one way. Failing that, use a hash to aggregate records with the same key fields, keeping a sum and a count for each column.
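
    A minimal sketch of that hash approach (a sketch only, not from this reply; it assumes tab-separated input and the plain average asked for in the question):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # key = columns 1-4 joined; value = [ sum of column 5, sum of column 6, number of lines ]
    my %agg;
    my @order;    # keys in first-seen order

    while (my $line = <>) {
        chomp $line;
        my @f   = split /\t/, $line;
        my $key = join "\t", @f[0 .. 3];
        push @order, $key unless exists $agg{$key};
        $agg{$key}[0] += $f[4];
        $agg{$key}[1] += $f[5];
        $agg{$key}[2]++;
    }

    for my $key (@order) {
        my ($sum5, $sum6, $count) = @{ $agg{$key} };
        print join("\t", $key, $sum5 / $count, $sum6), "\n";
    }

    Run it as perl aggregate.pl fileA.txt fileB.txt (the script name here is arbitrary).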

    Note : I get 7 for the sum here

    I       40      D       D       1       7
    
    poj
Re: Merging partially duplicate lines
by mr_ron (Chaplain) on Jan 30, 2016 at 19:44 UTC

    As poj suggested, these days importing into a relational database like Microsoft Access or SQLite might provide an easy SQL solution to your problem. If you want to treat your files as a text database, then the problem looks like a fit for an old text-database processing tool: Awk. Perl has an awkish/autosplit '-a' mode, suggesting:

    #!/usr/bin/perl -a
    use strict;
    use warnings;

    our %recs;

    my $k = join "\t", @F[0 .. 3];
    if ($recs{$k}) {
        $recs{$k}->{key_count}++;
        $recs{$k}->{rec}->[$_] += $F[$_] for 4, 5;
    } else {
        # careful to copy with [ @F ] not \@F here
        $recs{$k} = {key_count => 1, rec => [ @F ]};
    }

    END {
        $recs{$_}->{rec}->[4] /= $recs{$_}->{key_count} foreach keys %recs;
        print join("\t", @{$recs{$_}->{rec}}), $/ foreach sort {
               $recs{$a}->{rec}->[ 0 ] cmp $recs{$b}->{rec}->[ 0 ]
            || $recs{$a}->{rec}->[ 1 ] <=> $recs{$b}->{rec}->[ 1 ]
            || $recs{$a}->{rec}->[ 2 ] cmp $recs{$b}->{rec}->[ 2 ]
            || $recs{$a}->{rec}->[ 3 ] cmp $recs{$b}->{rec}->[ 3 ]
        } keys %recs;
    }
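
    Saved as, say, merge.pl (a file name chosen only for illustration), it can be run as perl merge.pl fileA.txt fileB.txt. Note that -a on the shebang line implies the -n input loop only on Perl 5.20 and later; on older perls, add -n explicitly (perl -n merge.pl ...).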
    Ron
Re: Merging partially duplicate lines
by marioroy (Prior) on Feb 04, 2016 at 01:35 UTC

    The following solution is based on the version by mr_ron. The @keys array preserves key order. This is posted mainly to demonstrate a version that is light on memory consumption.

    #!/usr/bin/perl -an
    use strict;
    use warnings;

    # usage: perl -an script.pl file.a file.b

    our ( $db, @keys );

    my $key = join "\t", @F[0 .. 3];

    push @keys, $key unless exists $db->{ $key };

    $db->{ $key }[0]++;
    $db->{ $key }[1] += $F[4];
    $db->{ $key }[2] += $F[5];

    END {
        my ( $key, $aref, $cnt );
        while ( @keys ) {
            $key  = shift @keys;
            $aref = delete $db->{ $key };
            ( ( $cnt = $aref->[0] ) > 1 )
                ? print $key."\t".( $aref->[1] / $cnt )."\t".$aref->[2]."\n"
                : print $key."\t".( $aref->[1] )."\t".$aref->[2]."\n";
        }
    }
