I'm not sure I agree with the others about using a database. Generating a string key is generally easy enough, and if your inputs are already sorted, you can process huge files without consuming unreasonable memory, since only one pending line per file is held at a time. You would need to be certain that the files are in fact sorted and that their order matches the string order of the keys you build in the parse_line function. A merge which keeps all keys in memory is a bit safer in that respect, but can blow up your RAM if the files are large.
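If you want to verify the "already sorted" assumption before trusting the streaming merge, something like the following could work. This is only a minimal sketch: make_key and first_unsorted_line are hypothetical helpers, not part of the script below, the key format is copied from parse_line, and the sample rows are made up for illustration.

```perl
#!/usr/bin/perl
use strict; use warnings;

# Hypothetical helper: builds the same padded key as parse_line.
sub make_key {
    my @col = split ' ', shift;
    return sprintf "%-5s %4d %-10s %-10s", @col[0..3];
}

# Returns the line number of the first out-of-order line, or 0 if the
# handle's lines are in ascending key order.
sub first_unsorted_line {
    my ($fh) = @_;
    my $prev;
    while (defined(my $line = <$fh>)) {
        next unless $line =~ /\S/;              # skip blank lines
        my $key = make_key($line);
        return $. if defined $prev and $key lt $prev;
        $prev = $key;
    }
    return 0;
}

# Demo on made-up in-memory data (columns: 4 key fields, avg, n).
my $sorted   = "chr1 5 a b 1.0 2\nchr1 10 a b 2.0 4\n";
my $unsorted = "chr1 10 a b 2.0 4\nchr1 5 a b 1.0 2\n";
for my $data ($sorted, $unsorted) {
    open my $fh, "<", \$data or die "Cannot open in-memory file: $!";
    my $bad = first_unsorted_line($fh);
    print $bad ? "out of order at line $bad\n" : "sorted\n";
}
```

Running the check once per input file costs one extra pass, but it reads line by line, so memory use stays flat.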

#!/usr/bin/perl
use strict; use warnings; use 5.014;

open my $A, "<", "A" or die "Cannot open A: $!";
open my $B, "<", "B" or die "Cannot open B: $!";

sorted_merge($A, $B);
# memory_merge($A, $B);

sub sorted_merge {
    my @handle = @_;
    my @info;

    for my $fh (@handle) {
        my %h;
        @h{qw/key avg n/} = parse_line(scalar readline($fh));
        push @info, \%h;
    }

    while (1) {
        # smallest key among the handles that still have data
        my ($next) = sort(grep defined($_), map $$_{key}, @info);
        last unless defined $next;

        my $sum = 0;
        my $n = 0;
        for my $i (0..$#handle) {
            next unless $info[$i]{key} and $info[$i]{key} eq $next;

            $sum += $info[$i]{avg} * $info[$i]{n};
            $n   += $info[$i]{n};
            @{$info[$i]}{qw/key avg n/} = parse_line(scalar readline($handle[$i]));
        }

        next unless $n;
        print_line($next, $sum/$n, $n);
    }
}

sub memory_merge {
    my @handle = @_;
    my %data;

    for my $fh (@handle) {
        while (defined(my $line = <$fh>)) {
            my ($key, $avg, $n) = parse_line($line);

            if ($data{$key}) {
                $data{$key}{sum} += $avg * $n;
                $data{$key}{n}   += $n;
            }
            else {
                $data{$key} = {
                    sum => $avg * $n,
                    n   => $n,
                };
            }
        }
    }

    for my $key (sort keys(%data)) {
        print_line($key, $data{$key}{sum}/$data{$key}{n}, $data{$key}{n});
    }
}


sub print_line {
    my ($key, $avg, $n) = @_;
    my @cols = split /\s+/, $key;
    push @cols, $avg, $n;
    say join "\t", @cols;
}

sub parse_line {
    my $line = shift;
    return unless $line;
    my @col = split ' ', $line;  # awk-style split: also ignores leading whitespace
    # Format the key so that they sort correctly as strings.
    # Choose padding sizes carefully.
    my $key = sprintf "%-5s %4d %-10s %-10s", @col[0..3];
    my $avg = $col[4];
    my $n   = $col[5];
    return ($key, $avg, $n);
}
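To see why the padding in parse_line matters, here is a small stand-alone sketch. sort_with is a hypothetical helper and the numbers are made up; it sorts the formatted strings the same way the merge compares keys, then converts them back to numbers for display.

```perl
#!/usr/bin/perl
use strict; use warnings; use 5.014;

# Hypothetical helper: format each number with $fmt, sort the resulting
# strings, and return them as numbers again.
sub sort_with {
    my ($fmt, @nums) = @_;
    return map { $_ + 0 } sort map { sprintf $fmt, $_ } @nums;
}

my @pos = (9, 10, 9999, 12345);

say join " ", sort_with("%d",  @pos);   # no padding: 10 12345 9 9999
say join " ", sort_with("%4d", @pos);   # too narrow: 9 10 12345 9999
say join " ", sort_with("%6d", @pos);   # wide enough: 9 10 9999 12345
```

A field only sorts correctly as a string while every value fits in its width, so each width should be sized for the largest value you expect in that column.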
Good Day,
Dean

In reply to Re^3: Merging partially duplicate lines by duelafn
in thread Merging partially duplicate lines by K_Edw
