#!/usr/bin/perl

=head1 NAME

venn-list - combine histogram files into one union histogram with source keys

=head1 SYNOPSIS

    venn-list one.hist two.hist red.hist blue.hist ...

=head1 DESCRIPTION

This takes all the histogram files named on the command line
(e.g. *.hist), and outputs a single combined histogram that reports
the union of all the inputs, along with information on the source(s)
for each element of the union.

A histogram is simply the output of a unix "sort | uniq -c" command
line or anything equivalent: each line consists of a number and a
string, where the number indicates how many times the given string was
found to occur in some set of source data.  There may be whitespace
before digits, and any amount of whitespace (tab and/or space
characters) may fall between the digits and the following string.
(The string itself may contain internal whitespace -- we only split
the initial digits from the rest of the line.)  The ordering of lines
within each input histogram file is not important.
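
As a minimal sketch of that format (the sample line and the variable
names below are hypothetical, chosen only for illustration), a single
histogram line can be split into its count and its string like this:

    my $line = "   12 New York";
    $line =~ s/^\s+//;                            # drop leading whitespace
    my ( $count, $string ) = split( /\s+/, $line, 2 );
    # $count is now "12" and $string is "New York"

The limit of 2 on the split is what keeps internal whitespace in the
string intact.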

We assume that any given run will not involve more than 52 distinct
histogram files.  Each file is assigned a letter code (A-Za-z), and a
"key legend" is printed to STDOUT first, to map the given file names
to the assigned letters.  Following the legend, there is a blank line,
and then the complete list of distinct types found among all inputs.
This is sorted ASCII-betically according to the strings (not according
to relative frequency, but of course unix/GNU "sort" can be used on
the output if desired).

In addition to listing all types with their summed frequencies from
all the histogram files, two extra fields are inserted between the
frequency and the string value: the first added field will be the
number of input histograms containing the given type, and the second
will be a string of one or more letters, each letter representing a
specific input file that contained the given string value.  The four
fields (union frequency, number of sources, keys to sources, string
value) are tab-delimited.
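
As a rough illustration (the file names and contents here are
hypothetical), suppose one.hist contains

      3 apple
      1 banana

and two.hist contains

      2 apple
      5 cherry

Then "venn-list one.hist two.hist" should print something like the
following, where the four fields of each data line are really
separated by single tabs rather than the spaces shown here:

       A : one.hist
       B : two.hist

    5   2   AB  apple
    1   1   A   banana
    5   1   B   cherry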

=head1 NIT-PICKY DETAIL

As currently written, venn-list assumes that trailing whitespace
characters (at the end of a line, following the string value) should
be ignored, and it deletes them before adding a given type into the
union; "sort | uniq -c" does not share this assumption, and will count
and sort "string" separately from "string ".  If your input data
includes lines with trailing spaces or tabs, you'll find that these
entries are being combined (summed together) by venn-list.

=head1 AUTHOR

David Graff

=cut

use strict;

my $Usage = "$0 one.hist two.hist red.hist blue.hist ... > union.hist\n";
die $Usage unless ( @ARGV > 1 and -f $ARGV[0] );

my %file;
my $fid = 'A';
while ( @ARGV ) {
    print "   $fid : $ARGV[0]\n";
    $file{$fid++} = shift @ARGV;
    if ( $fid eq 'AA' ) {
        $fid = 'a';
    }
    elsif ( $fid eq 'aa' ) {
        # All 52 letter codes are used up: report any leftovers and stop.
        warn sprintf( "$0: stopped at 52 inputs (dropped last %d files)\n",
                      scalar @ARGV ) if @ARGV;
        last;
    }
}
print "\n";

my %type;
for my $fkey ( sort keys %file ) {
    open( my $in, '<', $file{$fkey} ) or do { warn "$file{$fkey}: $!\n"; next };
    while (<$in>) {
        s/\s+$//;
        s/^\s+//;
        my ( $frq, $typ ) = split( /\s+/, $_, 2 );
        next unless defined $typ;   # skip blank lines and lines with no string
        $type{$typ}{src} .= $fkey;
        $type{$typ}{frq} += $frq;
    }
    close $in;
}

for my $typ ( sort keys %type ) {
    my $src = $type{$typ}{src};
    $src =~ s/(\w)\1+/$1/g;   # get rid of repeated letters (AAABBC -> ABC)
    my $len = length( $src );
    printf( "%d\t%d\t%s\t%s\n",
            $type{$typ}{frq}, $len,
            $src, $typ );   # print the de-duplicated key string
}