venn-list: produce union of histograms

Category:	Utility Scripts
Author/Contact Info	graff
Description:	I needed this in order to assemble word counts from 8 different sources of text, keeping track of which words came from which sources, and what the overall word frequencies were. So simple, yet so useful (the POD is longer than the code itself).
#!/usr/bin/perl

=head1 NAME

venn-list

=head1 SYNOPSIS

venn-list one.hist two.hist red.hist blue.hist ...

=head1 DESCRIPTION

This takes all the histogram files named on the command line
(e.g. *.hist), and outputs a single combined histogram that reports
the union of all the inputs, along with information on the source(s)
for each element of the union.

A histogram is simply the output of a unix "sort | uniq -c" command
line or anything equivalent: each line consists of a number and a
string, where the number indicates how many times the given string
was found to occur in some set of source data. There may be
whitespace before the digits, and any amount of whitespace (tab
and/or space characters) may fall between the digits and the
following string. (The string itself may contain internal whitespace
-- we only split the initial digits from the rest of the line.) The
ordering of lines within each input histogram file is not important.

We assume that any given run will not involve more than 52 distinct
histogram files. Each file is assigned a letter code (A-Za-z), and a
"key legend" is printed to STDOUT first, to map the given file names
to the assigned letters.

Following the legend, there is a blank line, and then the complete
list of distinct types found among all inputs. This is sorted
ASCII-betically according to the strings (not according to relative
frequency, but of course unix/GNU "sort" can be used on the output if
desired).

In addition to listing all types with their summed frequencies from
all the histogram files, two extra fields are inserted between the
frequency and the string value: the first added field will be the
number of input histograms containing the given type, and the second
will be a string of one or more letters, each letter representing a
specific input file that contained the given string value. The four
fields (union frequency, number of sources, keys to sources, string
value) are tab-delimited.

=head1 NIT-PICKY DETAIL

As currently written, venn-list assumes that trailing whitespace
characters (at the end of a line, following the string value) should
be ignored, and it deletes them before adding a given type into the
union; "sort | uniq -c" does not share this assumption, and will
count and sort "string" separately from "string ". If your input data
includes lines with trailing spaces or tabs, you'll find that these
entries are being combined (summed together) by venn-list.

=head1 AUTHOR

David Graff

=cut

use strict;

my $Usage = "$0 one.hist two.hist red.hist blue.hist ... > union.hist\n";

die $Usage unless ( @ARGV > 1 and -f $ARGV[0] );

# assign a letter code to each input file and print the key legend
my %file;
my $fid = 'A';
while ( @ARGV ) {
    print " $fid : $ARGV[0]\n";
    $file{$fid++} = shift @ARGV;
    if ( $fid eq 'AA' ) {    # 'Z' increments to 'AA': switch to lower case
        $fid = 'a';
    }
    elsif ( $fid eq 'aa' and @ARGV ) {    # 'z' increments to 'aa': out of letters
        warn sprintf( "$0: stopped at 52 inputs (dropped last %d files)\n",
                      scalar @ARGV );
        last;    # without this, codes 'aa', 'ab', ... would be assigned
    }
}
print "\n";

# read each histogram, summing counts and recording the source letter
my %type;
for my $fkey ( sort keys %file ) {
    open( my $in, '<', $file{$fkey} )
        or do { warn "$file{$fkey}: $!\n"; next };
    while (<$in>) {
        s/\s+$//;
        s/^\s+//;
        my ( $frq, $typ ) = split( /\s+/, $_, 2 );
        $type{$typ}{src} .= $fkey;
        $type{$typ}{frq} += $frq;
    }
    close $in;
}

# print the union: frequency, number of sources, source keys, string
for my $typ ( sort keys %type ) {
    my $src = $type{$typ}{src};
    $src =~ s/(\w)\1+/$1/g;  # get rid of repeated letters (AAABBC -> ABC)
    my $len = length( $src );
    # print the de-duplicated key string, so it agrees with the source count
    printf( "%d\t%d\t%s\t%s\n", $type{$typ}{frq}, $len, $src, $typ );
}
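
To make the four-field output concrete, here is a minimal sketch of the same merge applied to two small histograms held in memory instead of files. The names merge_histograms and format_union and the sample data are hypothetical, introduced only for illustration; they are not part of the script above. The sketch assumes Perl 5.10 or later (for the // operator), and it ignores blank lines, which is a choice made here rather than something the script does.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: takes a list of [ $letter, $text ] pairs and returns
# a hash ref of  string => { frq => summed count, src => source letters }.
sub merge_histograms {
    my @inputs = @_;
    my %type;
    for my $input (@inputs) {
        my ( $fkey, $text ) = @$input;
        for my $line ( split /\n/, $text ) {
            # trim both ends, then split the leading count from the rest
            $line =~ s/\s+$//;
            $line =~ s/^\s+//;
            next unless length $line;       # sketch-only: ignore blank lines
            my ( $frq, $typ ) = split /\s+/, $line, 2;
            $typ = '' unless defined $typ;  # a count with no string after it
            # record each source letter once, even if a string repeats
            $type{$typ}{src} .= $fkey
                unless index( $type{$typ}{src} // '', $fkey ) >= 0;
            $type{$typ}{frq} += $frq;
        }
    }
    return \%type;
}

# Hypothetical helper: render the union as tab-delimited rows of
# union frequency, number of sources, keys to sources, string value.
sub format_union {
    my ($type) = @_;
    my @rows;
    for my $typ ( sort keys %$type ) {
        my $src = $type->{$typ}{src};
        push @rows, join( "\t", $type->{$typ}{frq}, length($src), $src, $typ );
    }
    return @rows;
}

# Made-up sample data in "sort | uniq -c" layout.
my $one = "   3 apple\n   1 banana\n";
my $two = "   2 apple\n   5 cherry pie\n";

print "$_\n" for format_union( merge_histograms( [ A => $one ], [ B => $two ] ) );
```

With this sample data, "apple" occurs in both inputs, so its row carries the summed count 5, a source count of 2 and the key string AB, while "cherry pie" keeps its internal space because only the leading count is split off.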
