distr - show distribution of column values

Category:	Utility Scripts
Author/Contact Info:	Corion
Description:	This program prints a quick tally of the distinct values found in one column of a delimited file. I mainly use it to find the most common date value in a file, so that I can rename the file to that date. It is also very convenient for getting a quick overview of the distribution of value lengths, especially for numbers (for example via --transform length). In --max mode, I'm currently "confident" that the most frequent value is the right one if it occurs in more than 60% of the rows of the sample I'm taking. This has proven sufficient in practice, but a better approach would be an estimator that determines the sample size, or keeps expanding the sample until there is enough confidence in the mode.
#!/usr/bin/perl -w
use strict;
use Getopt::Long;

GetOptions(
    'lines|n:i'     => \my $lines,
    'column|c:i'    => \my $column,
    'sep|s:s'       => \my $separator,
    'transform|f:s' => \my $transform,
    'max|m'         => \my $maximum_only,
);
$lines     ||= 10000;
$column    ||= 1;
$separator ||= ";"; # should come from the input, but...

$column--; # adjust from human (1-based) to Perl (0-based)

my %vals;
my @F;
my $line = 0; # number of rows read so far

sub transform { $_[0] };
if ($transform) {
    # Replace the identity transform with the named function, e.g. "length"
    no warnings 'redefine';
    eval <<CODE;
        sub transform { $transform(\$_[0]) };
CODE
    die "Couldn't compile transform '$transform': $@" if $@;
};

FILE: for my $file (@ARGV) {
    my $fh;
    if ($file =~ /\.gz|\.ebcdic/) {
        open $fh, '-|', 'gzcat', $file
            or die "Couldn't open '$file': $!";
    } else {
        open $fh, '<', $file
            or die "Couldn't open '$file': $!";
    };
    while (<$fh>) {
        chomp; # keep the newline out of the last column's value
        @F = split /$separator/o;
        $vals{ transform($F[ $column ]) }++;
        last FILE if ++$line >= $lines; # stop after exactly $lines rows
    };
};

for (sort { $vals{$b} <=> $vals{$a} } keys %vals) {
    if ($maximum_only) {
        # Judge confidence against the rows actually read, not the requested limit
        if ($vals{$_} / $line > 0.6) {
            print "$_\n";
            last
        } else {
            die "No confidence in '$_': Only $vals{$_} out of $line values match\n";
        };
    } else {
        print "$_: $vals{$_}\n";
    };
};
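
The description wishes for an estimator that expands the sample until there is enough confidence in the mode. The following is only a minimal sketch of one way that could look, not part of the script above. The helper name confident_mode, its options (threshold, batch, z) and the choice of a normal-approximation lower bound on the mode's share are all assumptions made for illustration. It needs Perl 5.10 or later for the // operator.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): pulls values from the
# iterator $next_value and returns ($mode, $share, $rows_used) as soon as the
# lower bound of an approximate 95% interval for the mode's share exceeds
# the threshold. Returns an empty list if the data runs out first.
sub confident_mode {
    my ($next_value, %opt) = @_;
    my $threshold = $opt{threshold} // 0.6;   # share the mode must exceed
    my $batch     = $opt{batch}     // 100;   # rows in the first sample
    my $z         = $opt{z}         // 1.96;  # ~95% normal quantile (assumed)
    my %count;
    my $n         = 0;                        # rows sampled so far
    my $exhausted = 0;

    until ($exhausted) {
        # Grow the sample up to the current target size.
        while ($n < $batch) {
            my $value = $next_value->();
            if (!defined $value) { $exhausted = 1; last }
            $count{$value}++;
            $n++;
        }
        last unless $n;

        # Most frequent value so far (ties broken by string order).
        my ($mode) = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
        my $p     = $count{$mode} / $n;
        my $lower = $p - $z * sqrt($p * (1 - $p) / $n);
        return ($mode, $p, $n) if $lower > $threshold;

        $batch *= 2;   # not confident yet: double the sample and retry
    }
    return;
}

# Usage: feed values from an in-memory list instead of a file.
my @queue = (('2024-01-05') x 80, ('2024-01-06') x 20);
my ($mode, $share, $used) = confident_mode(sub { shift @queue });
print "$mode ($share of $used rows)\n" if defined $mode;
```

Doubling the sample keeps the number of re-checks small. The normal approximation is crude for small samples or for shares close to 0 or 1, so the first batch should not be too small.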



	PerlMonks