Description: |
This program returns a quick tally of the different values in a column. My primary use for it is to find the most common date value in a file, so that I can rename the file after that date.
It is also a convenient way to get a quick overview of the distribution
of lengths, especially for numbers (for example with length as the transform).
Currently, I'm "confident" that I'm picking the right value as the most common one if it occurs in more than 60% of the rows of the sample I'm taking. This has proven sufficient, but a better approach would be an estimator that determines the size of the sample, or expands the sample for as long as there is not enough confidence in the mode. |
#!/usr/bin/perl -w
use strict;
use Getopt::Long;
GetOptions(
    'lines|n:i'     => \my $lines,
    'column|c:i'    => \my $column,
    'sep|s:s'       => \my $separator,
    'transform|f:s' => \my $transform,
    'max|m'         => \my $maximum_only,
);
$lines ||= 10000;
$column ||= 1;
$separator ||= ";"; # should be detected from the input, but...
$column--; # adjust from human to Perl
my %vals;
my @F;
my $line=0;
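# Default transform: pass the value through unchanged
sub transform { $_[0] }
if ($transform) {
    no warnings 'redefine';
    # Redefine transform() to call the function named on the command line
    eval <<CODE;
sub transform { $transform(\$_[0]) };
CODE
    die "Couldn't compile transform '$transform': $@" if $@;
}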
FILE: for my $file (@ARGV) {
    my $fh;
    if ($file =~ /\.gz|\.ebcdic/) {
        open $fh, '-|', 'gzcat', $file
            or die "Couldn't open '$file': $!";
    } else {
        open $fh, '<', $file
            or die "Couldn't open '$file': $!";
    }
    while (<$fh>) {
        chomp; # keep the line ending out of the last column
        @F = split /$separator/o;
        $vals{ transform($F[ $column ]) }++;
        last FILE if ++$line >= $lines; # stop after exactly $lines rows
    }
}
for (sort { $vals{$b} <=> $vals{$a} } keys %vals) {
    if ($maximum_only) {
        # $line is the number of rows actually read, which can be
        # smaller than $lines for short files
        if ($vals{$_} / $line > 0.6) {
            print "$_\n";
            last
        } else {
            die "No confidence in '$_': Only $vals{$_} out of $line values match\n";
        }
    } else {
        print "$_: $vals{$_}\n";
    }
}
|
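The description wishes for an estimator that keeps expanding the sample until there is enough confidence in the mode. The following is only a rough sketch of one way that could look, not part of the program above. It swaps the plain 60% share for the lower end of a Wilson score interval, which is my own choice of technique, and it assumes that the rows read so far behave like a random sample of the file (sorted input breaks that assumption). The names wilson_lower and confident_mode, the batch, threshold and max_rows options, and the z value of 1.96 are all hypothetical. It needs Perl 5.10 or later for the // operator.

```perl
use strict;
use warnings;

# Lower end of the Wilson score interval for $hits matches in $n rows.
# $z = 1.96 corresponds to roughly 95% confidence (an assumed default).
sub wilson_lower {
    my ($hits, $n, $z) = @_;
    return 0 unless $n;
    $z //= 1.96;
    my $p      = $hits / $n;
    my $z2     = $z * $z;
    my $centre = $p + $z2 / (2 * $n);
    my $margin = $z * sqrt( ($p * (1 - $p) + $z2 / (4 * $n)) / $n );
    return ($centre - $margin) / (1 + $z2 / $n);
}

# Hypothetical helper: $next_value is a code ref that returns one column
# value per call and undef once the input is exhausted.
# Returns ($mode, $rows_read), with $mode undefined if no value reached
# the threshold before the input or the row limit ran out.
sub confident_mode {
    my ($next_value, %opt) = @_;
    my $batch     = $opt{batch}     // 1000;
    my $threshold = $opt{threshold} // 0.6;
    my $max_rows  = $opt{max_rows}  // 1_000_000;

    my (%count, $best);
    my $n         = 0;
    my $exhausted = 0;
    until ($exhausted or $n >= $max_rows) {
        # Expand the sample by one batch
        for (1 .. $batch) {
            my $v = $next_value->();
            if (!defined $v) { $exhausted = 1; last }
            $n++;
            $count{$v}++;
            # Only $v's count changed, so the mode is the old best or $v
            $best = $v if !defined $best or $count{$v} > $count{$best};
        }
        # Stop as soon as the pessimistic estimate clears the threshold
        return ($best, $n)
            if defined $best
            and wilson_lower($count{$best}, $n) >= $threshold;
    }
    return (undef, $n);
}

# Usage with made-up in-memory data: 900 rows of one date, 100 of another
my @rows = (('2024-01-05') x 900, ('2024-01-06') x 100);
my $i = 0;
my ($mode, $seen) = confident_mode(
    sub { $i < @rows ? $rows[$i++] : undef },
    batch => 100,
);
print defined $mode
    ? "$mode (after $seen rows)\n"
    : "No confident mode after $seen rows\n";
```

Because the bound tightens as more rows are read, a dominant value is accepted after a single batch, while a borderline one keeps the loop reading until max_rows or the end of the input.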