in reply to Contextual/categorical Histogram

Hello f77coder,

It seems to me that the code you’ve shown is already doing most of what you need. I’ve tweaked it a bit and added some (admittedly naïve[1]) code to generate the histogram:

#! perl
use strict;
use warnings;
use Data::Dump;
use List::Util 'max';

# 1. Configuration
use constant UNIT => '* ';
my @required_keys = qw(5 a foo);

# 2. Read in and count the data
my %counts = map { $_ => 0 } @required_keys;

while (<DATA>) {
    ++$counts{$_} for split;
}

dd \%counts;    # Verify hash contents

# 3. Generate the histogram
print "\nHistogram:\n";
my $max_len = max map { length } keys %counts;

for (sort keys %counts) {
    printf "%*s: ", $max_len, $_;
    print UNIT for 1 .. $counts{$_};
    print "\n";
}

__DATA__
a -2 3 b 0xffff c 2 b a 4 a a 200 0xffff 17 a a c 3 200 201 b -2 b a b c a a 2 c -2

Output:

14:14 >perl 958_SoPW.pl
{
  "-2"     => 3,
  "0xffff" => 2,
  "17"     => 1,
  "2"      => 2,
  "200"    => 2,
  "201"    => 1,
  "3"      => 2,
  "4"      => 1,
  "5"      => 0,
  "a"      => 9,
  "b"      => 5,
  "c"      => 4,
  "foo"    => 0,
}

Histogram:
    -2: * * *
0xffff: * *
    17: *
     2: * *
   200: * *
   201: *
     3: * *
     4: *
     5:
     a: * * * * * * * * *
     b: * * * * *
     c: * * * *
   foo:

14:14 >

Note: If keys are known in advance, they can be added to @required_keys; this allows zero-frequency keys to appear in the histogram.

OK, I’m fairly sure this isn’t what you wanted, but perhaps by explaining where it falls short you can clarify what you mean by a “contextual/categorical” histogram.

Anyway, hope it helps,

[1] Because it doesn’t attempt to scale the output when the frequencies become too large.

Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re^2: Contextual/categorical Histogram
by f77coder (Beadle) on Aug 03, 2014 at 18:35 UTC

    Many thanks for the help. I'm looking to do a rolling contextual histogram as data arrives. This is like a poor man's data classifier.

    For comparing two histograms, are there fast methods (there are GB of lines) for doing intersect, xor, or join?

      Performing an intersection, xor (symmetric difference), or join (union) operation on two histograms is fairly straightforward when each histogram is held in a hash.
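      A minimal sketch of the three operations, assuming each histogram is a hash of key => count. The sample data, the sub names (hist_intersection, hist_xor, hist_union), and the choice to sum the counts of shared keys in the union are all illustrative assumptions, not necessarily the only sensible definitions:

```perl
#! perl
use strict;
use warnings;

# Hypothetical sample histograms (key => count); not data from this thread
my %h1 = (a => 9, b => 5, c => 4);
my %h2 = (b => 2, c => 1, d => 7);

# Intersection: keys present in both histograms
sub hist_intersection {
    my ($x, $y) = @_;
    my @keys = sort grep { exists $y->{$_} } keys %$x;
    return @keys;
}

# Symmetric difference (xor): keys present in exactly one histogram
sub hist_xor {
    my ($x, $y) = @_;
    my @keys = sort grep { exists $x->{$_} xor exists $y->{$_} }
                    (keys %$x, keys %$y);
    return @keys;
}

# Union (join): every key from either histogram.
# Assumption: counts for keys found in both are added together.
sub hist_union {
    my ($x, $y) = @_;
    my %sum = %$x;
    $sum{$_} += $y->{$_} for keys %$y;
    return \%sum;
}

my @both   = hist_intersection(\%h1, \%h2);
my @either = hist_xor(\%h1, \%h2);
my $union  = hist_union(\%h1, \%h2);

print "Intersection: @both\n";
print "Xor:          @either\n";
print "Union:        ",
    join(', ', map { "$_=$union->{$_}" } sort keys %$union), "\n";
```

      Each operation makes a single pass over the keys, so the cost is linear in the number of distinct keys; the limiting factor is that both hashes must fit in memory at once.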

      However, it is doubtful that this approach will scale to accommodate hashes containing gigabytes of data. For that scenario, you should probably be looking to use a database.

      Hope that helps,

      Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,