f77coder has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I would like to create a contextual/categorical histogram. I've looked at Histogram and others, but they seem to be numeric only.

Here's the pseudocode I have in mind, but I'm getting confused about keys, hashes, and arrays.

My input is read line by line from a file; there are M columns.

1. split the line into my @array=split @colref ?

2. shove results from 1 into a hash? my %hash= @array?

3. count duplicate items ? $counts{$_}++ for @my_array;

4. add counts to keys in hash? ;

5. get next line

6. check if key exists?

7. if key exists, increment count; else add new key

8. next line
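The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the sample data, the in-memory filehandle, and whitespace as the column separator are all made up for the example. Note that `$counts{$_}++` already covers steps 6 and 7, because incrementing a missing key creates it with a count of 1.

```perl
use strict;
use warnings;

# Hypothetical sample input standing in for the real file:
# whitespace-separated columns, one record per line.
my $data = "a b c\nb a 2\na 0xffff b\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my %counts;
while (my $line = <$fh>) {
    chomp $line;
    my @fields = split ' ', $line;   # step 1: split the line into columns
    $counts{$_}++ for @fields;       # steps 3-7: a missing key is created, an existing one is incremented
}
close $fh;

# Print each category with its running count
print "$_ => $counts{$_}\n" for sort keys %counts;
```

There is no need to copy the array into a hash first (step 2); the counting hash is the only hash required.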

example array

    use Data::Dumper;
    $Data::Dumper::Sortkeys = 1;
    my @my_array = ('a','-2','3','b','0xffff','c','2','b','a','4','a','a','200');
    my %counts;
    $counts{$_}++ for @my_array;
    print Dumper(\%counts);

    $VAR1 = {
        '-2' => 1,
        '0xffff' => 1,
        '2' => 1,
        '200' => 1,
        '3' => 1,
        '4' => 1,
        'a' => 4,
        'b' => 2,
        'c' => 1
    };

Many thanks for any help.

Re: Contextual/categorical Histogram
by Athanasius (Archbishop) on Aug 03, 2014 at 04:24 UTC

    Hello f77coder,

    It seems to me that the code you’ve shown is already doing most of what you need. I’ve tweaked it a bit and added some (admittedly naïve[1]) code to generate the histogram:
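    A minimal sketch of that kind of histogram, reusing the example array from the question. It is an illustration only, not necessarily the code originally posted; the row layout (label, bar of asterisks, count) and the `@rows` variable are assumptions made for the example.

```perl
use strict;
use warnings;

my @my_array = ('a','-2','3','b','0xffff','c','2','b','a','4','a','a','200');

# Count how often each category occurs
my %counts;
$counts{$_}++ for @my_array;

# Build one text row per category: label, one asterisk per occurrence, count
my @rows;
for my $key (sort keys %counts) {
    push @rows, sprintf '%-8s %s (%d)', $key, '*' x $counts{$key}, $counts{$key};
}

print "$_\n" for @rows;
```

    Each bar is exactly as long as its count, so no scaling is applied.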

    OK, I’m fairly sure this isn’t what you wanted, but perhaps by explaining where it falls short you can clarify what you mean by a “contextual/categorical” histogram.

    Anyway, hope it helps,

    [1] Because it doesn’t attempt to scale the output when the frequencies become too large.

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      Many thanks for the help. I'm looking to do a rolling contextual histogram as data arrives. This is like a poor man's data classifier.

      For comparing two histograms, are there fast methods (there are gigabytes of lines) for computing an intersection, xor, or join?

        Performing an intersection, xor (symmetric difference), or join (union) operation on two histograms is fairly straightforward:
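        A minimal sketch of the three operations on two in-memory hashes. The sample histograms `%h1` and `%h2` are hypothetical, and summing the counts in the union is just one possible choice for combining them; this is not necessarily the code originally posted.

```perl
use strict;
use warnings;

# Hypothetical example histograms (category => count)
my %h1 = (a => 4, b => 2, c => 1);
my %h2 = (b => 3, c => 5, d => 7);

# Intersection: categories present in both histograms
my @intersection = sort { $a cmp $b } grep { exists $h2{$_} } keys %h1;

# Symmetric difference (xor): categories present in exactly one histogram
my @only_one = grep { !exists $h2{$_} } keys %h1;
push @only_one, grep { !exists $h1{$_} } keys %h2;
my @sym_diff = sort { $a cmp $b } @only_one;

# Union (join): every category from either histogram, with counts summed
my %union = %h1;
$union{$_} += $h2{$_} for keys %h2;

print "intersection: @intersection\n";
print "xor:          @sym_diff\n";
print "union:        ", join(', ', map { "$_=$union{$_}" } sort keys %union), "\n";
```

        Each operation is a single pass over the keys with constant-time hash lookups, so the cost is linear in the number of distinct categories rather than in the number of input lines.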

        However, it is doubtful that this approach will scale to accommodate hashes containing gigabytes of data. For that scenario, you should probably be looking to use a database.
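        As a rough illustration of keeping the counts on disk instead of in memory, the sketch below ties the counting hash to an SDBM file using the SDBM_File module bundled with Perl. This is a swapped-in stand-in, not a real database: SDBM has tight limits on record size and is not tuned for very large data, so a production setup would more likely use a proper database through DBI. The file name and sample tokens are hypothetical.

```perl
use strict;
use warnings;
use Fcntl;                       # supplies O_RDWR and O_CREAT
use SDBM_File;
use File::Temp qw(tempdir);

# Hypothetical location for the on-disk histogram
my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/histogram";

# Tie the hash so that every read and write goes to the file
tie my %counts, 'SDBM_File', $path, O_RDWR | O_CREAT, 0666
    or die "Cannot tie SDBM file: $!";

# The counting idiom is unchanged; only the storage differs
$counts{$_}++ for qw(a b a c a);

# Copy the results out before releasing the file
my %snapshot = %counts;
untie %counts;

print "$_ => $snapshot{$_}\n" for sort keys %snapshot;
```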

        Hope that helps,

        Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,