comment on

Okay. I admit it. My attempt to generalise petruchio's snippet was crass, ill-thought through and wrong. However, as much as my adaption of the routine itself was bad, it was also wrong because the original snippet displays the same inherent limitation as every other solution presented here, Namely, that none of them deal correctly with the situation of there being more than one element with equal highest frequency!

For an example of this, try modifying your benchmark code to contain this line

my @data_small  = qw( bob tom );

and you'll see what I mean. The S::F module fails as you'll see from the fifth line of the results below. The msf_* routine are based upon my own attempt at the module. The code for which is included, along with then benchmark and full results below.

C:\test>235765
Benchmark: timing 1 iterations of msf_small, sf_small...
 msf_small:  0 wallclock secs ( 0.00 usr +  0.00 sys =  0.00 CPU)

sf broken at C:\test\235765.pl line 25.

  sf_small:  0 wallclock secs ( 0.00 usr +  0.00 sys =  0.00 CPU)

                        Rate  sf_small msf_small
sf_small  1000000000000000/s        --        0%
msf_small 1000000000000000/s        0%        --
Benchmark: running msf_bigger, sf_bigger, each for at least 1 CPU seco
+nds ...

msf_bigger:  1 wallclock secs ( 1.04 usr +  0.00 sys =  1.04 CPU) @ 13
+4.49/s (n=140)
 sf_bigger:  1 wallclock secs ( 1.09 usr +  0.00 sys =  1.09 CPU) @ 51
+.33/s (n=56)

             Rate  sf_bigger msf_bigger
sf_bigger  51.3/s         --       -62%
msf_bigger  134/s       162%         --
[download]

By way of atonement for my sins, I'll first explain some of the reasons why I was dismayed by what I saw in the module's source, which to be fair to the auther, is versioned as 0.03. And as it's always easy to snipe from the sidelines at other peoples code, I'll also post my attempt so that I may suffer the same fate of peer review as I am about to subject the author to.

First I encountered this line in the elements() method

    exists $self->{data} ? keys %{$self->{data}} : wantarray ? () : un
+def;
[download]

which I think is better coded as

     return unless exists $self->{data};
     keys %{$self->{data}};
[download]

Not wrong exactly, but strange to my eyes none the less.

Then there is this line in the add_data() method

if ($ref eq ref $self) {

whereas the equivalent line in the remove_data() method is coded as

if ($ref && $data->isa(ref $self)) {

Then there are two occurance of this dubious construction

    if (...) {
        ...
    } if (...) {
        ...
    } elsif (...) {
        ...
    } else {
        ...
    }
[download]

Which if it isn't just plain wrong, it is certainly deceptive.

Then I encountered this piece of code

    my $min;
    my $max = $min = exists $self->{data} ? $self->{data}->{each %{$se
+lf->{data}}} : undef;
[download]

Which seems to be a strange way of setting $min and $max to a random value.

Anyway here's my attempt at basically the same module. I changed the names a bit as mentioning frequency on every method call in a module call Statistics::Frequency seemed redundant, though that's obviously the author style choice.

I also omitted stuff (like the callback routine) which do not seem to be completed yet.

package My::Stats::Freq;
use strict;
use constant MAXINT        => 2**32; #! Could be bigger. Must be defin
+ed somewhere?
use constant CLASS        => 0;
use constant THIS        => 0;
use constant ELEMENT    => 1;
use constant FREQUENCY    => 1;

my (%selves);

sub new{
    my $self = bless \rand, shift;
    $self->clear();
    $self->add( @_ );
    return $self;
}

sub clear{ $selves{$_[THIS]} = [{}, 0, MAXINT, 0]; return; };

use constant DATA    => 0;
use constant SUM    => 1;
use constant MIN    => 2;
use constant MAX    => 3;

sub add{
    my $self = $selves{+shift};
    return $self->[SUM] unless @_;
    my $ref = ref $_[0];
    if ($ref eq 'HASH') {
        for my $key (keys %{$_[0]}) {
            $self->[SUM]        += $_[0]->{$key};
            $self->[DATA]{$key} += $_[0]->{$key};
        }
    }
    elsif ($ref eq 'ARRAY') {
        $self->[DATA]{$_}++ for(@{$_[0]});
    }
    else {
        $self->[DATA]{$_}++ for(@_);
    }
    _stats($self);
    return $self->[SUM];
}

sub _stats {
    my $self = shift;
    @{$self}[SUM, MIN, MAX] = (0, MAXINT, 0);
    for(values %{$self->[DATA]}) {
        $self->[SUM] += $_;
        $self->[MAX] < $_ and $self->[MAX] = $_;
        $self->[MIN] > $_ and $self->[MIN] = $_;
    }
}

sub remove{
    my $self = $selves{+shift};
    return $self->[SUM] unless @_;
    while(my $next= shift) {
        my $ref = ref $next;
        push(@_, @{$next}), next if $ref eq 'ARRAY';
        if ($ref eq 'HASH') {
            for (keys %{$next}) {
                $self->[DATA]{$_} -= $next->{$_};
                $self->[SUM]      -= $next->{$_};
            }
            next;
        }
        next unless $self->[DATA]{$next} > 0;
        $self->[SUM]--;
        $self->[DATA]{$next}--;
    }
    _stats($self);
    $self->[SUM];
}

sub remove_elements{
    my $self = $selves{$_[THIS]};
    $self->[SUM] -= delete $self->[DATA]{$_} for @_[1..$#_];
    ($self->[MAX], $self->[MIN]) = (0, MAXINT);
    _stats();
    $self->[SUM];
}

sub max{ $selves{$_[THIS]}[MAX]; }
sub min{ $selves{$_[THIS]}[MIN]; }
sub sum{ $selves{$_[THIS]}[SUM]; }

sub elements{    keys   %{ $selves{$_[THIS]}[DATA] }; }
sub frequencies{ values %{ $selves{$_[THIS]}[DATA] }; }
sub as_hash{            %{ $selves{$_[THIS]}[DATA] }; }

sub frequency{ $selves{$_[THIS]}[DATA]{$_[ELEMENT]}; }

sub by_frequency{
    grep{ $selves{$_[THIS]}[DATA]{$_} == $_[FREQUENCY]; } $_[THIS]->el
+ements();
}

sub proportion{
    my $self = $selves{$_[THIS]};
    return unless exists $self->[DATA]{$_[ELEMENT]};
    $self->[DATA]{$_[ELEMENT]} / $self->[SUM];
}

sub proportions{
    my $self = $selves{$_[THIS]};
    map{  $_ => $self->[DATA]{$_} / $self->[SUM] } $_[THIS]->elements(
+);
}

#! Should consider what happens if we are subclassed?
sub clone{ __PACKAGE__->new( { $_[THIS]->as_hash() } ) };


return 1 if caller();

package main;
local $\=$/, $, = $" = ', '; #!"

my $msf = My::Stats::Freq->new( map{ chr 65+rand 26 } 1..100); #! 100 
+elements
print $msf->max(), $msf->min(), $msf->sum();
print scalar $msf->elements, $msf->elements();
print scalar $msf->frequencies, $msf->frequencies();

my %hash = $msf->as_hash();
{local $\; while(my ($k,$v)=each%hash){ print "$k=>$v,"; } }; print'';

print $msf->by_frequency( $msf->max() );

#! add 152 elements or maybe less if the random generator didn't it th
+em all.
print $msf->add( [$msf->elements()] );
print $msf->add( $msf->elements() );
print $msf->add( \%hash );
print scalar $msf->elements, $msf->elements();
print scalar $msf->frequencies, $msf->frequencies();
print $msf->min(), $msf->max(), $msf->sum();
print $msf->by_frequency( $msf->max() );

#! remove them again
print $msf->remove( [$msf->elements()] );
print $msf->remove( $msf->elements() );
print $msf->remove( \%hash );
print $msf->elements();
print $msf->frequencies();
print $msf->min(), $msf->max(), $msf->sum();
print $msf->by_frequency( $msf->max() );

#! can we clone? Is the test sufficient?
my $clone = $msf->clone();
my ($msf_stringy, $clone_stringy) = ("@{[ $msf->as_hash() ]}", "@{[ $c
+lone->as_hash() ]}");
print $msf_stringy eq $clone_stringy
    ? 'Cloned OK'
    : 'Clone failed' . $/ . $msf_stringy . $/ .$clone_stringy;

#! Test the proportion stuff though this can easily be derived by the 
+user?
print $msf->proportion( $msf->by_frequency( $msf->max() ) );
print $msf->proportions();

#! remove all the elements and check the stats.
$msf->remove_elements( $msf->elements );
print $msf->min(), $msf->max(), $msf->sum();
print $msf->elements;
print $msf->frequencies;
[download]

This is just a first pass attempt that has undergone very limited testing and is in no way ready for production use, but it at least gives the author of Statistics:Frequency the opportunity to crtique back should he choose to. Or anyone else for that matter.

I haven't attempted to pod it as I doubt it will ever see light of day beyond this posting, but I have included my test code in the module, which should serve as a pointer to how I intend it would be used.

Full code and benchmark

#! perl -slw
use strict;
use Statistics::Frequency;
use My::Stats::Freq;
use Data::Dumper;
use Benchmark qw( cmpthese );

my @data_small  = qw( bob tom );

my @data_bigger = (qw[bob bob bob tom sally jim bob bob bob tom sally 
+jim]) x 100;

cmpthese( 1, {
    msf_small   => \&msf_small,
    sf_small  => \&sf_small,
});

cmpthese( -1, {
    msf_bigger  => \&msf_bigger,
    sf_bigger => \&sf_bigger,
});

sub sf_small {
    my $f = Statistics::Frequency->new( @data_small );
    my %f = reverse $f->frequencies;
    warn "sf broken" unless $f{$f->frequencies_max} eq 'bob';
}

sub sf_bigger {
    my $f = Statistics::Frequency->new( @data_bigger );
    my %f = reverse $f->frequencies;
    warn "sf broken" unless $f{$f->frequencies_max} eq 'bob';
}

sub msf_small {
    my $f = My::Stats::Freq->new( @data_small );
    my @max = $f->by_frequency( $f->max() );
    warn "msf broken" unless "@max" eq 'tom bob';
}

sub msf_bigger {
    my $f = My::Stats::Freq->new( @data_bigger );
    my ($max) = $f->by_frequency( $f->max() );
    warn "msf broken" unless $max eq 'bob';
}
[download]

Examine what is said, not who speaks.

The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

In reply to Re:^4 Most frequent element in an array. (Atonement) by BrowserUk
in thread Most frequent element in an array. by BrowserUk

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.