in reply to •Re: Most frequent element in an array.
in thread Most frequent element in an array.

Agreed. Though the use of the ST wasn't to reduce the cost of sorting, just a mechanism of grouping the key/value pairs, so that I could sort the values and retain the associating with the respective key.

A linear search is all that is needed. That said, petruchio gave an alternative in the CB which I had to look at several times to understand, but which I think is particularly neat.

my @n; $n[$_{$_}] = $_ for map{$_{$_}++; $_} @list; print "Most frequent: $n[-1]";

I hope he'll forgive me for pushing this one step further with this sub which I have added to my personal utilities module.

sub most_frequent{ local *_=*_; $_[$_{$_}] = $_ for map{$_{$_}++; $_} +@_; $_[-1]; }

Which goes along way to providing, and could easliy be extended to provide most if not all of the function available in the Statistics::Frequency module I saw mentioned, without the overhead of the 50 or so lines of inefficient and frankly rather pedestrian code that make it up.

I find it incredulous that the author implemented a complete function and a nested loop to determine the "sum of the frequencies", which unless I am just too tired, amounts to the size of the list or array?

Just goes to show that you have to read the source before blythly accepting the merit of any given module. Just being a part of CPAN isn't of itself enough to ensure any sort of quality.


Examine what is said, not who speaks.

The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

Replies are listed 'Best First'.
Re: Re: •Re: Most frequent element in an array.
by dug (Chaplain) on Feb 16, 2003 at 18:46 UTC
    my @n; $n[$_{$_}] = $_ for map{$_{$_}++; $_} @list; print "Most frequent: $n[-1]";

    That is a nice little snippet.

    I hope he'll forgive me for pushing this one step further with this sub which I have added to my personal utilities module.
    sub most_frequent{ local *_=*_; $_[$_{$_}] = $_ for map{$_{$_}++; $_} +@_; $_[-1]; }

    I don't think that does what you want it to do. It only returns the most frequent element if the frequency is greater than or equal to the last index of the array. For instance, if pass that function the list  qw( 1 1 2 3 );, $_[2] is set to 1, and $_[1] is set to 2 then 3. But $_[3] remains 3, and your code will return it.

    Which goes along way to providing, and could easliy be extended to provide most if not all of the function available in the Statistics::Frequency module I saw mentioned, without the overhead of the 50 or so lines of inefficient and frankly rather pedestrian code that make it up.

    This inefficient and pedestrian code you speak of is much more efficient than the broken code you posted. First I tried a one element list so your code couldn't break. Then I disregarded the fact that it breaks, and tried a slightly larger list. The results look good for Statistics::Frequency.

    #!/usr/bin/perl use warnings; use strict; $|++; use Statistics::Frequency; use Benchmark qw( cmpthese ); my @data_small = qw( bob ); my @data_bigger = qw( bob bob bob tom sally jim bob bob bob tom sally +jim bob bob bob tom sally jim bob bob bob tom sally +jim bob bob bob tom sally jim bob bob bob tom sally +jim bob bob bob tom sally jim bob bob bob tom sally +jim bob bob bob tom sally jim bob bob bob tom sally +jim bob bob bob tom sally jim bob bob bob tom sally +jim bob bob bob tom sally jim bob bob bob tom sally +jim bob bob bob tom sally jim bob bob bob tom sally +jim ); cmpthese( 10_000, { mf_small => \&mf_small, sf_small => \&sf_small, } ); cmpthese( 2500, { mf_bigger => \&mf_bigger, sf_bigger => \&sf_bigger, } ); sub sf_small { my $f = Statistics::Frequency->new( @data_small ); my %f = reverse $f->frequencies; die "sf broken" unless $f{$f->frequencies_max} eq 'bob'; } sub sf_bigger { my $f = Statistics::Frequency->new( @data_bigger ); my %f = reverse $f->frequencies; die "sf broken" unless $f{$f->frequencies_max} eq 'bob'; } sub mf_small { my $f = most_frequent( @data_small ); die "mf broken" unless $f eq 'bob'; } sub mf_bigger { my $f = most_frequent( @data_bigger ); #die "mf broken" unless $f eq 'bob'; } sub most_frequent{ local *_=*_; $_[$_{$_}] = $_ for map{$_{$_}++; $_} @_; $_[-1]; } Benchmark: timing 10000 iterations of mf_small, sf_small... mf_small: 4 wallclock secs ( 2.56 usr + 0.54 sys = 3.10 CPU) @ 32 +25.81/s ( n=10000) sf_small: 1 wallclock secs ( 0.71 usr + 0.13 sys = 0.84 CPU) @ 11 +904.76/s ( n=10000) Rate mf_small sf_small mf_small 3226/s -- -73% sf_small 11905/s 269% -- Benchmark: timing 2500 iterations of mf_bigger, sf_bigger... mf_bigger: 23 wallclock secs (12.17 usr + 10.49 sys = 22.66 CPU) @ 1 +10.33/s (n= 2500) sf_bigger: 1 wallclock secs ( 1.11 usr + 0.14 sys = 1.25 CPU) @ 2 +000.00/s (n =2500) Rate mf_bigger sf_bigger mf_bigger 110/s -- -94% sf_bigger 2000/s 1713% --
    I find it incredulous that the author implemented a complete function and a nested loop to determine the "sum of the frequencies", which unless I am just too tired, amounts to the size of the list or array?

    I assume that you mean that you are incredulous, or that you find it incredible.

    Just goes to show that you have to read the source before blythly accepting the merit of any given module. Just being a part of CPAN isn't of itself enough to ensure any sort of quality.

    Yeah, especially if they are written by the CPAN's master librarian and co-author of Mastering Algorithms with Perl. In fact, I think I should distrust anything released by that author. Guess I better go downgrade my Perl.

    -- dug

      Okay. I admit it. My attempt to generalise petruchio's snippet was crass, ill-thought through and wrong. However, as much as my adaption of the routine itself was bad, it was also wrong because the original snippet displays the same inherent limitation as every other solution presented here, Namely, that none of them deal correctly with the situation of there being more than one element with equal highest frequency!

      For an example of this, try modifying your benchmark code to contain this line

      my @data_small  = qw( bob tom );
      and you'll see what I mean. The S::F module fails as you'll see from the fifth line of the results below. The msf_* routine are based upon my own attempt at the module. The code for which is included, along with then benchmark and full results below.

      C:\test>235765 Benchmark: timing 1 iterations of msf_small, sf_small... msf_small: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU) sf broken at C:\test\235765.pl line 25. sf_small: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU) Rate sf_small msf_small sf_small 1000000000000000/s -- 0% msf_small 1000000000000000/s 0% -- Benchmark: running msf_bigger, sf_bigger, each for at least 1 CPU seco +nds ... msf_bigger: 1 wallclock secs ( 1.04 usr + 0.00 sys = 1.04 CPU) @ 13 +4.49/s (n=140) sf_bigger: 1 wallclock secs ( 1.09 usr + 0.00 sys = 1.09 CPU) @ 51 +.33/s (n=56) Rate sf_bigger msf_bigger sf_bigger 51.3/s -- -62% msf_bigger 134/s 162% --

      By way of atonement for my sins, I'll first explain some of the reasons why I was dismayed by what I saw in the module's source, which to be fair to the auther, is versioned as 0.03. And as it's always easy to snipe from the sidelines at other peoples code, I'll also post my attempt so that I may suffer the same fate of peer review as I am about to subject the author to.

      First I encountered this line in the elements() method

      exists $self->{data} ? keys %{$self->{data}} : wantarray ? () : un +def;

      which I think is better coded as

      return unless exists $self->{data}; keys %{$self->{data}};

      Not wrong exactly, but strange to my eyes none the less.

      Then there is this line in the add_data() method

             if ($ref eq ref $self) {

      whereas the equivalent line in the remove_data() method is coded as

             if ($ref && $data->isa(ref $self)) {

      Then there are two occurance of this dubious construction

      if (...) { ... } if (...) { ... } elsif (...) { ... } else { ... }

      Which if it isn't just plain wrong, it is certainly deceptive.

      Then I encountered this piece of code

      my $min; my $max = $min = exists $self->{data} ? $self->{data}->{each %{$se +lf->{data}}} : undef;

      Which seems to be a strange way of setting $min and $max to a random value.

      Anyway here's my attempt at basically the same module. I changed the names a bit as mentioning frequency on every method call in a module call Statistics::Frequency seemed redundant, though that's obviously the author style choice.

      I also omitted stuff (like the callback routine) which do not seem to be completed yet.

      This is just a first pass attempt that has undergone very limited testing and is in no way ready for production use, but it at least gives the author of Statistics:Frequency the opportunity to crtique back should he choose to. Or anyone else for that matter.

      I haven't attempted to pod it as I doubt it will ever see light of day beyond this posting, but I have included my test code in the module, which should serve as a pointer to how I intend it would be used.

      Full code and benchmark


      Examine what is said, not who speaks.

      The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

Re: Re: Most frequent element in an array.
by toma (Vicar) on Feb 16, 2003 at 18:28 UTC
    Thanks for following up on my chatterbox inquiry!

    This is how I tried to use most_frequent().

    my @array= qw ( this that the the other this the off ); print most_frequent(@array),"\n"; sub most_frequent { local *_=*_; $_[$_{$_}] = $_ for map{$_{$_}++; $_} @_; $_[-1]; }
    The result printed is off instead of the expected the. Am I missing something?

    Update: I have benchmarked merlyn's code, Petruchio's CB suggestion, and the Statistics::Frequency approach. The clear winner is merlyn's code. It is faster and uses less memory than the other approaches.

    It should work perfectly the first time! - toma