Inspired by a question in the CB: how to find the most frequent element in an array.
#! perl -slw
use strict;

my @array = map{ int rand 10 } 1 .. 100;

my %counter;
$counter{$_}++ for @array;

my $most_frequent = (
    sort{ $b->[1] <=> $a->[1] }
    map{ [ $_, $counter{$_} ] } keys %counter
)[0]->[0];

print $most_frequent;

•Re: Most frequent element in an array.
by merlyn (Sage) on Feb 16, 2003 at 07:28 UTC
    That's a pretty ineffective use of a Schwartzian Transform, since the "sorting function" is not expensive to compute at all. And, you don't really need a sort, because all you need is the max, and that can be done by a linear search.
    my @items = ( ... );

    my %count;
    $count{$_}++ for @items;

    # presuming at least one item in @items:
    my ($winner, $winner_count) = each %count;
    while (my ($maybe, $maybe_count) = each %count) {
        if ($maybe_count > $winner_count) {
            $winner = $maybe;
            $winner_count = $maybe_count;
        }
    }

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

      Agreed. Though the use of the ST wasn't to reduce the cost of sorting, just a mechanism for grouping the key/value pairs, so that I could sort the values and retain the association with the respective key.
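
      The association between key and count can also be kept without building pairs, by sorting the keys themselves on their counts. A minimal sketch, assuming a %counter hash built as in the original post; the sample data is illustrative only:

```perl
use strict;
use warnings;

# Illustrative data; any list of values would do.
my @array = qw( a b a c a b );

my %counter;
$counter{$_}++ for @array;

# Sort the keys by descending count and keep the first one.
# The comparison looks each count up in the hash, so no
# [key, count] pairs are needed to keep key and count together.
my ($most_frequent) = sort { $counter{$b} <=> $counter{$a} } keys %counter;

print "$most_frequent\n";
```

      This is still a full sort, so the point about a linear search being sufficient applies to it just as much.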

      A linear search is all that is needed. That said, petruchio gave an alternative in the CB which I had to look at several times to understand, but which I think is particularly neat.

      my @n;
      $n[$_{$_}] = $_ for map{$_{$_}++; $_} @list;
      print "Most frequent: $n[-1]";

      I hope he'll forgive me for pushing this one step further with this sub which I have added to my personal utilities module.

      sub most_frequent{
          local *_=*_;
          $_[$_{$_}] = $_ for map{$_{$_}++; $_} @_;
          $_[-1];
      }

      Which goes a long way to providing, and could easily be extended to provide, most if not all of the functionality available in the Statistics::Frequency module I saw mentioned, without the overhead of the 50 or so lines of inefficient and frankly rather pedestrian code that make it up.

      I find it incredulous that the author implemented a complete function and a nested loop to determine the "sum of the frequencies", which unless I am just too tired, amounts to the size of the list or array?
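
      A quick check of that claim for a plain list. This is only a sketch with illustrative data, and it uses sum from List::Util rather than anything from the module under discussion:

```perl
use strict;
use warnings;
use List::Util qw( sum );

# Illustrative data.
my @list = qw( bob bob tom sally bob jim );

my %count;
$count{$_}++ for @list;

# Every element adds exactly 1 to exactly one counter,
# so the counters add up to the number of elements.
my $sum_of_frequencies = sum( values %count );

print "sum of frequencies: $sum_of_frequencies\n";
print "size of list:       ", scalar(@list), "\n";
```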

      Just goes to show that you have to read the source before blithely accepting the merit of any given module. Just being a part of CPAN isn't of itself enough to ensure any sort of quality.


      Examine what is said, not who speaks.

      The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

        my @n;
        $n[$_{$_}] = $_ for map{$_{$_}++; $_} @list;
        print "Most frequent: $n[-1]";

        That is a nice little snippet.

        I hope he'll forgive me for pushing this one step further with this sub which I have added to my personal utilities module.
        sub most_frequent{
            local *_=*_;
            $_[$_{$_}] = $_ for map{$_{$_}++; $_} @_;
            $_[-1];
        }

        I don't think that does what you want it to do. It only returns the most frequent element if its frequency is greater than or equal to the last index of the array. For instance, if you pass that function the list qw( 1 1 2 3 ), $_[2] is set to 1, and $_[1] is set to 2 and then 3. But $_[3] remains 3, and your code will return it.
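
        The failure comes from reusing @_, which already holds the arguments. Below is a sketch of the same index-by-count idea using a fresh array, so that the last slot can only hold an element with the highest count. The name most_frequent_by_count is hypothetical and this is not code from the thread:

```perl
use strict;
use warnings;

# Hypothetical helper: returns one element with the highest count.
# When several elements tie for the highest count, the one that
# appears last in the argument list is returned.
sub most_frequent_by_count {
    my %count;
    $count{$_}++ for @_;

    # A fresh array indexed by count. Its highest index is the
    # highest count, so $by_count[-1] is a most frequent element.
    my @by_count;
    $by_count[ $count{$_} ] = $_ for @_;

    return $by_count[-1];    # undef for an empty list
}

print most_frequent_by_count( qw( 1 1 2 3 ) ), "\n";
```

        It still makes two passes over the arguments and builds an extra array, so a linear scan over the counts remains the simpler approach.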

        Which goes a long way to providing, and could easily be extended to provide, most if not all of the functionality available in the Statistics::Frequency module I saw mentioned, without the overhead of the 50 or so lines of inefficient and frankly rather pedestrian code that make it up.

        This inefficient and pedestrian code you speak of is much more efficient than the broken code you posted. First I tried a one-element list so your code couldn't break. Then I disregarded the fact that it breaks, and tried a slightly larger list. The results look good for Statistics::Frequency.

        #!/usr/bin/perl
        use warnings;
        use strict;
        $|++;

        use Statistics::Frequency;
        use Benchmark qw( cmpthese );

        my @data_small = qw( bob );
        my @data_bigger = qw(
            bob bob bob tom sally jim bob bob bob tom sally jim
            bob bob bob tom sally jim bob bob bob tom sally jim
            bob bob bob tom sally jim bob bob bob tom sally jim
            bob bob bob tom sally jim bob bob bob tom sally jim
            bob bob bob tom sally jim bob bob bob tom sally jim
            bob bob bob tom sally jim bob bob bob tom sally jim
            bob bob bob tom sally jim bob bob bob tom sally jim
            bob bob bob tom sally jim bob bob bob tom sally jim
            bob bob bob tom sally jim bob bob bob tom sally jim
            bob bob bob tom sally jim bob bob bob tom sally jim
        );

        cmpthese( 10_000, {
            mf_small => \&mf_small,
            sf_small => \&sf_small,
        } );

        cmpthese( 2500, {
            mf_bigger => \&mf_bigger,
            sf_bigger => \&sf_bigger,
        } );

        sub sf_small {
            my $f = Statistics::Frequency->new( @data_small );
            my %f = reverse $f->frequencies;
            die "sf broken" unless $f{$f->frequencies_max} eq 'bob';
        }

        sub sf_bigger {
            my $f = Statistics::Frequency->new( @data_bigger );
            my %f = reverse $f->frequencies;
            die "sf broken" unless $f{$f->frequencies_max} eq 'bob';
        }

        sub mf_small {
            my $f = most_frequent( @data_small );
            die "mf broken" unless $f eq 'bob';
        }

        sub mf_bigger {
            my $f = most_frequent( @data_bigger );
            #die "mf broken" unless $f eq 'bob';
        }

        sub most_frequent{
            local *_=*_;
            $_[$_{$_}] = $_ for map{$_{$_}++; $_} @_;
            $_[-1];
        }

        Benchmark: timing 10000 iterations of mf_small, sf_small...
          mf_small:  4 wallclock secs ( 2.56 usr +  0.54 sys =  3.10 CPU) @ 3225.81/s (n=10000)
          sf_small:  1 wallclock secs ( 0.71 usr +  0.13 sys =  0.84 CPU) @ 11904.76/s (n=10000)
                    Rate mf_small sf_small
        mf_small  3226/s       --     -73%
        sf_small 11905/s     269%       --

        Benchmark: timing 2500 iterations of mf_bigger, sf_bigger...
         mf_bigger: 23 wallclock secs (12.17 usr + 10.49 sys = 22.66 CPU) @ 110.33/s (n=2500)
         sf_bigger:  1 wallclock secs ( 1.11 usr +  0.14 sys =  1.25 CPU) @ 2000.00/s (n=2500)
                    Rate mf_bigger sf_bigger
        mf_bigger  110/s        --      -94%
        sf_bigger 2000/s     1713%        --

        I find it incredulous that the author implemented a complete function and a nested loop to determine the "sum of the frequencies", which unless I am just too tired, amounts to the size of the list or array?

        I assume that you mean that you are incredulous, or that you find it incredible.

        Just goes to show that you have to read the source before blithely accepting the merit of any given module. Just being a part of CPAN isn't of itself enough to ensure any sort of quality.

        Yeah, especially if they are written by the CPAN's master librarian and co-author of Mastering Algorithms with Perl. In fact, I think I should distrust anything released by that author. Guess I better go downgrade my Perl.

        -- dug
        Thanks for following up on my chatterbox inquiry!

        This is how I tried to use most_frequent().

        my @array= qw ( this that the the other this the off );
        print most_frequent(@array),"\n";

        sub most_frequent {
            local *_=*_;
            $_[$_{$_}] = $_ for map{$_{$_}++; $_} @_;
            $_[-1];
        }

        The result printed is 'off' instead of the expected 'the'. Am I missing something?

        Update: I have benchmarked merlyn's code, Petruchio's CB suggestion, and the Statistics::Frequency approach. The clear winner is merlyn's code. It is faster and uses less memory than the other approaches.
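
        For anyone who wants to repeat a comparison of this kind, here is a sketch using cmpthese from the core Benchmark module. It compares only a linear scan over the counts with a sort of the keys. The sub names, data and iteration count are illustrative assumptions, not the setup used for the update above, and no timings are claimed here:

```perl
use strict;
use warnings;
use Benchmark qw( cmpthese );

# Illustrative data, generated the same way as in the original post.
my @array = map { int rand 10 } 1 .. 100;

# Linear scan over the counts, in the style of merlyn's reply.
sub by_scan {
    my %count;
    $count{$_}++ for @_;

    my ( $winner, $winner_count ) = each %count;
    while ( my ( $maybe, $maybe_count ) = each %count ) {
        ( $winner, $winner_count ) = ( $maybe, $maybe_count )
            if $maybe_count > $winner_count;
    }
    return $winner;
}

# Sort the keys by descending count and keep the first one.
sub by_sort {
    my %count;
    $count{$_}++ for @_;

    my ($winner) = sort { $count{$b} <=> $count{$a} } keys %count;
    return $winner;
}

# Run each candidate 5000 times and print a comparison table.
cmpthese( 5_000, {
    linear_scan => sub { by_scan(@array) },
    sort_keys   => sub { by_sort(@array) },
} );
```

        When several elements tie for the highest count, the two subs may return different winners, so a correctness check should compare counts rather than the returned elements.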

        It should work perfectly the first time! - toma