Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

Retrieving data with largest count

by Angharad (Pilgrim)
on Jun 02, 2009 at 14:59 UTC ( #767618=perlquestion: print w/replies, xml ) Need Help??

Angharad has asked for the wisdom of the Perl Monks concerning the following question:

I have a text file that looks like this
1ct9B 3 1 1f7uA 2 3 1gaxA 2 1 1gpmA 3 5 1ihoA 6 4 1mopA 6 3 1ileA 2 5 1iq0A 2 3 1vlhB 5 3 1jhdA 5 4 etc ...
The first column contains an 'item' the second column contains a 'group number' and the third item is a count. What I need to do is, for every group, retrieve the item that has the largest count. So, for the data above, the results for each group would be:
1ileA 2 5 1gpmA 3 5 1jhdA 5 4 1ihoA 6 4
I'm sure this is a very simple task, but I've not been able to work out how to do it so far. I've just created lots of hashes and not really got anywhere. All help/advice much appreciated.

Replies are listed 'Best First'.
Re: Retrieving data with largest count
by ikegami (Patriarch) on Jun 02, 2009 at 15:08 UTC
    my %largest; while (<>) { my ($item, $group_id, $count) = split; next if $largest{$group_id} && $largest{$group_id}[2] >= $count; $largest{$group_id} = [ $item, $group_id, $count, $_ ]; } for (values(%largest)) { print($_->[-1]); }

    Keeps first on ties.

    Update: Added note at bottom.

      ikegami,
      I know that it is unlikely in this situation that there would be negative or even nil counts, but this code does the wrong thing in the following situation:
      ASDF 1 0 ASDF 1 -3
      Specifically, it first sets the high water mark to 0 and then to -3. (Originally, I didn't have the group the same - corrected)

      Cheers - L~R

        No it doesn't??? In:
        1ct9B 3 1 1f7uA 2 3 1gaxA 2 1 1gpmA 3 5 1ihoA 6 4 1mopA 6 3 1ileA 2 5 1iq0A 2 3 1vlhB 5 3 1jhdA 5 4 ASDF 1 0 ASDF 1 -3

        Out:

        1ihoA 6 4 ASDF 1 0 1gpmA 3 5 1ileA 2 5 1jhdA 5 4

        0 >= -3, so next is executed, so the record for -3 is ignored. The item with the highest value for group 1 (ASDF 1 0) is printed.

Re: Retrieving data with largest count
by jethro (Monsignor) on Jun 02, 2009 at 15:12 UTC

    Please post your code the next time, often a few small changes suffice to make such a program work.

    ... my %best; while (my $line=<YOURFILE>) { chomp $line; my ($item,$group,count)= split(/\s+/,$line); if ((not exists $best{$group}) or $best{$group}{count}<$count) { $best{$group}{count}=$count; $best{$group}{item}= $item; } } #to print them out foreach my $group ( keys %best) { print "$best{$group}{item} $group $best{$group}{count}\n"; }

    PS: You didn't specify what should happen when two items in a group have the same count.

      Many thanks for the advice so far. Much appreciated. In answer to your question, if two items have the same count, choosing an item at random is fine.
        "choosing an item at random is fine"

        and also the hardest option. All candidates in a group have to be retained until processing the input data is complete, then all groups have to be checked for multiple candidate items and a random item selected.

        If either 'first' or 'last' is also fine then either of those is easier to implement.


        True laziness is hard work
Re: Retrieving data with largest count
by Limbic~Region (Chancellor) on Jun 02, 2009 at 15:11 UTC
    Angharad,
    What you want to use is a high water mark algorithm per item. It would look something like this:
    #!/usr/bin/perl use strict; use warnings; my $input = $ARGV[0] or die "Usage: $0 <input>"; open(my $fh, '<', $input) or die "Unable to open '$input' for reading: + $!"; my %high; while (<$fh>) { chomp; my ($item, $group, $count) = split ' '; $high{$item} = $count if ! defined $high{$item} || $count > $high{ +$item}; } for my $item (keys %high) { print "$item high count was $high{$item}\n"; }

    I have purposely left this code not working exactly as required (group is omitted) to give you something to work towards. If you need more help - remember that the entire line is stored in $_ Also, this assumes that counts will always be numeric and that they will be whitespace delimited. For something much more fancy, have a look at How A Function Becomes Higher Order

    Update: You didn't mention what should happen if an item has more than one group with the same highest count. The algorithm can be modified to handle the first/last/all/random values in that situation but you will need to determine what is right for you.

    Cheers - L~R

      I have purposely left this code not working exactly as required (group is omitted) to give you something to work towards.

      Also, your code finds the highest count per item, not per group. Since each item probably only appears once, you're just printing out the input.

        ikegami,
        Yes, (group is omitted) means I didn't consider group which is what the OP asked for. The was intentional as indicated in the same comment. I originally misread the question and when I discovered my output didn't match the OPs (before posting), I decided to leave it broken. Teach a man to fish so to speak. I guess I should have made that more clear.

        Cheers - L~R

Re: Retrieving data with largest count
by bichonfrise74 (Vicar) on Jun 02, 2009 at 20:22 UTC
    A possible solution?
    #!/usr/bin/perl use strict; my %HoH; while (<DATA>) { my ($item, $group, $count) = split; if ( defined( $HoH{$item}{$group} ) ) { $HoH{$item}{$group} = $count if ( $HoH{$item}{$group} < $count + ); } else { $HoH{$item}{$group} = $count; } } foreach my $item ( keys %HoH ) { foreach my $group ( keys %{ $HoH{$item} } ) { print "$item $group $HoH{$item}->{$group}\n"; } } __DATA__ 1ct9B 3 1 1f7uA 2 3 1gaxA 2 1 1gpmA 3 5 1ihoA 6 4 1mopA 6 3 1ileA 2 5 1iq0A 2 3 1vlhB 5 3 1jhdA 5 4

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://767618]
Approved by marto
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others wandering the Monastery: (4)
As of 2023-02-05 20:24 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    I prefer not to run the latest version of Perl because:







    Results (32 votes). Check out past polls.

    Notices?