knirirr has asked for the wisdom of the Perl Monks concerning the following question:

I'm having trouble with a problem that I hope someone can suggest an answer to. I need to iterate over a list of words and count the number of words that are "the same" based only on letter content. For example, if the word array contains "opt", "top", "pot", "pit" and "tip", then that would give a count of three instances of one "word" containing o, p and t, and two instances of one "word" containing p, i and t. Presumably one way of doing it would be some sort of hash that stores the first occurrence of a unique word as a key with the count as its value, matching new words against all the keys and incrementing the value, but I'm not sure how to go about comparing the new words to the hash keys. Can anyone offer a suggestion, or suggest any other means of doing this?

Replies are listed 'Best First'.
Re: Matching words based on letter content
by friedo (Prior) on Jan 28, 2005 at 14:55 UTC
    What I would do is take each word, split up the letters, and sort them alphabetically. Then you have a hash key which will be the same for "top", "pot", and so on. Here is an example.

    use strict;
    use Data::Dumper;

    my @words = qw/opt top pot pit tip/;
    my %count;

    foreach my $w (@words) {
        my $key = join '', sort split '', $w;
        $count{$key}++;
    }

    print Dumper \%count;

    Output:

    $VAR1 = {
              'opt' => 3,
              'ipt' => 2
            };

    Update: Added the D::D output.

      What I would do is take each word, split up the letters, and sort them alphabetically. Then you have a hash key which will be the same for "top", "pot", and so on. Here is an example.
      D'oh!
      It is rather simple when you think of it that way - thanks.
Re: Matching words based on letter content
by holli (Abbot) on Jan 28, 2005 at 14:55 UTC
    use strict;

    my %h;
    @_ = qw (opt pot top pit ipt);

    for ( @_ ) {
        $h{join "", sort split "", $_}++;
    }

    for ( keys %h ) {
        print "$_ counted $h{$_} times\n";
    }

    holli, regexed monk
Re: Matching words based on letter content
by Anonymous Monk on Jan 28, 2005 at 15:15 UTC
    #!/usr/bin/perl
    use strict;
    use warnings;

    my ($w, @w, %w);

    while (<DATA>) {
        chomp;
        @w = (0) x 26;
        $w[$_]++ for map -0x61 + ord lc, /[a-z]/ig;
        push @{$w{"@w"}}, $_;
    }

    print "@$w\n" while (undef, $w) = each %w;

    __DATA__
    opt
    top
    pot
    pit
    stoop
    topos

    pit
    opt top pot
    stoop topos
      mmmh, won't work with Unicode ;=)
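One possible way around the Unicode limitation is to keep the per-letter counts in a hash instead of a fixed 26-slot array. The following is only a sketch: `letter_signature` is a hypothetical helper name, not something from the thread, and it assumes the words are already decoded character strings (for example, read through an `:encoding(UTF-8)` layer).

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): build a key from per-letter
# counts held in a hash, so no fixed 26-slot array is needed.
sub letter_signature {
    my ($word) = @_;
    my %n;
    $n{ lc $_ }++ for $word =~ /\p{L}/g;    # count letters, ignoring case
    return join '', map { "$_$n{$_}" } sort keys %n;
}

# Group the words by signature, as the array-based version does.
my %groups;
push @{ $groups{ letter_signature($_) } }, $_
    for qw(opt top pot pit stoop topos);

print "$_: @{ $groups{$_} }\n" for sort keys %groups;
```

Because the key is made of letter/count pairs, "stoop" and "topos" share a key while "stop" gets a different one, and the key stays short even for long strings.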
Re: Matching words based on letter content
by dragonchild (Archbishop) on Jan 28, 2005 at 15:00 UTC
    This is anagrams done sideways. Sounds like homework to me.

    Something to consider - are "stoop" and "stop" considered the same? If they are, then the solutions by holli and friedo won't work.

    Being right, does not endow the right to be rude; politeness costs nothing.
    Being unknowing, is not the same as being stupid.
    Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
    Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

      @_ = qw (opt pot top stoop stop pit ipt);
      it will print
      opst counted 1 times
      oopst counted 1 times
      opt counted 3 times
      ipt counted 2 times
      Update: you're right, I misread that. But this does it:

      use strict;

      my %h;
      @_ = qw (opt pot top stoop stop pit ipt);

      for ( @_ ) {
          my $last;
          $h{join "", grep {
              if ( $_ eq $last ) { "" }
              else { $last = $_; $_ }
          } sort split "", $_}++;
      }

      for ( keys %h ) {
          print "$_ counted $h{$_} times\n";
      }
      prints:
      opst counted 2 times
      opt counted 3 times
      ipt counted 2 times

      holli, regexed monk
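For the case where repeated letters should be ignored ("stoop" and "stop" counted together), the duplicates could also be dropped with a `%seen` hash before sorting. This is a minimal sketch of that alternative, not the code posted above; `letter_set_key` is a made-up name used only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: key on the set of distinct letters in the word,
# so repeated letters do not affect the key.
sub letter_set_key {
    my ($word) = @_;
    my %seen;
    return join '', sort grep { !$seen{$_}++ } split //, $word;
}

my %h;
$h{ letter_set_key($_) }++ for qw(opt pot top stoop stop pit ipt);

print "$_ counted $h{$_} times\n" for sort keys %h;
```

Filtering before the sort means the comparison against a previous letter is not needed, and the code stays quiet under `use warnings`.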
      It is most definitely not homework. It is in order to list total base count in a load of amino acids I've got - it's too long unless I compress them by content and forget about the order of bases within the string. 'Stoop' and 'stop' are therefore different.