artist has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'd like to group similar items together. Each item is a pair of words. A pair may appear in reverse order. Items that differ only by small spelling mistakes are considered similar.

Input:

A 	B
Cow 	Phone
Speaker Phone
Hello 	World
Phon	Speaker
Clock	Phon
Torld 	Hello
Hello 	Worlx
=================
Output should look like:

Clock	Phon
Cow	Phone

Hello	World	M
Hello	Worlx	M
Torld	Hello	M

Phon	Speaker	M
Speaker	Phone	M

The first task is to build the data structure; the second is to apply the sort order. Thanks for your ideas and suggestions.

artist

Replies are listed 'Best First'.
Re: Order Pairs
by gjb (Vicar) on Jul 01, 2003 at 15:08 UTC

    To find related words, you'll probably want to have a look at String::Approx and/or Text::Levenshtein. For the data structure I'd use a set of sets (or a HoH) containing lists of two elements.

    The hardest part will be deciding what counts as a group of "similar" strings, i.e. choosing the cut-offs for the distance functions.

    Just my 2 cents, -gjb-

    Update: Text::WagnerFischer could be useful too.
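
A minimal sketch of the cut-off idea, assuming a plain Levenshtein distance written out in pure Perl rather than one of the CPAN modules named above. The `$CUTOFF` value of 1 and the helper names `levenshtein` and `similar_word` are illustrative assumptions, not part of any module's API.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Dynamic-programming Levenshtein distance, keeping only two rows.
sub levenshtein {
    my ( $s, $t ) = @_;
    my @prev = ( 0 .. length($t) );
    for my $i ( 1 .. length($s) ) {
        my @cur = ($i);
        for my $j ( 1 .. length($t) ) {
            my $cost = substr( $s, $i - 1, 1 ) eq substr( $t, $j - 1, 1 ) ? 0 : 1;
            push @cur, min(
                $prev[$j] + 1,            # deletion
                $cur[ $j - 1 ] + 1,       # insertion
                $prev[ $j - 1 ] + $cost,  # substitution (or match)
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Assumed cut-off: at most one edit counts as a spelling mistake.
my $CUTOFF = 1;

sub similar_word {
    my ( $x, $y ) = @_;
    return levenshtein( lc($x), lc($y) ) <= $CUTOFF;
}

for my $pair ( [qw(World Worlx)], [qw(World Torld)], [qw(Phone Phon)], [qw(Cow Clock)] ) {
    my ( $x, $y ) = @$pair;
    printf "%-6s %-6s distance %d, %s\n", $x, $y, levenshtein( $x, $y ),
        similar_word( $x, $y ) ? 'similar' : 'different';
}
```

A cut-off of 1 keeps World, Worlx and Torld together while leaving Cow and Clock apart. A larger value, or one scaled by word length, would merge more words, which is exactly the trade-off described above.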

Re: Order Pairs
by fglock (Vicar) on Jul 01, 2003 at 15:24 UTC
    use Text::Soundex 'soundex';

    my @words = qw( Cow Phone Speaker Phone Hello World Phon Speaker
                    Clock Phon Torld Hello Hello Worlx );

    # Split the flat word list into two-element pairs.
    my @pairs;
    while ( @words ) {
        push @pairs, [ shift @words, shift @words ];
    }

    # Schwartzian transform: the sort key is the pair's two soundex codes,
    # themselves sorted so that a reversed pair gets the same key.
    @pairs =
        map  { $_->[1] }
        sort { $a->[0] cmp $b->[0] }
        map  { [ join( '', sort { $a cmp $b } soundex( $_->[0], $_->[1] ) ), $_ ] }
        @pairs;

    print "@$_\n" for @pairs;

    output:

    Cow Phone
    Clock Phon
    Torld Hello
    Hello World
    Hello Worlx
    Speaker Phone
    Phon Speaker

    Update: I've chosen soundex just because it was very easy to implement.
    Other string comparison methods might give better/more exact results.

      Soundex is meant to assist manual matching of English last names. Metaphone is supposed to be somewhat nicer for that task and is implemented in Text::DoubleMetaphone and Text::Metaphone. But this problem shouldn't be solved with an English last-name approximation system. Others have suggested an edit distance metric, as in String::Approx.
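
One way the edit-distance suggestion could be applied to the original pairs is sketched below. It assumes a hand-written Levenshtein distance instead of a CPAN module, a cut-off of one edit, and greedy grouping (a pair joins the first group that already holds a similar pair). The names `levenshtein`, `words_match`, `pairs_match` and `group_pairs` are hypothetical helpers made up for this sketch.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Dynamic-programming Levenshtein distance, keeping only two rows.
sub levenshtein {
    my ( $s, $t ) = @_;
    my @prev = ( 0 .. length($t) );
    for my $i ( 1 .. length($s) ) {
        my @cur = ($i);
        for my $j ( 1 .. length($t) ) {
            my $cost = substr( $s, $i - 1, 1 ) eq substr( $t, $j - 1, 1 ) ? 0 : 1;
            push @cur, min( $prev[$j] + 1, $cur[ $j - 1 ] + 1, $prev[ $j - 1 ] + $cost );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

my $CUTOFF = 1;    # assumed: one edit counts as a spelling mistake

sub words_match {
    my ( $x, $y ) = @_;
    return levenshtein( lc($x), lc($y) ) <= $CUTOFF;
}

# Two pairs are similar if their words match in the same or in reversed order.
sub pairs_match {
    my ( $p, $q ) = @_;
    my $same    = words_match( $p->[0], $q->[0] ) && words_match( $p->[1], $q->[1] );
    my $swapped = words_match( $p->[0], $q->[1] ) && words_match( $p->[1], $q->[0] );
    return $same || $swapped;
}

# Greedy grouping: add each pair to the first group containing a similar pair.
sub group_pairs {
    my @groups;
    PAIR: for my $pair (@_) {
        for my $group (@groups) {
            if ( grep { pairs_match( $pair, $_ ) } @$group ) {
                push @$group, $pair;
                next PAIR;
            }
        }
        push @groups, [$pair];
    }
    return @groups;
}

my @pairs = (
    [qw(Cow Phone)],    [qw(Speaker Phone)], [qw(Hello World)], [qw(Phon Speaker)],
    [qw(Clock Phon)],   [qw(Torld Hello)],   [qw(Hello Worlx)],
);

# Sort members inside each group, then order groups by their first member.
my @groups =
    sort { "@{ $a->[0] }" cmp "@{ $b->[0] }" }
    map  { [ sort { "@$a" cmp "@$b" } @$_ ] }
    group_pairs(@pairs);

# Mark every pair that shares its group with at least one other pair.
for my $group (@groups) {
    my $mark = @$group > 1 ? "\tM" : '';
    print join( "\t", @$_ ), $mark, "\n" for @$group;
}
```

The array-of-arrays here plays the role of the "set of sets" mentioned earlier in the thread. Because grouping is greedy, the result can depend on input order for borderline cases, so the cut-off still needs tuning against real data.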