String similarity extraction

Braindead_One has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks!
I'm currently writing many online-gaming related scripts and thought it would be nice to have a routine that extracts clantags out of an array of playernames.

Clantags normally consist of 2 or more characters which are (mostly) at the beginning or end of a name but could also be somewhere in the middle.

Normally all players of one team should have the tag in their names, so I basically have to find the tags by looking for what the names have in common and extracting it.

My problem is: I have no real idea how to do that ;)
I already searched CPAN and found modules like String::Similarity and Algorithm::Diff, but none of them seems to help with my problem, since I have to match more than 2 strings in order to find the right tag (some players might be wearing no tag or a different one).

I already thought of splitting the string into substrings (with String::Substrings) and comparing the resulting arrays but that seems to be the solution with the most overhead.

I hope one of you can point me to a more efficient solution.

Some example player names could be:
jP|Azrael
jP|Blade
jP|Henry
(Clantag: jP|)

Jeff.ocr
pr!me.ocr
Lokren.ocr
(Clantag: .ocr)
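
The substring-comparison idea mentioned above could be sketched roughly as follows. This is only an illustration, not code from the thread: it enumerates substrings in plain Perl rather than through String::Substrings, and the name guess_tag and its minimum-length parameter are made up for the example. It counts how many names contain each substring and picks the most widespread one, so a player without the tag does not break the result.

```perl
use strict;
use warnings;

# Count, for every substring of at least $min_len characters, how many
# names contain it, then return the substring found in the most names
# (ties go to the longer substring, then to the alphabetically first).
# guess_tag and $min_len are hypothetical names for this sketch.
sub guess_tag {
    my ($min_len, @names) = @_;
    my %count;
    for my $name (@names) {
        my %seen;    # count each substring only once per name
        for my $start (0 .. length($name) - $min_len) {
            for my $len ($min_len .. length($name) - $start) {
                $seen{ substr($name, $start, $len) } = 1;
            }
        }
        $count{$_}++ for keys %seen;
    }
    my ($best) = sort {
        $count{$b} <=> $count{$a}
            or length($b) <=> length($a)
            or $a cmp $b
    } keys %count;
    return $best;    # undef if no name is long enough
}

print guess_tag(2, 'jP|Azrael', 'jP|Blade', 'jP|Henry', 'Stranger'), "\n";
print guess_tag(2, 'Jeff.ocr', 'pr!me.ocr', 'Lokren.ocr'), "\n";
```

The work grows with the square of each name's length, which is the overhead the question worries about, but player names are short enough that this is unlikely to matter in practice.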

Thanks in advance,
Braindead_One


Re: String similarity extraction
by Hofmator (Curate) on Jan 11, 2003 at 17:11 UTC

The following code should do what you want, provided that the tag always starts at the same character position in every name, counted either from the beginning or from the end (see the DATA section for what I mean by that).

Instead of just creating the simple pairs you might be able to shift the player names characterwise against each other. Then you might get to a solution that doesn't contain the above stated restriction.

use strict;
use warnings;
use List::Util qw/reduce/;

sub _extract {
    my @names = @_;
    my @pairs;

# create all xor'd pairs
    for(my $i=0; $i<@names; $i++) {
        for (my $j=$i+1; $j<@names; $j++) {
            push @pairs, $names[$i] ^ $names[$j];
        }
    }

    no warnings 'once';
    my $or = reduce { $a | $b } @pairs;

    if ( $or =~ /\0+/ ) {
        my $index = $-[0];
        my $length = $+[0] - $-[0];
        return substr $names[0], $index, $length;
    } else { # match not successful, so return undef
        return;
    }
}

sub extract_tag {
    my @names = @_;
    my $tag;

    $tag = _extract(@names);

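# try matching with reversed @names
# this finds common substrs at the end;
# $tag stays undef if that attempt fails too
    unless (defined $tag) {
        my $reversed = _extract(map {scalar reverse $_} @names);
        $tag = scalar reverse $reversed if defined $reversed;
    }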

    return $tag;
}


my @names;
local $, = ':';
local $\ = "\n";

while (<DATA>) {
    chomp;
    print(extract_tag(@names)), @names = (), next unless /\S/;
    push @names, $_;
}
print extract_tag(@names);

__DATA__
jP|Azrael
jP|Blade
jP|Henry

Jeff.ocr
pr!me.ocr
Lokren.ocr

woRUTtan
hiRUTfango
biRUTff

salTAGo
blasTAGi
RipTAGu

-- Hofmator
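
To make the XOR step in the code above easier to follow, here is a small stand-alone sketch (the helper name common_runs is invented for this illustration). Perl's string XOR turns every pair of equal bytes into "\0", so each run of NULs marks a stretch where two names agree at the same offsets.

```perl
use strict;
use warnings;

# List the stretches where two strings agree at the same offsets.
# String XOR yields "\0" wherever the bytes are equal, so each run of
# NULs in the result marks one such stretch.
# common_runs is a hypothetical helper, not part of the original post.
sub common_runs {
    my ($x, $y) = @_;
    my $xor = $x ^ $y;
    my @runs;
    while ($xor =~ /\0+/g) {
        push @runs, substr($x, $-[0], $+[0] - $-[0]);
    }
    return @runs;
}

print join(', ', common_runs('jP|Azrael', 'jP|Blade')), "\n";   # jP|, e
print join(', ', common_runs('jP|Azrael', 'jP|Henry')), "\n";   # jP|
```

The first pair also shares an "e" at offset 7 by coincidence. OR-ing the XOR results of all pairs, as the code above does, removes such accidental matches as soon as one pair disagrees at that offset.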


Re: Re: String similarity extraction

by Braindead_One (Monk) on Jan 11, 2003 at 19:08 UTC


Re: String similarity extraction
by Zaxo (Archbishop) on Jan 11, 2003 at 16:49 UTC

The examples you give use a non-word character as a delimiter. Is that always the case? If so,

my @nameparts = split /(\W)/, $name;
my $clantag = exists $clanhash{$nameparts[0]}
              ? $nameparts[0] . $nameparts[1]
              : $nameparts[1] . $nameparts[2];
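
A runnable version of the snippet above might look like the sketch below. The post does not say how %clanhash is filled, so the table here is purely an assumption (a hash of clan names known to be worn as a prefix), and the wrapper name tag_by_delimiter is invented for the example.

```perl
use strict;
use warnings;

# Hypothetical lookup table of clan names that are worn as a prefix;
# the original post does not say how %clanhash is filled.
my %clanhash = (jP => 1);

sub tag_by_delimiter {
    my ($name) = @_;
    # The capturing group makes split return the delimiters as well,
    # e.g. 'jP|Azrael' becomes ('jP', '|', 'Azrael').
    my @nameparts = split /(\W)/, $name;
    return if @nameparts < 3;    # no delimiter, so nothing to extract
    return exists $clanhash{ $nameparts[0] }
        ? $nameparts[0] . $nameparts[1]     # known prefix plus delimiter
        : $nameparts[1] . $nameparts[2];    # delimiter plus the part after it
}

print tag_by_delimiter('jP|Azrael'), "\n";
print tag_by_delimiter('Jeff.ocr'), "\n";
```

Note that a name with more than one non-word character trips this up: for pr!me.ocr the function returns !me rather than .ocr, because only the first delimiter is considered.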

index

After Compline,
Zaxo


Re: Re: String similarity extraction

by Braindead_One (Monk) on Jan 11, 2003 at 19:15 UTC


Re: String similarity extraction
by theorbtwo (Prior) on Jan 11, 2003 at 23:05 UTC

Careful; what you want to do isn't possible to do 100% heuristically. Consider a clan which has two members, SC|Adam and SC|Alex. The clantag isn't SC|A, even though that's the longest common substring; it's SC|, and they both happen to have names beginning with an A.

Also, consider SCSuperBot, SCMegaBot, and SCWackoBot. The clantag is SC, not Bot, or both, but the longest common substring is Bot.

Anyway, don't let this discourage you too much, I just wanted to note it.
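
Both pitfalls are easy to reproduce with a brute-force longest-common-substring search. The following is just a sketch for illustration (the function name is invented here); it tries every substring of the first name, longest first, and returns the first one that occurs in all the other names.

```perl
use strict;
use warnings;

# Longest substring that occurs in every name, found by brute force:
# try each substring of the first name, longest first.
# longest_common_substring is a hypothetical name for this sketch.
sub longest_common_substring {
    my ($first, @rest) = @_;
    for my $len (reverse 1 .. length($first)) {
        for my $start (0 .. length($first) - $len) {
            my $candidate = substr($first, $start, $len);
            # accept the candidate only if no other name lacks it
            return $candidate
                unless grep { index($_, $candidate) < 0 } @rest;
        }
    }
    return;    # nothing in common
}

print longest_common_substring('SC|Adam', 'SC|Alex'), "\n";
print longest_common_substring('SCSuperBot', 'SCMegaBot', 'SCWackoBot'), "\n";
```

With these inputs the search yields SC|A and Bot, exactly the wrong answers described above, so any automatic guess probably needs a sanity check such as a list of known tags or a preference for matches at the start or end of the name.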

Warning: Unless otherwise stated, code is untested. Do not use without understanding. Code is posted in the hopes it is useful, but without warranty. All copyrights are relinquished into the public domain unless otherwise stated. I am not an angel. I am capable of error, and err on a fairly regular basis. If I made a mistake, please let me know (such as by replying to this node).


Re: String similarity extraction
by PodMaster (Abbot) on Jan 11, 2003 at 16:13 UTC

MJD says you can't just make shit up and expect the computer to know what you mean, retardo!
** The Third rule of perl club is a statement of fact: pod is sexy.


Re: Re: String similarity extraction

by Braindead_One (Monk) on Jan 11, 2003 at 16:30 UTC

The problem is that I have no control over the interface. The data is generated by players on online gameservers. The names are chosen by the clans, and all I can do is read them from logfiles or via UDP.
