Update: realised this looks long. It is quite long. The poster above has given you a very fine answer, but it's intended for quite a limited set of cases, all from the same root. I've tried to go a little deeper and identify the smaller morphological elements in the words, in the abstract, hence the wordiness.
++ Interesting question. I've been playing around with it on the sly for an hour or two, but a project manager keeps coming over and asking me why his website's feature boxes are still broken, so I haven't got a complete answer for you, just some suggestions, which might be helpful or otherwise.
To start with, I think you might need to give your program a few more hints to help it analyse your data. At the moment, you're giving it a bare English translation and expecting it to identify grammatical elements that might correspond to morphological elements in the original. You might make it a lot easier if you pre-analyse each instance into the grammatical parts that make it up. What I'm thinking you might end up with is a data structure that looks like this:
my %analysed = (
    baSlar     => ['plural'],
    baSlarimiz => ['plural', 'possessed by us'],
    baSimda    => ['possessed by me', 'locative'],
);
You could identify any number of syntactic features in a given word this way: verbal moods or aspects, nominal cases, numbers or genders. I'm not sure whether every word in a data set you're working on will come from the same root or not, but I'll assume it does (it's not a huge problem if it doesn't, though**). Then, extract the root using something like the subroutine supplied by the previous poster, or whatever:
sub findroot {
    my @words = @_;
    my %stems;
    foreach my $word (@words) {
        # count every prefix of the word, longest first, down to ''
        my @letters = split //, $word;
        do {
            $stems{ join '', @letters }++;
        } while defined pop @letters;
    }
    # dump all the possible stems that don't match every word
    delete $stems{$_} for grep { $stems{$_} < @words } keys %stems;
    # return the stem - i.e. the longest common element
    return ( sort { length $b <=> length $a } keys %stems )[0];
}
Stripping that root from each word leaves you with a set of strings that are groups of morphemes (the words without the roots). For each of these strings, you know it has to contain a set number of individual morphemes representing the grammatical features. E.g.
imda: 'possessed by me', 'locative'
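The stripping step itself might look something like the sketch below. The helper names common_prefix and strip_root are made up for illustration, and it assumes the root is simply the longest prefix shared by every word, which only holds if all the words really do come from one root and take nothing but suffixes.

```perl
use strict;
use warnings;

# Hypothetical helper: longest prefix shared by every word.
sub common_prefix {
    my @words  = @_;
    my $prefix = shift @words;
    return '' unless defined $prefix;
    foreach my $word (@words) {
        # shorten the prefix until this word starts with it
        chop $prefix while $prefix ne '' && index($word, $prefix) != 0;
    }
    return $prefix;
}

# Hypothetical helper: map each word to what is left after the root.
sub strip_root {
    my @words = @_;
    my $root  = common_prefix(@words);
    my %suffix_of = map { $_ => substr($_, length $root) } @words;
    return ($root, \%suffix_of);
}

my ($root, $suffix_of) = strip_root(qw( baSlar baSlarimiz baSimda ));
print "root: $root\n";
print "$_ => $suffix_of->{$_}\n" for sort keys %$suffix_of;
```

For the three example words this gives the root 'baS' and the suffixes 'lar', 'larimiz' and 'imda'.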
Assuming each of the grammatical elements is represented by a non-null string morpheme, there's a limited number of ways that 'imda' can indicate 'in my head'. You could generate all of them: every ordering of the features, combined with every way of cutting the string into that many non-empty pieces. (This is where I got hassled and had to stop coding, so treat it as a sketch.)
use strict;
use warnings;
use Algorithm::Permute qw( permute );
# not so fast as other modules, but it compiled OK on cygwin

my @permutations = possibles('imda', 'possessed by me', 'locative');

# Pair every ordering of the features with every way of cutting the
# string into that many non-empty pieces.
sub possibles {
    my ($string, @items) = @_;
    my @permutations;
    my @cuts = compositions(length($string), scalar @items);
    # permute reorders @items in place and runs the block once per ordering
    permute {
        foreach my $lengths (@cuts) {
            my @split = getsplit($string, @$lengths);
            my %perm;
            @perm{@items} = @split;
            push @permutations, \%perm;
            #print Dumper \%perm;
        }
    } @items;
    return @permutations;
}

# Cut $string into consecutive pieces of the given lengths.
sub getsplit {
    my ($string, @lengths) = @_;
    my @splits;
    my $offset = 0;
    foreach (@lengths) {
        push @splits, substr($string, $offset, $_);
        $offset += $_;
    }
    return @splits;
}

# Every list of $parts positive lengths that adds up to $total
# (this replaces the odometer-style nextlength, which never worked).
sub compositions {
    my ($total, $parts) = @_;
    return () if $parts < 1 || $total < $parts;
    return ([$total]) if $parts == 1;
    my @found;
    foreach my $first (1 .. $total - $parts + 1) {
        push @found, [ $first, @$_ ]
            for compositions($total - $first, $parts - 1);
    }
    return @found;
}
And then you'd have a set of guesses at the ways in which the suffix could be representing the grammatical form:
@permutations = (
    {
        'locative'        => 'i',
        'possessed by me' => 'mda',
    },
    {
        'locative'        => 'im',
        'possessed by me' => 'da',
    }, ........ );
You should then be able to cross-reference all the different cases that you have for 'locative' or 'plural' or 'possessed by me/us', and see which guesses hold for every one of them. Of course, this is a slightly 'brute force' way of approaching the problem, and the results are still likely to need some interpretation; however, it could save a lot of manual guessing. Having some knowledge of the phonemics of the language, or knowing one or two of the morphemes in advance, is likely to make it A LOT easier.
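As a rough idea of what that cross-referencing might look like, here is a self-contained sketch in core Perl only. The names orderings, cuts, guesses and cross_reference are all made up for illustration, and the filter is a deliberately simple assumption of mine: a guess survives only if, for each of its features, the morpheme it proposes is also a possible reading in every other word that carries that feature. It takes the suffixes with the root already stripped.

```perl
use strict;
use warnings;

# Suffixes (root already stripped) and the features each must express.
my %analysed = (
    lar     => [ 'plural' ],
    larimiz => [ 'plural', 'possessed by us' ],
    imda    => [ 'possessed by me', 'locative' ],
);

# Every ordering of a list (plain recursion, no modules needed).
sub orderings {
    my @items = @_;
    return ([]) unless @items;
    my @found;
    foreach my $i (0 .. $#items) {
        my @rest = @items;
        my ($picked) = splice @rest, $i, 1;
        push @found, [ $picked, @$_ ] for orderings(@rest);
    }
    return @found;
}

# Every way of cutting $string into $parts non-empty pieces.
sub cuts {
    my ($string, $parts) = @_;
    return () if $parts < 1 || length($string) < $parts;
    return ([$string]) if $parts == 1;
    my @found;
    foreach my $len (1 .. length($string) - $parts + 1) {
        my $head = substr $string, 0, $len;
        push @found, [ $head, @$_ ]
            for cuts(substr($string, $len), $parts - 1);
    }
    return @found;
}

# All feature => morpheme guesses for one suffix.
sub guesses {
    my ($string, @features) = @_;
    my @found;
    foreach my $order (orderings(@features)) {
        foreach my $cut (cuts($string, scalar @features)) {
            my %guess;
            @guess{@$order} = @$cut;
            push @found, \%guess;
        }
    }
    return @found;
}

# Keep only guesses whose morpheme for each feature is possible in
# every word that carries that feature.
sub cross_reference {
    my %words = @_;
    my %guesses_for =
        map { $_ => [ guesses($_, @{ $words{$_} }) ] } keys %words;

    # $seen{feature}{morpheme} = number of words allowing that pairing
    # $carriers{feature}       = number of words carrying that feature
    my (%seen, %carriers);
    foreach my $word (keys %words) {
        $carriers{$_}++ for @{ $words{$word} };
        my %here;
        foreach my $guess (@{ $guesses_for{$word} }) {
            $here{$_}{ $guess->{$_} } = 1 for keys %$guess;
        }
        foreach my $feature (keys %here) {
            $seen{$feature}{$_}++ for keys %{ $here{$feature} };
        }
    }

    my %surviving;
    foreach my $word (keys %words) {
        $surviving{$word} = [
            grep {
                my $guess   = $_;
                my @clashes = grep {
                    $seen{$_}{ $guess->{$_} } < $carriers{$_}
                } keys %$guess;
                !@clashes;
            } @{ $guesses_for{$word} }
        ];
    }
    return %surviving;
}

my %surviving = cross_reference(%analysed);
foreach my $word (sort keys %surviving) {
    foreach my $guess (@{ $surviving{$word} }) {
        print "$word: ",
            join(', ', map { "$_ => $guess->{$_}" } sort keys %$guess), "\n";
    }
}
```

With this data, 'larimiz' is pinned down to plural => 'lar' and possessed by us => 'imiz', because 'lar' is the only reading of 'plural' that both words allow. All six guesses for 'imda' survive, since no other word carries those features; that is the sort of case where more examples, or a morpheme you already know, would be needed.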
Of course, all this assumes that your morphemes are all suffixes, not prefixes, and that there isn't anything tricksy like sandhi taking place between suffixes. But it might be a start for you.
Have fun
/=\
**Update: it occurs to me that it doesn't really matter whether you pre-strip the root at all. You could skip that step altogether and just identify the root as another grammatical element, e.g.:
my %analysed = (
    baSlar     => ['root:head', 'plural'],
    baSlarimiz => ['root:head', 'plural', 'possessed by us'],
    baSimda    => ['root:head', 'possessed by me', 'locative'],
);