Perl and Morphology

by justinNEE (Monk)
on Mar 14, 2002 at 08:53 UTC ( [id://151634] : perlquestion )

justinNEE has asked for the wisdom of the Perl Monks concerning the following question:

I'm interested in getting ideas on how to go about writing a program to take two lists of words and try to match morphemes. One list would be in English, the other list would be in a language that is known at runtime. The bound morphemes would be predictable (plural, tense, aspect...), but the number of "roots" would not be known until the program has gone through the lists. For example:
Data:
    baSlar,heads
    BaSlarimiz,our heads
    baSimda,in my head
Would return something like:
    baS,head
    -lar,inflectional:plural
    -imiz,our -
    -imda,in my -
(or instead of 'in my - ' it would return a description.) These observations may not be true for the language, but they are true for the data that we have. When rules contradict each other, the program might look at the data more closely to see if the rule is more complex, or it might decide that since the rule occurs only once out of x times, it is an exception, or that since two rules occur 50% each, they are both acceptable. The word lists would generally be around 100-200 entries... I'll try to get a bigger sample to play with tomorrow. I read the article in tpj #17 and while it was interesting, I still don't know where to start...

Replies are listed 'Best First'.
Re: Perl and Morphology
by ChOas (Curate) on Mar 14, 2002 at 11:27 UTC

    Think this might be a start ? .. only tested on your data: ;))
    #!/usr/bin/perl -w
    use strict;

    sub FindRoot;

    # Read "word,translation" lines into a hash
    my %Word = map { chomp; split /,/ } <>;
    for (keys %Word) {
        print "Word: $_, etc: $Word{$_}\n";
    }

    my $Root  = FindRoot("Longest", keys %Word);
    my @Words = map { split } (values %Word);
    my $Word  = FindRoot("Shortest", @Words);

    print "\n---\n$Root:$Word\n";
    for (keys %Word) {
        print "-", substr($_, length $Root), ":";
        print "", substr($Word{$_}, 0, rindex($Word{$_}, $Word)), "\n";
    }

    sub FindRoot {
        my $Order = shift;
        my %Container;
        my $Min = "xxxxxxxxxxxxxxx";
        print "Finding $Order\n";
        if ($Order eq "Shortest") {
            $Min = "";
            $Min = ((length $_) > (length $Min)) ? $_ : $Min for (@_);
        }
        else {
            $Min = ((length $_) < (length $Min)) ? $_ : $Min for (@_);
        }
        # Count every prefix (2 letters and up) of every word
        for my $Word (@_) {
            $Container{substr(lc $Word, 0, $_)}++ for (2 .. length $Min);
        }
        my $Root = "";
        $Min = 0;
        my @List = sort keys %Container;
        @List = reverse @List if ($Order eq "Longest");
        print "List: @List\n";
        # Keep the first prefix with the highest count
        for (@List) {
            if ($Container{$_} > $Min) {
                $Root = $_;
                $Min  = $Container{$_};
            }
        }
        return $Root;
    }

    It's probably not the cleanest code, but it was fun to do, so
    I thought I'd post it :)))
    I figure you can use this as a start to really solve your
    problem: once you've got the root, it shouldn't be too hard to
    take the substring out of the hash and start looking for the next
    one (using that FindRoot sub), which would give you 'lar',
    making the next root 'BasLar' ...

    Hmmmmm... actually I think I got that to work here...
    Looking forward to more test-data :))


    print "profeth still\n" if /bird|devil/;
Re: Perl and Morphology
by ViceRaid (Chaplain) on Mar 14, 2002 at 17:57 UTC

    update: Realised this looks long. It is quite long. The poster above has given you a very fine answer, but it's intended for quite a limited set of cases, all from the same root. I've tried to go a little bit deeper to identify smaller morphological elements in the words, abstractly, hence this is wordy.

    ++ interesting question. I've been playing around with it on the sly for an hour or two, but a project manager keeps coming over and asking me why his website's feature boxes are still broken, so I haven't got a complete answer for you, just some suggestions, which might be helpful or otherwise

    To start with, I think you might need to give your programme a few more hints to try and get it to analyse your data. At the moment, you're giving it a bare English translation, and expecting it to be able to identify grammatical elements that might correspond to morphological elements in the original. Instead, you might make it a lot easier if you pre-analyse each instance into the grammatical parts that make it up. What I'm thinking you might end up with is a data structure that looks like this:

    my %analysed = (
        baSlar     => ['plural'],
        baSlarimiz => ['plural', 'possessed by us'],
        baSimda    => ['possessed by me', 'locative'],
    );

    You could identify any number of syntactic features in a given word this way: verbal moods or aspects, nominal cases, numbers or genders. I'm not sure whether each data set you're working on will come from the same root or not, but I'll assume they do (it's not a huge problem if they don't, though**). Then, extract the root using something like the subroutine supplied by the previous poster, or whatever:

    sub findroot {
        my @words = @_;
        my %stems;
        foreach ( @words ) {
            my @letters = split //;
            # count every prefix of the word, from longest to shortest
            do { $stems{join ('', @letters)}++ } while defined pop(@letters);
        }
        # dump all the possible stems that don't match every word
        for ( keys %stems ) {
            delete $stems{$_} if $stems{$_} < scalar(@words);
        }
        # return the stem - i.e. the longest common element
        return [ sort { length $b <=> length $a } keys %stems ]->[0];
    }

    This will give you a set of strings that are groups of morphemes (the words without the roots). For each of these strings, you know it's got to contain a set number of individual morphemes representing the grammatical features. Eg.

    imda: 'possessed by me', 'locative'

    Assuming each of the grammatical elements is represented by a non-null string morpheme, there's a limited number of ways that 'imda' can indicate 'in my head'. You could generate all these permutations (this is where I got hassled and had to stop coding ... so this is broken:)

    use Algorithm::Permute qw( permute );
    # not so fast as other modules, but it compiled OK on cygwin

    my @permutations = possibles('imda','possessed by me','locative');

    sub possibles {
        my ($string, @items) = @_;
        my @permutations;
        my $maxlength = length($string) - scalar(@items) + 1;
        permute {
            ##### this is hardcoded
            my @lengths = (2,1,1);
            do {
                my %perm;
                my @split = getsplit($string, @lengths);
                for ( my $j = 0; $j < @items; $j++ ) {
                    $perm{$items[$j]} = $split[$j];
                }
                push(@permutations, \%perm);
                #print Dumper \%perm;
            } while ( @lengths = nextlength($maxlength, @lengths) );
        } @items;
    }

    sub getsplit {
        my ($string, @lengths) = @_;
        my @splits;
        my $offset = 0;
        foreach (@lengths) {
            push(@splits, substr($string, $offset, $_) );
            $offset += $_;
        }
        return @splits;
    }

    ###### THIS DOESN'T WORK
    sub nextlength {
        my ($maxlength, @lengths) = @_;
        my $incrnext;
        foreach (@lengths) {
            if ( $_ >= $maxlength ) {
                $incrnext = 1;
                $_ = 1;
            }
            else {
                $_++ if $incrnext;
                $incrnext = 0;
            }
        }
        return if $incrnext;
        return @lengths;
    }

    And then you'd have a set of guesses at the ways in which the suffix could be representing the grammatical form:

    @possibles = (
        { 'i'  => 'locative', 'mda' => '', },
        { 'im' => 'locative', 'da'  => '', },
        ........
    );
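    For what it's worth, here is a minimal sketch of one way the guesses could be generated with plain recursion instead of Algorithm::Permute. It assumes, as above, that every feature maps to a non-empty, contiguous chunk of the suffix string. The helper names splits and orderings are made up for this illustration, and possibles only borrows its name from the attempt above; each guess is stored as feature => candidate morpheme.

```perl
use strict;
use warnings;

# Return every way to cut $string into $n non-empty, contiguous pieces.
sub splits {
    my ($string, $n) = @_;
    return () if $n < 1;
    return length($string) ? ([$string]) : () if $n == 1;
    my @out;
    # Leave at least one character for each of the remaining $n-1 pieces.
    for my $len (1 .. length($string) - ($n - 1)) {
        my $head = substr($string, 0, $len);
        push @out, [$head, @$_] for splits(substr($string, $len), $n - 1);
    }
    return @out;
}

# Return every ordering of @items (plain recursion, no CPAN module).
sub orderings {
    my @items = @_;
    return ([]) unless @items;
    my @out;
    for my $i (0 .. $#items) {
        my @rest = @items;
        my ($pick) = splice(@rest, $i, 1);
        push @out, [$pick, @$_] for orderings(@rest);
    }
    return @out;
}

# Pair each split of the string with each ordering of the features.
sub possibles {
    my ($string, @features) = @_;
    my @guesses;
    for my $pieces (splits($string, scalar @features)) {
        for my $order (orderings(@features)) {
            my %guess;
            @guess{@$order} = @$pieces;    # feature => candidate morpheme
            push @guesses, \%guess;
        }
    }
    return @guesses;
}

for my $g (possibles('imda', 'possessed by me', 'locative')) {
    print join(', ', map { "$_ => '$g->{$_}'" } sort keys %$g), "\n";
}
```

    For 'imda' with two features there are three cut points and two orderings, so this yields six guesses. The count grows quickly with longer strings and more features, which is fine for word lists of a few hundred entries but not much beyond that.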

    You should be then able to cross reference all the different cases that you have for 'locative' or 'plural' or 'possessed by me/us', and see which permutations are true for all the different cases. Of course, this is a slightly 'brute force' method of approaching this problem, and the results are still likely to need some interpretation; however, it could save a lot of manual guessing. Having some knowledge of the phonemics of the language, or knowing one or two of the morphemes in advance is likely to make it A LOT easier.
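    The cross-referencing step might look roughly like the sketch below. It is a deliberately loose check under the same assumptions (suffixes only, non-null morphemes): for each feature it keeps only the chunks that are possible in every suffix string tagged with that feature, without checking that the chunks chosen for one word fit together. The names splits and common_candidates are invented for this example.

```perl
use strict;
use warnings;

# Return every way to cut $string into $n non-empty, contiguous pieces.
sub splits {
    my ($string, $n) = @_;
    return () if $n < 1;
    return length($string) ? ([$string]) : () if $n == 1;
    my @out;
    for my $len (1 .. length($string) - ($n - 1)) {
        my $head = substr($string, 0, $len);
        push @out, [$head, @$_] for splits(substr($string, $len), $n - 1);
    }
    return @out;
}

# For each feature, keep only the pieces that are possible in every
# suffix string tagged with that feature.
sub common_candidates {
    my (%analysed) = @_;    # suffix string => [features]
    my (%seen, %words_with);
    while (my ($suffix, $features) = each %analysed) {
        # Any piece of any split could, in principle, carry any feature.
        my %pieces;
        $pieces{$_} = 1 for map { @$_ } splits($suffix, scalar @$features);
        for my $feature (@$features) {
            $words_with{$feature}++;
            $seen{$feature}{$_}++ for keys %pieces;
        }
    }
    my %result;
    for my $feature (keys %seen) {
        my @keep = grep { $seen{$feature}{$_} == $words_with{$feature} }
                   keys %{ $seen{$feature} };
        $result{$feature} = [ sort @keep ];
    }
    return %result;
}

# The sample words with the root 'baS' already stripped off.
my %candidates = common_candidates(
    lar     => ['plural'],
    larimiz => ['plural', 'possessed by us'],
    imda    => ['possessed by me', 'locative'],
);
print "$_: @{ $candidates{$_} }\n" for sort keys %candidates;
```

    On these three suffix strings, 'plural' narrows down to lar because it is the only chunk available in both lar and larimiz. The other features each occur in just one word, so they stay ambiguous until more data comes in.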

    Of course, all this assumes that your morphemes are all suffixes, not prefixes, and that there isn't anything tricksy like sandhi taking place between suffixes. But it might be a start for you.

    Have fun


    **update: it occurs to me that it doesn't really matter if you pre-strip the root at all. Instead, you could skip that step altogether, and just identify the root as another grammatical element, eg:

    my %analysed = (
        baSlar     => ['root:head', 'plural'],
        baSlarimiz => ['root:head', 'plural', 'possessed by us'],
        baSimda    => ['root:head', 'possessed by me', 'locative'],
    );
Re: Perl and Morphology
by ronald (Beadle) on Mar 14, 2002 at 22:39 UTC
    First, you have to decide the kinds of languages you want to be able to parse. If the list begins and ends with Turkish, you can simplify a lot, as you don't have to worry about non-concatenative kinds of morphology (e.g. reduplication, infixation, ablaut), or even prefixation, for that matter.

    You don't say whether your goal is to learn about parsing or simply to be able to parse Turkish words. If the latter, you can save yourself a lot of time and effort by using the parser available at:

    You get results like the following for 'baSlar':


    You can use Perl to submit words for parsing and then map the results onto English. You'll also need to preprocess the words to apply some phonological rules, like vowel harmony. For example, you won't get any results from 'baSimda' and have to submit as 'baSImda' instead. You could apply the harmony rules with s///, though if you apply harmony to all words, it will apply incorrectly to disharmonic roots and non-harmonizing suffixes. It's pretty hard to avoid that problem without first having the morphological parse!

Re: Perl and Morphology
by Maestro_007 (Hermit) on Mar 14, 2002 at 20:49 UTC

    As you can tell from my home node, I've taken an interest in this kind of thing. Most of my work is in phonology, so you'll have to excuse me if things point in that direction. I'm sure you've already thought of most of this, but I wanted to lay it out.

    Some considerations:

    The main thing that occurs to me is the danger of assuming 1) that the root is the longest element of a word, and that 2) there is only one root and one affix.

    There are plenty of languages (e.g. Basque, Russian, probably even English, though I can't think of examples) that have morphemes with more sounds than the root. I'll find some examples later when I have all my dictionaries around me :)

    There are plenty of languages (every one that I can think of) that allow compounding of roots, and much prepending/appending of affixes. Basque in particular allows many morphemes to be attached to a given word (I think it can get up to 6).

    I hope you don't have to deal with this, but you may have to consider circumfixes (one single morpheme that has parts before and after the root, like German past participles, e.g. 'ge-mach-t') and infixes (morphemes inserted into the root, the only example I can think of being the old Fish Called Wanda 'unbe-f**n-lievable').

    This leads me to encourage supplying the engine with a many-to-many set of words. Use the same root with different affixes, but also use the same affixes with different roots.

    Of course, your problem set is probably reduced to a single family of languages, so maybe you won't have to take all this into consideration, but these are the sorts of questions I had immediately.

    This is definitely a very studied problem, and though I think it can be solved for small situations and small data sets with relative ease, I'd encourage research into what Carnegie-Mellon, the University of Edinburgh, and the University of Texas have done in this direction.

    Finally, if you're going to work with English, you'll need to write everything phonetically. For example, the silent 'e' that gets deleted when adding a suffix that begins with a vowel may become a problem ('believ-able'). Once again, if you've done linguistics for ten minutes you know what a chore anything in English is.

    Hope this isn't too much. Good luck with this. I'd like to hear more about it if you get some good stuff working.


    update: Turkish! Cool! IPA is definitely the way to go, but the problem is: which IPA? Can you get the stuff to work in Unicode? If you can, you can do all sorts of normal pattern matching (regex) using Perl 5.6. If you only use Sil, I'm sure there's still a way to do it, but it may be more difficult. That's one of the principal things I'm working on (a bridge between Sil and Unicode), but haven't quite done yet.
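    As a rough illustration of the Unicode route, the sketch below treats IPA characters like any other letters in a regex. It assumes the ASCII 'S' in the sample data stands for the IPA character U+0283 (esh), which is a guess, and the helper name strip_root is made up for the example.

```perl
use strict;
use warnings;

# Assumption: the ASCII "S" in the sample data stands for IPA U+0283 (esh).
my $esh  = "\x{0283}";
my @data = ("ba${esh}lar", "ba${esh}larimiz", "ba${esh}imda");

# Return whatever follows a known root, or undef if the word doesn't
# start with that root. \p{L} matches any Unicode letter, IPA included.
sub strip_root {
    my ($word, $root) = @_;
    return $word =~ /^\Q$root\E(\p{L}+)$/ ? $1 : undef;
}

binmode STDOUT, ':encoding(UTF-8)';
for my $word (@data) {
    my $suffix = strip_root($word, "ba$esh");
    print "$word -> -$suffix\n" if defined $suffix;
}
```

    Because the strings are character strings rather than bytes, length and substr count the esh as a single character, so the prefix-counting approaches earlier in the thread should carry over unchanged once the input is decoded properly.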

    If I ever end up getting all my stuff done, we may be able to correspond on some of this stuff. Hope I wasn't overly cautionary there.

      i want the history of morphology
Re: Perl and Morphology
by justinNEE (Monk) on Mar 14, 2002 at 20:37 UTC
    Hey guys, thanks for the input, I agree that this program is going to have to know a little more about the English list and English in general. (This problem seemed like it was going to be easier at 1am) Unfortunately I have to waste Tuesdays and Thursdays (8am-7pm) in classes, so I won't get to input some real data until tonight. This sample will be Turkish phrases in the IPA's alphabet. I wonder if this is going to complicate things?

    update: Justin still doesn't know how to use anyway, how am I supposed to pass around this file? I don't suppose this is the right way: encoded text file