wolis has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I'm working on a little game that uses Perl to parse simple English phrases like 'create a small white mouse'. The result is an object added to a database: a 'mouse' with the attributes 'size=small' and 'colour=white'.

As much as I have the basics working and am enjoying solving this on my own, I thought I might see what others have done or think on this topic.

Has anyone done any work on parsing things like this?

Or can anyone point me in the direction of some relevant text on this subject?

Thanks

___ /\__\ "What is the world coming to?" \/__/ www.wolispace.com

Replies are listed 'Best First'.
Re: Parsing english
by allolex (Curate) on Oct 07, 2003 at 09:18 UTC

    Unless you want your command entry to quickly become the main focus of your game, you might consider using a simplified grammar and lexicon that shoots for about 90% interpretation accuracy at about 70% precision. Just make sure you know what your verbs (V), nouns (N), and adjectives (A) are.

    So, as far as syntax is concerned, with English you have an advantage for commands. Commands start with verbs and have zero or more arguments having something to do with the verb's action. (I imagine you are *doing* stuff in your game, so I wouldn't bother accounting for stative verbs.)

    Your command boils down to:

    create(mouse)
    where mouse(small, white)

    So you have the combinations V(N) and N(A,A)... and there you have your objects and their properties. You might also want some stemming, so the Lingua:: modules on CPAN (Lingua::Stem::En) should be of some help.
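
The V(N) and N(A,A) idea above can be sketched roughly as follows. This is only a minimal illustration: the `%lexicon` contents and the `parse_command` name are made up for the example, and no stemming is attempted.

```perl
use strict;
use warnings;

# Hypothetical lexicon: every word the game knows, tagged V, N or A.
my %lexicon = (
    create => 'V',
    mouse  => 'N',
    small  => 'A',
    white  => 'A',
);

# Parse "verb [article] adjective* noun" into V(N) plus N(A, A, ...).
# Returns undef when the phrase does not fit that shape.
sub parse_command {
    my ($input) = @_;
    my @words = grep { !/^(?:a|an|the|some)$/ } split ' ', lc $input;

    my $verb = shift @words;
    return unless defined $verb && ($lexicon{$verb} // '') eq 'V';

    my $noun = pop @words;
    return unless defined $noun && ($lexicon{$noun} // '') eq 'N';

    # Everything between the verb and the noun must be a known adjective.
    for my $word (@words) {
        return unless ($lexicon{$word} // '') eq 'A';
    }
    return { verb => $verb, noun => $noun, adjectives => [@words] };
}

my $cmd = parse_command('create a small white mouse');
printf "%s(%s) where %s(%s)\n",
    $cmd->{verb}, $cmd->{noun}, $cmd->{noun}, join(', ', @{ $cmd->{adjectives} });
# prints "create(mouse) where mouse(small, white)"
```

Rejecting unknown words outright keeps the sketch short; a game would more likely report which word it did not understand.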

    Each verb will have to have its arguments defined. Your example of "create" can have one argument type, which is whatever you are creating. You will also have to check to make sure the thing you are creating is creatable, i.e. your lexicon will have to know which actions apply. You might also need to account for movement. Luckily, movement is well-studied and very well-formalized. You move FROM (LOCATION) VIA (LOCATION) TO (LOCATION). Here, you can use keywords to map your path.
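
As a rough sketch of the keyword idea for movement, the following pulls FROM / VIA / TO locations out of a command. The function name, the single-word verb and the exact keyword set are assumptions made for the example, not part of any existing module.

```perl
use strict;
use warnings;

# Pull FROM / VIA / TO locations out of a movement command by keyword.
sub parse_movement {
    my ($input) = @_;
    my @words = split ' ', lc $input;
    my %path  = (verb => shift @words);

    my $slot;    # which of from/via/to is currently being filled
    for my $word (@words) {
        if ($word =~ /^(?:from|via|to)$/) {
            $slot = $word;
        }
        elsif (defined $slot) {
            # Multi-word locations accumulate in the current slot.
            $path{$slot} = defined $path{$slot} ? "$path{$slot} $word" : $word;
        }
    }
    return \%path;
}

my $move = parse_movement('walk from the cave via the bridge to the castle');
print join(', ', map { "$_=$move->{$_}" } grep { exists $move->{$_} } qw(verb from via to)), "\n";
# prints "verb=walk, from=the cave, via=the bridge, to=the castle"
```

Because the slots are found by keyword rather than by position, the three parts can appear in any order.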

    I've got to run now, but I'll be more than happy to respond in more detail later. (I came back.)

    You might want to look at http://citeseer.nj.nec.com/ and search for "parsing english". You'll find lots of academic articles... somewhere there's bound to be some introductory material. If that fails, then look for a copy of Natural Language Understanding by James Allen at the University of Rochester.

    Update: Fixed some sloppy grammar, added more detail.

    --
    Damon Allen Davison
    http://www.allolex.net

      This Lingua:: stuff looks intriguing.. but my quick skim over its surface found no 'concrete' examples of what it actually is or what it does..

      ..looks like I need that PhD after all :-)

      ___ /\__\ "What is the world coming to?" \/__/ www.wolispace.com
Re: Parsing english
by Roger (Parson) on Oct 07, 2003 at 05:18 UTC
    Can you be more specific on what your little game is going to parse? Is it going to parse a set of strictly defined phrases in a specific order? Or is it going to parse what the players type in (like the old Sierra adventure games)? The second type of text is much harder to parse.

    You can check out the Parse::RecDescent module from CPAN if you decide to build it upon a strictly defined grammar.

    In the second case, the simplest approach would be to grep for recognised words in the tokenised text, and then act upon the recognised words. Otherwise you would end up with the task of writing a natural language parser with functional dependent grammar... (Good for a PhD thesis perhaps?)
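
A minimal sketch of the "grep for recognised words" approach might look like this. The `%known` vocabulary and the `recognised_words` name are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical vocabulary; anything not listed here is silently ignored.
my %known = map { $_ => 1 } qw(create put rabbit sticks dynamite on);

# Tokenise free text and keep only the words the game recognises.
sub recognised_words {
    my ($input) = @_;
    my @tokens = lc($input) =~ /([a-z']+)/g;
    return grep { $known{$_} } @tokens;
}

print join(' ', recognised_words('Please put the dynamite on the rabbit!')), "\n";
# prints "put dynamite on rabbit"
```

The surviving words keep their original order, so a later step can still treat the first one as the verb.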

      Yes, I aspire for the latter.

      However I do assume the sentences being typed in would match a logical structure so:

      Create some sticks
      Create a pile of wooden sticks
      Create a great big pile of burning wooden sticks
      Create a small bundle of red sticks of dynamite
      Create a stick of dynamite

      All these would (and currently do) work in my parser. However, I am working within the confines of creating objects, so 'sticks are to be created that are made of wood and grouped into a pile' is outside my 'world view' :-) and ignored.. or, more accurately, 'said' to the other players and not acted upon like a 'create' command.

      And yes you guessed it:

      Create a small white rabbit
      Put the rabbit on the sticks
      Put the dynamite on the rabbit
      etc..

      Will also logically be parsed and 'work', so players will see 'a pile of sticks with a small white rabbit on it. On the rabbit is a small bundle of red sticks of dynamite' etc..

      ___ /\__\ "What is the world coming to?" \/__/ www.wolispace.com
Re: Parsing english
by Abigail-II (Bishop) on Oct 07, 2003 at 10:44 UTC
    Parsing natural languages isn't easy. Just think about the ways to parse:
    Time flies like an arrow.
    If you want to parse English, go get a linguistics degree at a University.

    However, for a game you can get away with something else. Parsing simple sentences, which are often of the form:

    VERB [OBJECT [OBJECT]]
    Text-based games have been around for over three decades, including Colossal Cave, Zork and thousands of muds. For many games, source is available, and mudlibs are available too. Granted, they are typically written in another language than Perl, but that shouldn't be a problem. The algorithms will remain the same, and they'll be much easier to implement in Perl than in C or Fortran.

    Abigail

      Time flies like an arrow?

      How about:   Fruit flies like a banana

      sorry

        Interesting example :)

        1) Fruit flies the way a banana flies. ('like' as conjunction)

        2) Fruit flies like (e.g. the taste of) a banana. ('like' as verb)

Re: Parsing english
by ViceRaid (Chaplain) on Oct 07, 2003 at 10:05 UTC

    As people have said above, you'll make your life a lot easier if you can constrain the complexity of the sentences which you're trying to parse, and perhaps also limit the vocabulary which can be employed by the user. If you can end up with a grammar like (in pseudo-regex-code)

    [VERB] the|a|some [ADJECTIVE]* [OBJECT]? \
        [to|for|from|with INDIRECT-OBJECT]? [ADVERB]?

    "Create a small white mouse quickly."
    [V] [ADJ] [ADJ] [OBJ] [ADV]

    "Give some tasty peanuts to the mouse."
    [V] [ADJ] [OBJ] [I-OBJ]

    "Eat the mouse noisily."
    [V] [OBJ] [ADV]

    You can parse quite an expressive range of sentences especially if you know what parts of speech (verbs, nouns, adverbs etc.) different tokens are. This is the Parse::RecDescent approach.
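
For a vocabulary this small, the pseudo-grammar can also be tried out with a plain regex before reaching for Parse::RecDescent. The sketch below is an illustration only: the word lists in `%pos` and the `parse_sentence` name are assumptions, and it needs Perl 5.10 or later for named captures.

```perl
use strict;
use warnings;

# Hypothetical part-of-speech lists for the pseudo-grammar.
my %pos = (
    verb   => [qw(create give eat)],
    adj    => [qw(small white tasty)],
    noun   => [qw(mouse peanuts)],
    adverb => [qw(quickly noisily)],
);
my %alt = map { $_ => join '|', @{ $pos{$_} } } keys %pos;

# VERB article? ADJECTIVE* OBJECT? (preposition article? INDIRECT-OBJECT)? ADVERB?
my $sentence = qr/
    ^ (?<verb> $alt{verb} )
    (?: \s+ (?: the | a | some ) )?
    (?<adjs> (?: \s+ (?: $alt{adj} ) )* )
    (?: \s+ (?<obj> $alt{noun} ) )?
    (?: \s+ (?<prep> to | for | from | with )
        (?: \s+ (?: the | a | some ) )?
        \s+ (?<iobj> $alt{noun} ) )?
    (?: \s+ (?<adverb> $alt{adverb} ) )?
    \.? $
/x;

# Returns a hash ref of the matched parts, or undef if the text does not fit.
sub parse_sentence {
    my ($text) = @_;
    return unless lc($text) =~ $sentence;
    my %parsed = %+;                                   # copy the named captures
    $parsed{adjs} = [ split ' ', $parsed{adjs} // '' ];
    return \%parsed;
}

for my $text ('Create a small white mouse quickly.', 'Eat the mouse noisily.') {
    my $p = parse_sentence($text) or next;
    print "$p->{verb}: object=", $p->{obj} // '-', " adjectives=(@{ $p->{adjs} })", "\n";
}
```

Parts that did not appear in the sentence (for example `prep` and `iobj` in the second sentence) are simply absent from the returned hash.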

    A deeper, more "linguistic" approach might be to use something like a link parser, which is a package that analyses the structure of natural language sentences. There are several available free on the web, although I've never used one with Perl, only with Other Languages. There are lots of other free linguistic resources available on the web which you might find useful, including WordNet, which you can use to look up hyponyms, synonyms and hypernyms ....

    .... but this is probably all a bit much for making a small mouse ...

    ViceRaid

Re: Parsing english
by dragonchild (Archbishop) on Oct 07, 2003 at 14:44 UTC
    MUDs don't actually parse English. They parse commands. So, let's say you type in "kill floober with sword". What happens is roughly analogous to:

    my @command  = split /\s+/, $input;
    my $cmd_name = shift @command;
    complain("$cmd_name isn't a command\n")
        unless exists $dispatch_table{$cmd_name};
    $dispatch_table{$cmd_name}->(@command);
    # And the function called will handle "floober with sword"

    Good MUDs (which were a small minority ten years ago) will strip out words like a, the, and the like, so that parsing is easier.

    Better MUDs will use those words to help figure things out. So, you could say something like "kill all the floobers with my magic sword" and the MUD will actually set your attack flag to attack all the floobers in the room and will use your magic sword (as opposed to your non-magic sword or your magic spear). But, that command pre-processing is difficult to locate because it does a common activity, but (potentially) requires a ton of information that crosses all the data structures. (The room, the character, the other PCs/NPCs in the room, etc.)

    (The standard DikuMUD would complain "I see no 'floobers' in this room!" or some such if you tried the second line.)
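
To make the "strip the filler words, then dispatch" step concrete, here is a small self-contained sketch. The handlers, the `%filler` list and the `run_command` name are invented for illustration and are not taken from any MUD codebase.

```perl
use strict;
use warnings;

# Hypothetical command handlers; each receives the remaining words.
my %dispatch_table = (
    kill => sub { return "You attack: @_" },
    look => sub { return @_ ? "You look at: @_" : 'You look around.' },
);

# Filler words a simple MUD might discard before dispatching.
my %filler = map { $_ => 1 } qw(a an the at);

sub run_command {
    my ($input) = @_;
    my @command  = grep { !$filler{$_} } split ' ', lc $input;
    my $cmd_name = shift @command;
    return 'Huh?' unless defined $cmd_name;
    return "$cmd_name isn't a command" unless exists $dispatch_table{$cmd_name};
    return $dispatch_table{$cmd_name}->(@command);
}

print run_command('kill the floober with sword'), "\n";   # You attack: floober with sword
print run_command('dance'), "\n";                         # dance isn't a command
```

A "better MUD" in the sense described above would keep words such as "all" and "my" and pass them to the handler instead of discarding them.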

    ------
    We are the carpenters and bricklayers of the Information Age.

    The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6

    Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

Re: Parsing english
by halley (Prior) on Oct 07, 2003 at 14:49 UTC
    I've long looked at making a generic parser and command object model that could handle all of the command sentences in any Infocom game. I generalized further, and developed this basic sentence structure. I've implemented code to do this in Java, C and Perl over the years.
    [ subject , ] verb [ adverb ] [ dobject | dobjectlist | "stringliteral" ] [ preposition iobject ] [ . | ? | ! ]
    The verb is the only requirement.

    Real subjects, dobjects and iobjects all follow the basic grammar:

    [ article ] [ adjectives ] noun

    There are quite a few alternative grammars to the main sentence type, but the overall fields are fixed and once determined, all have the same meaning. For instance, it's okay to type the adverb before the verb. The adverb usually describes a different tradeoff but the same basic verb behavior (run quickly) vs (run quietly). I would recommend against supporting multiple adverbs, especially adverbs modifying adverbs (like 'very').

    The subject, if specified, must be first and followed by a comma. It's up to the subject to "consent" to the request; they can decide for themselves whether or not to allow the command (floyd, give me the circuit board).

    The dobject is either singular, or a list of objects, or a string literal. If a list of objects, the word "and" and/or a comma must separate items. Special pseudo-articles such as qw(all some the my) can help a search strategy for multiple objects within a given search domain (put all goo in the box). Lastly, a string literal is used for things like dialogue (say "hello" to floyd). An alternative sentence grammar would assume that if the sentence consists only of a string literal, then the verb is either 'say', 'exclaim' or 'ask' depending on any final punctuation.
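
The bare-string-literal rule could be sketched like this. It is not the code mentioned in this post; the `implied_speech` name is made up, and since the text does not say whether the final punctuation sits inside or outside the quotes, the sketch accepts either.

```perl
use strict;
use warnings;

# Map a sentence consisting only of a string literal to an implied verb,
# chosen from its final punctuation ('.' or none => say, '?' => ask, '!' => exclaim).
sub implied_speech {
    my ($input) = @_;
    return unless $input =~ /^\s*"([^"]*)"\s*([.?!])?\s*$/;
    my ($speech, $mark) = ($1, $2);

    # No punctuation after the closing quote: look just inside it instead.
    ($mark) = $speech =~ /([.?!])\s*$/ unless defined $mark;
    $mark //= '.';

    my %verb_for = ('.' => 'say', '?' => 'ask', '!' => 'exclaim');
    return { verb => $verb_for{$mark}, dobject => $speech };
}

my $line = implied_speech('"where is the circuit board?"');
print "$line->{verb}: $line->{dobject}\n";
# prints "ask: where is the circuit board?"
```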

    The overall effect of multiple dobjects is a simple iteration, with the sentence applied once identically to each dobject. Throw exceptions to interrupt the processing if desired.

    Iobjects are always singular prepositional targets. An alternative sentence syntax allows iobject to precede dobject, but it really swaps them and supplies a default preposition: (give floyd the broom) becomes (give the broom to floyd). This is detected while parsing by noting the missing comma/'and' between two noun phrases.
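
The field layout described above (optional subject, verb, a list of dobjects, one prepositional iobject) might be split apart along these lines. This is a hedged sketch rather than the code mentioned in this post: the preposition list and the `parse_sentence_fields` name are assumptions, and articles, adverbs and the swapped iobject form are left out to keep it short.

```perl
use strict;
use warnings;

# Assumed preposition list; a real game would carry a much longer one.
my $prep_re = qr/\b(?:to|in|on|with|from)\b/;

# Split "[subject,] verb dobject[, dobject and dobject] [preposition iobject]".
sub parse_sentence_fields {
    my ($input) = @_;
    my %fields;
    my $text = lc $input;
    $text =~ s/[.?!]\s*$//;                      # drop final punctuation

    # A leading word followed by a comma is the subject being addressed.
    $fields{subject} = $1 if $text =~ s/^(\w+)\s*,\s*//;

    my ($verb, $rest) = split ' ', $text, 2;
    $fields{verb} = $verb;
    $rest //= '';

    # Everything after the first preposition is the single indirect object.
    if ($rest =~ /^(.*?)\s*($prep_re)\s+(.+)$/) {
        ($rest, $fields{preposition}, $fields{iobject}) = ($1, $2, $3);
    }

    # Direct objects are separated by commas and/or the word "and".
    $fields{dobjects} = [ grep { length } split /\s*,\s*(?:and\s+)?|\s+and\s+/, $rest ];
    return \%fields;
}

my $s = parse_sentence_fields('floyd, put the red key and the blue key in the box.');
# One sentence with several dobjects behaves like one command per dobject.
for my $dobject (@{ $s->{dobjects} }) {
    printf "%s: %s [%s] %s [%s]\n",
        $s->{subject}, $s->{verb}, $dobject, $s->{preposition}, $s->{iobject};
}
```

Iterating over `dobjects` mirrors the "simple iteration" described above; a handler could `die` inside the loop to interrupt processing.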

    There's a lot more to my scheme; as I said I have developed the code but it's not something I can freely share in detail at this time. You're welcome to e-mail for other ideas, though.

    --
    [ e d @ h a l l e y . c c ]

Re: Parsing english
by EdwardG (Vicar) on Oct 07, 2003 at 12:19 UTC
    As much as I have the basics working...

    Really?

    I'd like to see your code if you're feeling brave enough.

    I once (maybe ten years ago) tried thinking about algorithms for parsing natural language and very quickly concluded that I wasn't up to the job. As Abigail so forthrightly intimates, I was probably missing a linguistics degree (or PhD more like).

    Professor Higgins I ain't.

      Thanks for all of your useful comments ppl

      Here is my basic code (very un-commented at present) and not very elegant.

      It doesn't do anything with the 'attributes' yet.. this will be looking up in the database (finding objects of class=Attribute name={value})

      Please don't hold back and rip apart/suggest/improve where applicable.

      my %qtys;
      $qtys{'the'}   = 0;
      $qtys{'some'}  = 20;
      $qtys{'bunch'} = 30;
      $qtys{'pile'}  = 40;
      $qtys{'many'}  = 50;
      $qtys{'heaps'} = 100;
      $qtys{'heap'}  = 100;

      print &parse_obj('a small white fluffy cat called sam');
      print &parse_obj('a pile of sticks made of wood');
      print &parse_obj('an old orange');
      print &parse_obj('the broken golden cup of plenty');
      print &parse_obj('a player called bob');
      print &parse_obj('3 blind mice');
      print &parse_obj('a piece of old triangular wooden boat called fred');
      print &parse_obj('a kind of mouse');
      print &parse_obj('some roses');

      sub parse_obj {
          my $this_obj = $_[0];
          my $pre      = '';
          my $class    = '';
          my $name     = '';
          my $qty_type = '';
          my $attribs  = '';
          my $material = '';   # declared before the 'made of' test so it is not reset
          my $size     = '';   # not populated yet

          if ($this_obj =~ /(.+) made of (.+)/i) {
              $this_obj = $1;
              $material = $2;
          }
          $this_obj =~ /(\w+)\s?(.+)/;
          my $qty  = $1;
          my $rest = $2;

          if ($qty =~ /the/i) {
              # handle special named one off objects 'the void' 'the sword of light'
              # check for specially typed objects..
              if ($rest =~ /(.+) of (.+)/i) {
                  $pre  = $1;
                  $name = 'of ' . $2;
              } else {
                  $pre = $rest;
              }
          } elsif ($rest =~ /(.+) of (.+)/i) {
              $pre   = $1;
              $class = $2;
              if ($pre =~ /(.+) (.+)/) {
                  $pre      = $1;
                  $qty_type = $2;
              } else {
                  $qty_type = $pre;
                  $pre      = '';
              }
          } else {
              $pre = $rest;
          }
          $pre .= ' ' . $class;

          if (($pre =~ /(.+) called (\w+)/) || ($pre =~ /(.+) named (\w+)/)) {
              $pre   = $1;
              $name  = $2;
              $class = '';
          }
          if ($pre =~ /(.+) (\w+)/) {
              $attribs = $1;
              $class   = $2;
          } else {
              $class   = $pre;
              $attribs = '' if $name ne '';
          }

          if (exists $qtys{$qty_type}) {
              # convert 'a pile of' into 'pile' = 40
              $qty      = $qtys{$qty_type};
              $qty_type = '';
          } elsif ($qty !~ /^\d+$/) {
              # known quantity words are looked up, anything else counts as 1;
              # a literal number such as '3' is kept as typed
              $qty = exists $qtys{$qty} ? $qtys{$qty} : 1;
          }

          return "\n-----------------\n>>$this_obj\n qty[$qty] size=[$size] typ[$qty_type] att[$attribs] mat=[$material] class[$class] name[$name]\n";
      }
      ___ /\__\ "What is the world coming to?" \/__/ www.wolispace.com
Re: Parsing english
by PetaMem (Priest) on Oct 07, 2003 at 14:26 UTC

      Hi fatvamp,

      I have been looking at regex possibilities with perl that tend a bit in the direction of NLP, and consequently came across your nodes and website.

      You say of Lingua::LinkParser 'Quite nice for toy systems.' I don't have (much) experience with it yet but it looks very well-done and well-behaving to me when I paste in some test sentences.

      May I ask how negative/positive/experienced you are about the module, and more especially the underlying LINK Parser, and its quality now? (I realise your post/opinion is a year old)

      Thanks!

Re: Parsing english
by artist (Parson) on Oct 08, 2003 at 04:33 UTC
    You always want to convey some message via your interface, and your messages can be known in advance. You may be able to create a system which asks questions in order to recast your input into a known format, i.e. presenting your own input in a different format that is acceptable to you. This is of course a two-step process, but it may help you.

    If the message conveyed can be put at a definite place in the scheme of things, then it becomes easier to ask questions.

    Consider the case:

    Travel Booking:: "I want to go from London to New York."

    If it recognizes 'from London' and 'to New York', that would be it, because it has definite templates to handle the information.
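
A small sketch of such a template: it fills the 'from' and 'to' slots and reports which ones are still missing, so the interface knows what to ask next. The function name is invented, and the sketch assumes that place names are capitalised, which is how it tells "to New York" apart from "to go".

```perl
use strict;
use warnings;

# One or more capitalised words, e.g. "London" or "New York" (an assumption).
my $place = qr/[A-Z]\w*(?:\s+[A-Z]\w*)*/;

# Fill the slots of a fixed travel template and list the slots still missing.
sub fill_travel_template {
    my ($input) = @_;
    my %slots;
    $slots{from} = $1 if $input =~ /\b[Ff]rom\s+($place)/;
    $slots{to}   = $1 if $input =~ /\b[Tt]o\s+($place)/;
    my @missing = grep { !exists $slots{$_} } qw(from to);
    return (\%slots, \@missing);
}

my ($slots, $missing) = fill_travel_template('I want to go from London to New York.');
print "from=$slots->{from} to=$slots->{to}\n";          # from=London to=New York

($slots, $missing) = fill_travel_template('I want to go to Paris');
print "Please tell me: @$missing\n" if @$missing;       # Please tell me: from
```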

    artist
    ===================================================
    Perl is fast.. So I spend more time doing fast things.
Re: Parsing english
by toma (Vicar) on Oct 09, 2003 at 05:31 UTC
    WordNet is a lexical database for the English language.

    CiteSeer is an excellent site for finding scholarly papers on this sort of thing. Look for papers on ontology.

    If you limit your problem to a little world, you can do a very nice job and only miss awkward phrases. I would be interested to know how much effort is required to scale up a small world, and whether tools like WordNet can be leveraged to reduce this effort.

    Don't be discouraged by a lack of reported successes in this area. Much of the work is outside of public view.

    It should work perfectly the first time! - toma