vit has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
I need a module for extracting noun phrases in Dutch, something similar to Lingua::EN::Tagger. If somebody knows of such a module or can share his/her program, I would appreciate it.

Re: Dutch Noun Phrases extraction
by Anonymous Monk on May 07, 2011 at 00:33 UTC
    FWIW, you should be able to use Lingua::TreeTagger, as TreeTagger says:
    The TreeTagger has been successfully used to tag German, English, French, Italian, Dutch, Spanish, Bulgarian, Russian, Greek, Portuguese, Chinese, Swahili, Latin, Estonian and old French texts and is adaptable to other languages if a lexicon and a manually tagged training corpus are available.
      Right, but I did not find noun phrase extraction there.
Re: Dutch Noun Phrases extraction
by graff (Chancellor) on May 07, 2011 at 16:33 UTC
    Anything similar to a part-of-speech tagger has very little to do with parsing syntactic structures to identify phrasal components. POS tagging is an essential first step to parsing, but parsing is a very different (and much more difficult) process.

    The only CPAN module related to human-language parsing appears to be Lingua::LinkParser, but the library it depends on has apparently not yet been extended to cover Dutch. (Extending it to Dutch would presumably be a fair amount of work.)

    In any case, it would make sense to be as explicit as possible in working out what range of structures you want to include among the "noun phrases" you need to extract. For example, assuming you could form a sentence in Dutch that is equivalent to the following example, how many noun phrases would it contain, and what would they be?

    Avoiding the improper use of technology for language analysis requires both engineering and linguistic expertise.

    (Hint: there are at least two syntactic ambiguities in that sentence, affecting the quantity and/or structure of noun phrases. These might or might not be present in a Dutch version, depending on how you choose to translate it.)

    This is why some people prefer to focus on subsets of things, like "named entities", or maybe "minimal" noun phrases that only comprise a limited range of POS sequences. In either case, having a POS tagger already in place is a big help.
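
    To make the "limited range of POS sequences" idea concrete, here is a minimal sketch in Python (the idea carries over to Perl directly). It assumes the text has already been tagged and that the tagger's output has been mapped onto three coarse tags, DET, ADJ and NOUN; those tag names and the function name are hypothetical, not the tagset of any particular Dutch tagger. It collects every run matching DET? ADJ* NOUN+.

```python
from typing import List, Tuple

# Hypothetical coarse tags; map your tagger's real tagset onto these first.
DET, ADJ, NOUN = "DET", "ADJ", "NOUN"


def minimal_noun_phrases(tagged: List[Tuple[str, str]]) -> List[str]:
    """Collect runs matching DET? ADJ* NOUN+ from (token, tag) pairs."""
    phrases = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        # Optional single determiner.
        if tagged[j][1] == DET:
            j += 1
        # Any number of adjectives.
        while j < n and tagged[j][1] == ADJ:
            j += 1
        # One or more nouns are required for a match.
        k = j
        while k < n and tagged[k][1] == NOUN:
            k += 1
        if k > j:
            phrases.append(" ".join(tok for tok, _ in tagged[i:k]))
            i = k
        else:
            i += 1
    return phrases


sentence = [
    ("de", "DET"), ("oude", "ADJ"), ("man", "NOUN"),
    ("leest", "VERB"),
    ("een", "DET"), ("boek", "NOUN"),
]
print(minimal_noun_phrases(sentence))  # ['de oude man', 'een boek']
```

    A fixed pattern like this will miss anything outside the sequences it lists (prepositional modifiers, coordination, relative clauses), which is exactly the trade-off described above.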

      Yes, that is why I asked for something similar to Lingua::EN::Tagger, which definitely extracts "minimal" noun phrases.

      In either case, having a POS tagger already in place is a big help

      So how could I take advantage of this for my purpose?
        Yes, that is why I asked for something similar to Lingua::EN::Tagger, which definitely extracts "minimal" noun phrases.

        If by "minimal noun phrases" you mean "single words that happen to be nouns", then yes, a tagger serves to extract those.

        So how could I take advantage of this for my purpose?

        This depends on what particular things you want to extract that go beyond just the individual words that get "noun" tags. Multi-noun referring expressions (e.g. "corner store", "Perl Hacker")? Phrases that include function words and/or adjectives? Arguments (subjects and/or objects) of verbs?

        A typical approach is to start with some text that contains hand-marked examples of the things you want to extract, and then build a statistical model that assigns weights to the various contexts associated with those examples -- that is, to the various patterns of POS tags in and around each chunk to be extracted. Depending on the details of the project and data the models may need to include actual words in the targets and/or contexts as well as the POS tags. (Of course, the more training data you have, the better.)
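
        As a rough illustration of "assigning weights to contexts", the sketch below (Python, but nothing in it is language-specific) counts how often each POS context was marked inside ("I") or outside ("O") a chunk in the hand-marked data, then labels new text by the majority label for each context. This is a plain relative-frequency stand-in, much simpler than the models normally used for chunking; the context definition (previous, current and next POS tag), the I/O labels and all function names are assumptions made for the example.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Context = Tuple[str, str, str]  # (previous POS, current POS, next POS)


def contexts(tags: List[str]) -> List[Context]:
    """Return one (prev, current, next) context per token, padded at the edges."""
    padded = ["<s>"] + tags + ["</s>"]
    return [(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]


def train(examples: List[Tuple[List[str], List[str]]]) -> Dict[Context, Counter]:
    """examples: (pos_tags, labels) pairs; labels are 'I' (in a chunk) or 'O'."""
    weights: Dict[Context, Counter] = defaultdict(Counter)
    for tags, labels in examples:
        for ctx, label in zip(contexts(tags), labels):
            weights[ctx][label] += 1
    return weights


def predict(weights: Dict[Context, Counter], tags: List[str]) -> List[str]:
    """Label each token with the most frequent label seen for its context."""
    out = []
    for ctx in contexts(tags):
        counts = weights.get(ctx)
        # Contexts never seen in training default to 'O' (outside).
        out.append(counts.most_common(1)[0][0] if counts else "O")
    return out


training = [
    (["DET", "ADJ", "NOUN", "VERB"], ["I", "I", "I", "O"]),
    (["DET", "NOUN", "VERB", "DET", "NOUN"], ["I", "I", "O", "I", "I"]),
]
model = train(training)
print(predict(model, ["DET", "ADJ", "NOUN", "VERB"]))  # ['I', 'I', 'I', 'O']
```

        With this little training data almost every new context is unseen, which is why more hand-marked text helps so much; adding the actual words to the context is one way to extend it.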

        Then run the model on a separate set of hand-tagged data to see how well it does. If it does reasonably well (not too many misses or false positives), then you're ready to put it to use on real data.
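
        Counting misses and false positives on the held-out data can be sketched as follows, assuming each extracted chunk is represented as a (start, end) pair of token offsets and that only exact matches count as hits. The span representation and the function name are choices made for this example, not a fixed convention.

```python
from typing import Dict, Set, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def evaluate(gold: Set[Span], predicted: Set[Span]) -> Dict[str, float]:
    """Compare predicted chunks against hand-marked ones (exact match only)."""
    hits = len(gold & predicted)
    misses = len(gold - predicted)            # marked by hand, not found
    false_positives = len(predicted - gold)   # found, but not marked by hand
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return {
        "misses": misses,
        "false_positives": false_positives,
        "precision": precision,
        "recall": recall,
    }


gold = {(0, 3), (4, 6), (7, 8)}
predicted = {(0, 3), (4, 5)}
print(evaluate(gold, predicted))
```

        What counts as "reasonably well" depends on the application; exact-match scoring is strict, and a looser criterion (for example, overlapping spans) may be more appropriate for some uses.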