Yes, that is why I indicated similar to Lingua::EN::Tagger which definitely extracts "minimal" noun phrases.

If by "minimal noun phrases" you mean "single words that happen to be nouns", then yes, a tagger serves to extract those.

So how could I take advantage of this for my purpose?

This depends on what particular things you want to extract that go beyond just the individual words that get "noun" tags. Multi-noun referring expressions (e.g. "corner store", "Perl Hacker")? Phrases that include function words and/or adjectives? Arguments (subjects and/or objects) of verbs?
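To make the first of those cases concrete, here is a minimal sketch (in Python, purely for illustration) of pulling multi-noun expressions out of tagger output by grouping adjacent noun-tagged words. The `tagged` list and the Penn-style tags are assumptions made up for the example; a Dutch tagger would use its own tagset, and `noun_runs` is a hypothetical helper, not part of any tagger module.

```python
from itertools import groupby

# Hypothetical tagger output: (word, tag) pairs with Penn-style tags.
tagged = [("the", "DT"), ("corner", "NN"), ("store", "NN"),
          ("hired", "VBD"), ("a", "DT"), ("Perl", "NNP"), ("hacker", "NN")]

def noun_runs(tagged_tokens):
    """Return each maximal run of adjacent noun-tagged words as one string."""
    runs = []
    # groupby splits the sequence into alternating noun / non-noun stretches.
    for is_noun, group in groupby(tagged_tokens,
                                  key=lambda wt: wt[1].startswith("NN")):
        if is_noun:
            runs.append(" ".join(word for word, _ in group))
    return runs

print(noun_runs(tagged))
```

Anything beyond bare noun runs (adjectives, function words, verb arguments) needs more than this kind of tag matching.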

A typical approach is to start with some text that contains hand-marked examples of the things you want to extract, and then build a statistical model that assigns weights to the various contexts associated with those examples -- that is, to the various patterns of POS tags in and around each chunk to be extracted. Depending on the details of the project and the data, the models may need to include the actual words in the targets and/or contexts as well as the POS tags. (Of course, the more training data you have, the better.)
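As a rough illustration of "assigning weights to patterns of POS tags", the sketch below uses a deliberately crude relative-frequency estimate: each tag pattern is weighted by the fraction of its occurrences that were hand-marked as a chunk. It looks only at the tags inside a span, ignoring surrounding context and the words themselves, so it is a stand-in for a real statistical model rather than a recommendation. The data layout (a tag list plus a set of `(start, end)` spans per sentence) and the names `ngrams` and `train` are assumptions for the example.

```python
from collections import Counter

def ngrams(tags, max_len):
    """Yield (start, end, tag_tuple) for every span of up to max_len tags."""
    for start in range(len(tags)):
        for end in range(start + 1, min(start + max_len, len(tags)) + 1):
            yield start, end, tuple(tags[start:end])

def train(examples, max_len=3):
    """Weight each POS pattern by the share of its occurrences marked as a chunk."""
    seen, marked = Counter(), Counter()
    for tags, spans in examples:
        for start, end, pattern in ngrams(tags, max_len):
            seen[pattern] += 1
            if (start, end) in spans:
                marked[pattern] += 1
    return {pattern: marked[pattern] / seen[pattern] for pattern in seen}

# Hypothetical hand-marked data: POS tags per sentence plus chunk spans
# given as (start, end) with an exclusive end index.
examples = [
    (["DT", "NN", "NN", "VBD"], {(0, 3)}),
    (["DT", "NN", "VBD", "DT", "JJ", "NN"], {(0, 2), (3, 6)}),
]
weights = train(examples)
print(weights[("DT", "NN")], weights[("DT", "NN", "NN")])
```

With only two sentences the weights mean very little, which is the practical point behind wanting as much training data as possible.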

Then run the model on a separate set of hand-tagged data to see how well it does. If it does reasonably well (not too many misses or false positives), then you're ready to put it to use on real data.
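"Misses" and "false positives" are usually summarized as recall and precision. A minimal sketch of that check, assuming chunks are represented as `(start, end)` spans and that a predicted chunk counts only if it matches a hand-tagged one exactly (the spans below are invented for the example):

```python
def evaluate(predicted, gold):
    """Compare predicted and hand-tagged chunk spans; return (precision, recall)."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    # Precision drops with false positives; recall drops with misses.
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical held-out data: the model gets two of three chunks right
# and proposes one span, (3, 5), that is not in the hand-tagged set.
gold = {(0, 2), (3, 6), (8, 9)}
predicted = {(0, 2), (3, 5), (8, 9)}
precision, recall = evaluate(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

What counts as "reasonably well" depends on whether misses or false positives are more costly for your application.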


In reply to Re^3: Dutch Noun Phrases exctaction by graff
in thread Dutch Noun Phrases exctaction by vit
