A part-of-speech tagger, or anything similar to one, has very little to do with parsing syntactic structures to identify phrasal components. POS tagging is an essential first step toward parsing, but parsing is a very different (and much more difficult) process.

The only CPAN module related to human-language parsing appears to be Lingua::LinkParser, but the library it depends on has apparently not yet been extended to cover Dutch. (Extending it to Dutch would presumably be a fair amount of work.)

In any case, it would make sense to be as explicit as possible in working out what range of structures you want to include among the "noun phrases" you need to extract. For example, assuming you could form a sentence in Dutch that is equivalent to the following example, how many noun phrases would it contain, and what would they be?

Avoiding the improper use of technology for language analysis requires both engineering and linguistic expertise.

(Hint: there are at least two syntactic ambiguities in that sentence, affecting the quantity and/or structure of noun phrases. These might or might not be present in a Dutch version, depending on how you choose to translate it.)

This is why some people prefer to focus on narrower targets, like "named entities", or maybe "minimal" noun phrases that comprise only a limited range of POS sequences. In either case, having a POS tagger already in place is a big help.
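
To make the "minimal noun phrase" idea concrete, here is a rough sketch in Python of matching a fixed range of POS sequences over tagger output. Everything in it is an assumption for illustration: the coarse tag names (DET, ADJ, NOUN, VERB), the pattern "optional determiner, any number of adjectives, one or more nouns", and the hand-tagged Dutch tokens standing in for what a real tagger would produce. It is not a parser and will miss anything outside that pattern (prepositional attachments, coordination, and so on).

```python
import re

# Hypothetical coarse tag set; a real Dutch tagger will use its own labels.
TAG_CODES = {"DET": "D", "ADJ": "A", "NOUN": "N"}

# Assumed "minimal NP" pattern: optional determiner, adjectives, then nouns.
NP_PATTERN = re.compile(r"D?A*N+")


def minimal_noun_phrases(tagged):
    """Return the word sequences whose tags match DET? ADJ* NOUN+."""
    # Encode each tag as one character so regex offsets equal token indexes.
    # Any tag outside the pattern's alphabet becomes "x" and breaks a match.
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged)
    return [
        " ".join(word for word, _ in tagged[m.start():m.end()])
        for m in NP_PATTERN.finditer(codes)
    ]


# Hand-tagged input standing in for tagger output.
tagged = [
    ("de", "DET"), ("oude", "ADJ"), ("man", "NOUN"),
    ("leest", "VERB"),
    ("een", "DET"), ("boek", "NOUN"),
]
print(minimal_noun_phrases(tagged))
```

Widening or narrowing what counts as a noun phrase then comes down to editing the one pattern, which is exactly the decision that needs to be made explicit up front.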

In reply to Re: Dutch Noun Phrases exctaction by graff
in thread Dutch Noun Phrases exctaction by vit
