http://qs1969.pair.com?node_id=670004

Quicksilver has asked for the wisdom of the Perl Monks concerning the following question:

I'm thinking of writing a concordance that applies linguistic or grammatical rules to ignore items such as articles, pronouns and so on, in the hope of returning better results than a plain list of every word, its number of occurrences and its line positions in the text file.

I've taken a look at the Concordance Generator which Kurt Kincaid posted to Perlmonks as a starting point. Are there any specific Perl modules on CPAN that are worth a look, or am I best off defining the words (and variations) to be ignored in a return block and parsing against these?

Thanks in advance for any help.


Replies are listed 'Best First'.
Re: Trying to set up a concordance using linguistic rules
by jrtayloriv (Pilgrim) on Feb 25, 2008 at 15:36 UTC
Re: Trying to set up a concordance using linguistic rules
by apl (Monsignor) on Feb 25, 2008 at 15:01 UTC

      Given that WordNet explicitly excludes "determiners, prepositions, pronouns, conjunctions, and particles", I don't really see how a module providing an interface to this (otherwise excellent) resource might help the OP.

        The fact that WordNet only contains open-class words might be a benefit, actually. If it's not in WordNet, the OP probably doesn't want it in the concordance.

        ...except that that idealizes WN's coverage of English. You'd probably do well, though, with the added heuristic: "and it's a short, uncapitalized word" (i.e. "Short, uncapitalized words that aren't in WN should probably be ignored.").

        ...and it ignores the problem of homonyms. (e.g. 'in' has 7 senses in WordNet)
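The heuristic above can be sketched roughly as follows. This is a minimal illustration in Python (the logic ports directly to a Perl hash lookup), not a real WordNet interface: `LEXICON` is a tiny stand-in set, and `keep_word` and the `max_short` length threshold are hypothetical names and values chosen for the example.

```python
# Stand-in for a WordNet lookup; this tiny set is purely illustrative.
LEXICON = {"ring", "forge", "fire", "in", "dark"}

def keep_word(word, lexicon=LEXICON, max_short=4):
    """Drop short, uncapitalized words that the lexicon does not know.

    max_short is an assumed cutoff for "short"; tune it for your text.
    """
    in_lexicon = word.lower() in lexicon
    short_uncapitalized = len(word) <= max_short and word[:1].islower()
    # Keep anything the lexicon knows, plus anything long or capitalized.
    return in_lexicon or not short_uncapitalized

words = ["the", "ring", "Sam", "in", "whereupon"]
print([w for w in words if keep_word(w)])
```

Note that "in" survives the filter because the stand-in lexicon contains it, which is exactly the homonym problem mentioned above.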

        Nonetheless, I felt the need to "defend" WN (since someone's quoting from the FAQ I wrote).

        Better solutions have been mentioned elsewhere in the thread. But I'll also add that my approach to the problem would probably be more Information Retrieval oriented. I'd use KStem, a stemming algorithm whose output is (in the usual case) an actual word. It tries to correct some of the problems with the Porter stemmer. The paper describing it is pretty academic (but good). There's also downloadable Java code.

        Then I'd just use tf-idf style weights to pick out the interesting words.

        Seems easier than parsing, even if it's shallow parsing.
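As a rough illustration of the tf-idf idea, here is a minimal Python sketch. It assumes the text has been split into several "documents" (chapters or sections, say), uses the plain `tf * log(N / df)` weighting, and skips the stemming step entirely; `tokenize` and `tf_idf` are hypothetical helper names, not part of any library.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def tf_idf(docs):
    """Return one {word: weight} dict per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())
    weights = []
    for c in counts:
        total = sum(c.values())
        weights.append({
            w: (n / total) * math.log(n_docs / doc_freq[w])
            for w, n in c.items()
        })
    return weights

docs = [
    "the ring was forged in the fire",
    "the hobbit left the shire",
    "the wizard rode to the city",
]
weights = tf_idf(docs)
# Words found in every document (such as "the") get weight 0 and drop out.
print(sorted(weights[0], key=weights[0].get, reverse=True)[:3])
```

Function words tend to appear in every document, so their weight falls to zero without any grammatical knowledge, which is why this can be easier than parsing.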

Re: Trying to set up a concordance using linguistic rules
by dragonchild (Archbishop) on Feb 25, 2008 at 15:12 UTC
    Look at what Lingua::EN::Inflect does. Its results aren't what you're looking for, but its data structures might be. At the very least, a good start would be "Can I inflect this word or not?" Articles and pronouns can't be inflected in and of themselves. (Well, not completely true, but a good start.)

    Another solution would be to get your basic "word|number|where" going, then blacklist the various things you don't care about. You will eventually have to build the blacklist into the parser because some pronouns ("Joe") are also nouns ("joe"), but only context can tell the difference (usually).
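A minimal sketch of the "word|number|where" approach with a blacklist, written in Python for brevity (a Perl hash of arrays would do the same job). The `STOPWORDS` set is a deliberately tiny, hypothetical list, and `build_concordance` is an illustrative name, not an existing module.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny blacklist; a real one would be far longer.
STOPWORDS = {"a", "an", "the", "he", "she", "it", "they", "and", "of", "to", "in"}

def build_concordance(lines, stopwords=STOPWORDS):
    """Map each kept word to (count, [line numbers])."""
    positions = defaultdict(list)
    for lineno, line in enumerate(lines, start=1):
        for word in re.findall(r"[A-Za-z']+", line):
            key = word.lower()
            if key in stopwords:
                continue  # blacklisted: skip articles, pronouns, etc.
            positions[key].append(lineno)
    return {w: (len(p), p) for w, p in positions.items()}

text = ["The cat sat on the mat.", "A cat and a dog."]
for word, (count, where) in sorted(build_concordance(text).items()):
    print(f"{word}|{count}|{where}")
```

Lowercasing every token throws away capitalization, so this version cannot separate "Joe" from "joe"; handling that would need the context-aware parsing mentioned above.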


    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?

      In English, pronouns inflect more than other words, not less (he, his, him; who, whose, whom; etc.). Also, I don't quite see how "Joe" with a capital letter is a pronoun; names are proper nouns.
Re: Trying to set up a concordance using linguistic rules
by hsmyers (Canon) on Feb 26, 2008 at 16:35 UTC

    I'd suggest that you pick from two broad choices: WordNet::SenseRelate, or the less complicated approach of using an exception list (freely available, just Google). The learning curve on the first is steepish, and the results of the second are an '80%' solution but easy to put in place. A particularly good source of information is the journal 'Computers and the Humanities' (which may have stopped or changed names; your friendly reference librarian will know), found in most college libraries. It is almost entirely devoted to textual analysis, of which a concordance is almost a beginner's tool.

    Second only to chess software, this is my favorite subject. I've always wanted to generate one for the Lord of the Rings (and related) just for grins. It was also a good way to learn IBM 360 assembler ;)!

    --hsm

    "Never try to teach a pig to sing...it wastes your time and it annoys the pig."