Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Not strictly Perl, but interesting to text processors, I think: does anyone know of a good dictionary or thesaurus in easily machine-readable format? I want something whereby I can easily parse words down to their root word. E.g., in the sentence above we would end up with the following words:

Not strict Perl but interest to text process I think do anyone know of a good dictionary or thesaurus in easy machine read format

Replies are listed 'Best First'.
Re: Computer-readable thesaurus
by clemburg (Curate) on Oct 12, 2001 at 22:20 UTC
      Additional information: Dan Brian, the author of the above-mentioned TPJ article, has also founded the Linguana Project to produce an open-source natural language processing system based on WordNet, Link Parser, and other applications still in development.
      You will find a paper about Linguana at Dan Brian's project page.

      Hanamaki
Re: Computer-readable thesaurus
by MZSanford (Curate) on Oct 12, 2001 at 17:49 UTC
    Rather than a dictionary/thesaurus, might I suggest Lingua::EN::Infinitive ... or just the Lingua namespace in general.
    The requirements change because they don't know what they want, or how much they own you.
      Lingua::Stem is probably the module you want to try.

      Hanamaki
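
      To illustrate the idea, here is a minimal pure-Perl sketch of the kind of suffix stripping a stemmer performs. Lingua::Stem does this far more carefully (it implements the Porter algorithm); the suffix list and ordering below are a toy approximation, not the module's actual rules.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy stemmer: strip the first matching suffix, longest candidates
# first, keeping at least three characters of stem. A real stemmer
# (e.g. Lingua::Stem) applies context-sensitive rules instead.
sub naive_stem {
    my ($word) = @_;
    for my $suffix (qw(ingly edly ing ers ies ed er ly es s)) {
        if ($word =~ /^(.{3,})\Q$suffix\E$/) {
            return $1;
        }
    }
    return $word;
}

print join(' ', map { naive_stem(lc $_) } qw(strictly interesting processors)), "\n";
# prints: strict interest processor
```

      Note that a single pass only peels one layer of suffix ("processors" becomes "processor", not "process"), which is one reason the real modules are worth using.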
Re: Computer-readable thesaurus
by perrin (Chancellor) on Oct 12, 2001 at 18:11 UTC
    Jon Bjornstad's program to help his disabled friend read includes an interactive dictionary, which I think he got from Project Gutenberg. Take a look.
Re: Computer-readable thesaurus
by cheshirecat (Sexton) on Oct 13, 2001 at 01:16 UTC
    Hi,

    First post on perl monks, just signed up (great site)

    I think what you might be looking for is this:

    http://www.dcs.shef.ac.uk/research/ilash/Moby/

    It's in the public domain: word lists, a thesaurus, etc.

    Moby Hyphenator: 185,000 entries, fully hyphenated (mhyph.tar.Z, 980kB)
    Moby Language: word lists in five of the world's great languages (mlang.tar.Z, 2.3MB)
    Moby Part-of-Speech: 230,000 entries fully described by part(s) of speech, listed in priority order (mpos.tar.Z, 1.2MB)
    Moby Pronunciator: 175,000 entries fully International Phonetic Alphabet coded (mpron.tar.Z, 3.1MB)
    Moby Shakespeare: the complete unabridged works of Shakespeare (mshak.tar.Z, 2.3MB)
    Moby Thesaurus: 30,000 root words, 2.5 million synonyms and related words (mthes.tar.Z, 12MB)
    Moby Words: 610,000+ words and phrases, the largest word list in the world

    The Cheshire Cat (...is back)
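
    A quick sketch of loading the Moby Thesaurus into a Perl hash. As far as I recall each line of mthes is comma-separated with the root word first and its synonyms after it, but check the README in the tarball before relying on that; the here-doc below stands in for the real file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Inline sample standing in for the (large) mthes file. Assumed format:
# one entry per line, comma-separated, root word first.
my $sample = <<'END';
cold,chilly,cool,frigid,icy
hot,burning,fiery,scalding
END

my %thesaurus;
for my $line (split /\n/, $sample) {
    my ($root, @synonyms) = split /,/, $line;
    $thesaurus{$root} = \@synonyms;
}

print "cold: @{ $thesaurus{cold} }\n";
# prints: cold: chilly cool frigid icy
```

    For the real 12MB file you would read line by line with a filehandle rather than slurping, but the split logic is the same.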

Re: Computer-readable thesaurus
by mischief (Hermit) on Oct 13, 2001 at 19:30 UTC

    You might want to take a look at dict.org and the files on their ftp site. They have several databases available along with client and server software you can use for reference.

(tye)Re: Computer-readable thesaurus
by tye (Sage) on Oct 12, 2001 at 19:03 UTC
Re: Computer-readable thesaurus
by pjf (Curate) on Oct 12, 2001 at 18:51 UTC
    Most *nix systems come with a dictionary of words, commonly in /usr/dict/words or /usr/share/dict/words.

    The common spelling utility, ispell, also comes with its own dictionaries, although the format isn't quite as simple as that of /usr/dict/words. If you have ispell installed, then you might want to glance at /usr/lib/ispell or /usr/local/lib/ispell to see if you can spot them. (Look for .hash files).

    These obviously don't contain word definitions or roots, just the words themselves. However, you can often infer the root word using English spelling rules. Again, the ispell source code would probably be a useful start here, since it does exactly that.

    Cheers,
    Paul
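
    A sketch of that inference idea: strip a candidate suffix, but only accept the result if it appears in the word list, which filters out most bad guesses. The inline hash below stands in for /usr/share/dict/words so the example is self-contained, and the suffix rules are my own rough guesses, not ispell's actual affix tables.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for /usr/share/dict/words; in real use, read the file
# into a hash the same way.
my %dict = map { $_ => 1 } qw(strict interest process read easy);

sub infer_root {
    my ($word) = @_;
    # Special case: "-ily" often comes from a "-y" adjective (easily -> easy).
    if ($word =~ /^(.+)ily$/ && $dict{ $1 . 'y' }) {
        return $1 . 'y';
    }
    # Otherwise strip a suffix and keep the result only if it is a
    # known dictionary word.
    for my $suffix (qw(ingly ors ing ers ed er ly es s)) {
        if ($word =~ /^(.+)\Q$suffix\E$/ && $dict{$1}) {
            return $1;
        }
    }
    return $word;
}

print infer_root($_), "\n" for qw(strictly processors easily);
# prints: strict, process, easy (one per line)
```

    The dictionary check is what makes this workable: "processors" maps to "process" only because "process" is in the list, and unknown stems fall through untouched.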

Re: Computer-readable thesaurus
by Fletch (Bishop) on Oct 12, 2001 at 19:56 UTC

    For some value of `easy' you can always use LWP, HTML::TreeBuilder, and something like Merriam-Webster Online. Not Perl, but as a starting place for the URLs, here are some zsh functions I use:

    webster () {
        _gensearch $0 "http://www.m-w.com/cgi-bin/dictionary?va=" "$*"
    }
    thesaurus () {
        _gensearch $0 "http://www.m-w.com/cgi-bin/thesaurus?va=" "$*"
    }
      I'm pretty sure Merriam-Webster Online would not like people bypassing their ad revenue in this fashion.

      I know I pay for the bandwidth on my site, and spend some time blocking agents that steal my content like that.

      My advice: contact publishers before 'using' their work.

      Tiago

      Update:
      I'm sorry if this was read as flame bait, not my intention. I'm not sorry to have brought up copyright and terms of service. This is not a technical issue, but a moral and sometimes legal one, that developers should be aware of when making agents for the web.

      For example, using most Finance::Quote:: modules is against the terms of service of the sites that provide the data.

        I'm sure they'd also not like people to use lynx, which doesn't display ads. I'm sure they'd like people not to use junkbuster or other ad-blocking proxies. I'm sure they'd like people to mail them large envelopes full of cash.

        But they've put up their content on a publicly accessible web site. They're perfectly welcome (as are you) to implement whatever technological means to restrict access (of course most of those won't stop a truly determined person with the right know-how, but that's another issue :). But I see little reason to ask for permission to provide a URL which any webmonkey worth his bananas could deduce in under a minute with just a browser's `View Source' functionality. That URL does not magically give you any more access to their content than the form on their front page, just more convenient access.

        But this is getting off topic from the original question at hand.