Bod (Curate)
You make it sound so easy, choroba :)

Having done a few (simpler) things with language, I guess that finding the stem of each word is the trickiest part.

choroba (Archbishop)
    Well, I have a PhD in mathematical linguistics. Stemming was done in the first year ;-)

      Well, I have a PhD in mathematical linguistics

      Wow! - Genuinely impressive.

      Can I ask your opinion on Hemingway Editor?
      I use it extensively to produce content for our business marketing, blogs, etc. But I have started writing something that performs a similar task and is more tailored to our needs. For example, in marketing the ratio of first-person to second-person pronouns is (thought to be) important. My version makes extensive use of Lingua::EN::Fathom.

      My attempt is not very far developed and I'd love some informed input before I go much further.
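
For illustration, the pronoun ratio mentioned above could be sketched roughly like this. It uses core Perl only rather than Lingua::EN::Fathom, and the pronoun lists and the `pronoun_counts` name are assumptions made up for this example, not anything from the actual tool.

```perl
use strict;
use warnings;

# Hypothetical pronoun lists; extend them to suit your own style guide.
my %first  = map { $_ => 1 } qw(i me my mine myself we us our ours ourselves);
my %second = map { $_ => 1 } qw(you your yours yourself yourselves);

# Count first- and second-person pronouns in a block of text.
sub pronoun_counts {
    my ($text) = @_;
    my ($n1, $n2) = (0, 0);
    # Lower-case the text and pull out runs of letters as words.
    # Apostrophes split contractions, so "you're" counts as "you".
    for my $word (lc($text) =~ /[a-z]+/g) {
        $n1++ if $first{$word};
        $n2++ if $second{$word};
    }
    return ($n1, $n2);
}

my $copy = "We built this for you. Your team will love it, and you're in control.";
my ($n1, $n2) = pronoun_counts($copy);
printf "first: %d, second: %d, ratio: %s\n",
    $n1, $n2, $n2 ? sprintf('%.2f', $n1 / $n2) : 'n/a';
```

A plain word list like this knows nothing about context (quoted speech, "one" used as a pronoun, and so on), so treat the numbers as a rough signal only.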

        In fact, the idea is craftily clever. Their stemmer and parser can only stem and parse simple sentences, so if it can't process the sentence with a sufficiently high certainty, they flag it as too complex :-)

        I don't know what technology they use in the editor. Also, I quit academia almost ten years ago, so things might have moved a bit since I worked on similar stuff.

        But generally, English is one of the easier languages to process. Its morphology is simple (almost no declension, simple conjugation) and the training data for statistical methods are huge.

Re^3: How to count the vocabulary of an author?
LanX (Sage)
    There are look-up tables for that.

    And even if they didn't exist, you could derive most stems by statistical analysis, at least for the Indo-European languages I know *).

    Good enough for a word count.

    *) because they have, in most cases, a fixed stem. I suppose Finnish would be much harder...
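
A very rough sketch of the look-up-table idea, good enough only for a ballpark vocabulary count. It does not attempt the statistical derivation of stems; it swaps in a tiny table of irregular forms plus a few fixed English suffix rules. The table contents and the names `crude_stem` and `vocabulary_size` are hypothetical, invented for this example.

```perl
use strict;
use warnings;

# Hypothetical look-up table for irregular forms; a real one would be far larger.
my %irregular = (
    went => 'go',    gone => 'go',
    was  => 'be',    were => 'be',
    mice => 'mouse', better => 'good',
);

# Crude stemmer: consult the table first, then strip a few common suffixes.
sub crude_stem {
    my ($word) = @_;
    $word = lc $word;
    return $irregular{$word} if exists $irregular{$word};
    # Only strip when at least three letters of stem remain.
    $word =~ s/(?<=[a-z]{3})(?:ing|ed|es|s)$//;
    return $word;
}

# Count distinct stems in a text.
sub vocabulary_size {
    my ($text) = @_;
    my %seen;
    $seen{ crude_stem($_) }++ for $text =~ /[A-Za-z]+/g;
    return scalar keys %seen;
}

print vocabulary_size("He walks. She walked. They went walking and we go."), "\n";   # prints 7
```

Fixed suffix rules like these both over-stem and under-stem ("pass" becomes "pas", while "horses" and "horse" end up as different stems), which is why a proper table or a dedicated stemming module is the better choice for anything beyond a quick count.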

