Re: How to count the vocabulary of an author?

by choroba (Archbishop)
on Jun 11, 2021 at 11:14 UTC


in reply to How to count the vocabulary of an author?

You need a stemmer for the given language and a corpus of texts by the author. Read the texts, tokenise them into words, stem each word, and store the stem as a key in a hash. At the end, count the number of keys in the hash.
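
The steps above can be sketched roughly as follows. This is only an illustration, not a finished tool: `toy_stem` is a hypothetical stand-in that strips a few English suffixes, and `vocabulary_size` and the sample corpus are made up for the example. For real use, a proper stemmer for the author's language (for instance a CPAN module such as Lingua::Stem::Snowball) would take the place of `toy_stem`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for a real stemmer: it only strips a few common
# English suffixes. Replace it with a proper stemmer for the language.
sub toy_stem {
    my ($word) = @_;
    $word =~ s/(?:ing|ed|es|s)$// if length($word) > 3;
    return $word;
}

# Count the distinct stems found in a list of texts.
sub vocabulary_size {
    my @texts = @_;
    my %stems;
    for my $text (@texts) {
        # Tokenise: lowercase the text, then take runs of letters/apostrophes.
        for my $word (lc($text) =~ /[[:alpha:]']+/g) {
            # Store the stem as a hash key; duplicates collapse automatically.
            $stems{ toy_stem($word) } = 1;
        }
    }
    # The number of keys is the number of distinct stems.
    return scalar keys %stems;
}

# Made-up sample corpus, for illustration only.
my @corpus = (
    'The cat walked. The cats are walking.',
    'A dog walks.',
);
print vocabulary_size(@corpus), "\n";
```

The hash does the deduplication, so the corpus can be read one file at a time without keeping the texts in memory.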

map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

Re^2: How to count the vocabulary of an author?
by Bod (Curate) on Jun 12, 2021 at 18:29 UTC

    You make it sound so easy, choroba :)

    Having done a few (simpler) things with language, I guess that finding the stem of each word is the trickiest part.

      Well, I have a PhD in mathematical linguistics. Stemming was done in the first year ;-)

      map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
        Well, I have a PhD in mathematical linguistics

        Wow! - Genuinely impressive.

        Can I ask your opinion on Hemingway Editor?
        I use it extensively in producing content for our business marketing, blogs, etc. However, I have started writing something that performs a similar task but is more tailored to our needs. For example, in marketing the ratio of first person to second person pronouns is (thought to be) important. My version makes extensive use of Lingua::EN::Fathom.

        My attempt is not very far developed and I'd love some informed input before I go much further.
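
As a rough illustration of the pronoun ratio mentioned above, the counting could look something like this. The pronoun lists, the `pronoun_counts` name and the sample sentence are assumptions made for the example, and the sketch uses plain Perl rather than Lingua::EN::Fathom.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed pronoun lists; adjust them to whatever the analysis should count.
my %first  = map { $_ => 1 } qw(i me my mine myself we us our ours ourselves);
my %second = map { $_ => 1 } qw(you your yours yourself yourselves);

# Return the number of first person and second person pronouns in a text.
sub pronoun_counts {
    my ($text) = @_;
    my ($n1, $n2) = (0, 0);
    # Tokenise on runs of letters after lowercasing.
    for my $word (lc($text) =~ /[[:alpha:]]+/g) {
        $n1++ if $first{$word};
        $n2++ if $second{$word};
    }
    return ($n1, $n2);
}

# Made-up sample sentence, for illustration only.
my ($n1, $n2) = pronoun_counts('We think you will love our product. You deserve it.');
printf "first=%d second=%d\n", $n1, $n2;
# Guard against division by zero when no second person pronoun occurs.
printf "ratio=%.2f\n", $n1 / $n2 if $n2;
```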

      There are look-up tables for that.

      And even if they didn't exist, you could derive most stems by statistical analysis, at least for the Indo-European languages I know.

      Good enough for a word count.

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

      ) because they have in most cases a fixed stem. I suppose Finnish to be much harder...
