http://qs1969.pair.com?node_id=11133778


in reply to How to count the vocabulary of an author?

It's science. There is a scientific paper about word stemming in the German language. After looking at the available software, the authors developed their own stemmer, which is on GitHub. It looks like it supports different programming languages, including Perl.
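To make the idea of counting by stem concrete, here is a minimal sketch. It is not the stemmer from the paper: `toy_stem` is a hypothetical stand-in that only strips a handful of common German endings, and `count_vocabulary` is likewise an invented helper name. It just shows why the number of distinct stems can be much smaller than the number of distinct surface forms.

```perl
use strict;
use warnings;

# Toy stemmer (hypothetical, NOT the stemmer from the paper):
# lowercases the word and strips one common German ending,
# but only if at least four letters remain in front of it.
sub toy_stem {
    my ($word) = @_;
    $word = lc $word;
    $word =~ s/(?<=\w{4})(?:en|er|es|e|n|s)$//;
    return $word;
}

# Returns (number of distinct surface forms, number of distinct stems).
sub count_vocabulary {
    my ($text) = @_;
    my (%forms, %stems);
    for my $token ($text =~ /(\p{L}+)/g) {
        $forms{ lc $token }++;          # every spelling counts separately
        $stems{ toy_stem($token) }++;   # inflected forms collapse together
    }
    return (scalar keys %forms, scalar keys %stems);
}

my ($forms, $stems) = count_vocabulary(
    "Das Kind, die Kinder, des Kindes; ein Haus, des Hauses."
);
print "$forms distinct surface forms, $stems distinct stems\n";
```

A real stemmer handles umlauts, compounds and irregular forms, which this sketch ignores entirely. For real text the input would also need to be decoded as UTF-8 first.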

BTW, i know nothing about language analysis. So i asked Uncle Google and Aunt Bing.

Edit: If you include the huge vocabulary of curse words he must have known (due to him being a German and losing two world wars), i'm pretty sure he knew a lot more than 500 words.

perl -e 'use Crypt::Digest::SHA256 qw[sha256_hex]; print substr(sha256_hex("the Answer To Life, The Universe And Everything"), 6, 2), "\n";'

Re^2: How to count the vocabulary of an author?
by marto (Cardinal) on Jun 11, 2021 at 11:36 UTC

      Yeah, of course. How can i forget?

      Really, it hasn't been the same since my butler Jeeves retired and grandma Yahoo went to prison for financial fraud.


        Interesting to look back at some of the 'big names' of the past, and in some cases what they're up to now :P

Re^2: How to count the vocabulary of an author?
by cavac (Parson) on Jun 11, 2021 at 15:36 UTC

    This is sort of an amendment, because it struck me that word counting is much harder than it looks. Let's take a look at the text "little boy" in three different contexts:

    First: Let's consider a 10-year-old boy living in Rhode Island. He knows the meaning of the words "little" and "boy", and he has heard in school of a bomb named "Little Boy". It made the U.S. win some war a long time ago, and every year there is a celebration. So, three words?

    Second: A 10-year-old British girl. She certainly knows the words "little" and "boy", but she has never heard of the things that happened in Japan in 1945. Those things are taught at a later age. So, two words?

    Third: A 10-year-old girl in Japan. She doesn't know a single word of English. Neither "little" nor "boy" has any meaning to her. But every time she walks to school, she walks past the Genbaku Dome. She asked her parents about it, and now she knows that an awful and terrifying machine named "Little Boy" killed her great-grandparents and destroyed her city. So, one word?
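The three cases can be sketched in code as a counter that treats known multi-word names as one vocabulary entry. This is only an illustration under stated assumptions: `count_entries` and the name list are hypothetical, and the match is case-sensitive on purpose so that "Little Boy" (the bomb) is kept apart from "little boy" (the child).

```perl
use strict;
use warnings;

# Counts vocabulary entries, treating each known multi-word name as ONE entry.
# Which names a person "knows" is an input; the list is an assumption.
sub count_entries {
    my ($text, @known_names) = @_;
    my %vocab;
    for my $name (@known_names) {
        # Case-sensitive: remove each occurrence of the name and count it
        $vocab{$name}++ while $text =~ s/\b\Q$name\E\b/ /;
    }
    # Everything that is left is counted as ordinary lowercase words
    $vocab{ lc $_ }++ for $text =~ /([A-Za-z']+)/g;
    return \%vocab;
}

my $text = "The little boy read about Little Boy.";

my $with_name    = count_entries($text, 'Little Boy');
my $without_name = count_entries($text);

printf "knows the name: %d entries\n", scalar keys %$with_name;
printf "does not:       %d entries\n", scalar keys %$without_name;
```

The same sentence yields a different count depending on which names the reader is assumed to know, which is exactly the problem: the count is a property of the reader, not only of the text.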

      "because it struck me that word counting is much harder than it looks"

      There is another dimension to what constitutes a word and how words can be differentiated programmatically. Differentiating them requires context, which takes us to a totally different level of complexity.

      Let's take the seemingly simple word "post":

      When you read the word you might think of what someone puts on social media, or perhaps the mail that gets delivered to your door. But equally you may imagine the pole in the ground that keeps your fence upright. Perhaps you have been given a new post at work as your role has changed due to the company being able to post a good profit. Of course, at one time we didn't need to worry - but post the advent of computing, we do!
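A naive frequency count makes the problem visible. The sketch below (the sample sentence and the helper name `word_freq` are made up for illustration) splits on letters and lowercases, which is about what a first attempt at counting vocabulary would do. It has no notion of context, so every sense of "post" lands in the same bucket.

```perl
use strict;
use warnings;

# Naive word frequency: lowercase every run of ASCII letters.
# No context, no part-of-speech, no sense disambiguation.
sub word_freq {
    my ($text) = @_;
    my %freq;
    $freq{ lc $_ }++ for $text =~ /([A-Za-z]+)/g;
    return \%freq;
}

my $freq = word_freq(
    'I read your post, then the post arrived, and I leaned on the fence post.'
);

# Three different meanings, but only one vocabulary entry
printf "post: %d occurrences in %d distinct entries\n",
    $freq->{post}, scalar keys %$freq;
```

Telling the senses apart would need something that looks at the surrounding words, for example part-of-speech tagging or word-sense disambiguation, and that is well beyond a regex and a hash.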

Re^2: How to count the vocabulary of an author?
by Anonymous Monk on Jun 11, 2021 at 13:49 UTC
    "Edit: I'm pretty sure if you include the huge vocabulary of curse words he must have known (due to him being a german and loosing two world wars), i'm pretty sure there are a lot more than 500 words he knew."

    That's a very stupid comment.

      Why? Those "official" counts only count the written word. We often use a different vocabulary when communicating verbally. Some authors intentionally limit their written vocabulary to make their works accessible to a broader public.

      Cursing and other emotional expressions can often emphasize specific meanings of things said. The written equivalent in modern internet terms would be emoticons. As far as the Unicode Consortium is concerned, these count as written expressions that are meaningful to the context of the discussion.

      It also depends on the cultural context and the circumstances as to when and what forms of cursing are acceptable or sometimes even required in a conversation. English speakers, and especially people in the United States, are much more prudish compared to some other cultures. There are many groups out there that look at you as an outsider if you don't use a very frank and curseword-ridden way to talk to them. So, if you want to be accepted as an equal (for example, because you need them on your project), you had better learn their way of communicating - their shorthands, their curses, what have you.

      It's a bit like driving a vehicle. You have areas in the world where everything is very regulated and everyone keeps to the rules. Then you have the seemingly chaotic i-honked-first-so-i-go-first way it works in other areas of the world. If you go there and rent a car, you had better learn their ways and honk that horn.

      It's the same way with politicians. They may have voters from different cultural regions and contexts. So politicians had better learn all those different ways their voters communicate, and adapt when visiting a region. "I am one of you" is a big vote seller. But this might not be reflected in a politician's writing and official speeches.

      As for Adenauer, a lot of his potential voters were soldiers and people from many different cultural groups. I'm pretty sure he took the time to adapt his vocabulary when meeting with local groups. But as i said, this might not be reflected in his publications, as they were written for a much broader public.

      Edit: Also take time to watch the Tom Scott video "What counts as a word?"

        Would be interesting to know how many new cursewords were introduced in areas like the Philippines and Cuba after being "liberated" by the US ...