Ok, I'll play 'nasty little boy' too (I remember!)

Of course I had to try the stemming that is built into PostgreSQL's full-text search (FTS). I hadn't used it for a while, so this is just playing with it. Below are the results of stemming, split into words and stop-words.

I think this FTS stuff uses Snowball, and I don't know how recent the vocabulary is. (UPDATE: I see regular Snowball-related updates, every few months, in the PostgreSQL git log, so I now think its Snowball stuff is reasonably up to date.)

-- Below are three chunks/resultsets:
--   1. Your text
--   2. Real words:
--        select .. from ts_debug('german', '$yourtxt')
--        where lexemes > 0
--   3. Stop-words:
--        select .. from ts_debug('german', '$yourtxt')
--        where lexemes = 0

                     txt
----------------------------------------------
 Ich Bin Der Geist, Der Stets Verneint!      +
 Und Das Mit Recht; denn alles, was entsteht,+
 Ist wert, daß es zugrunde geht;             +
 Drum besser wär's, daß nichts entstünde.    +
 So ist denn alles, was ihr Sünde,           +
 Zerstörung, kurz, das Böse nennt,           +
 Mein eigentliches Element.
(1 row)

   alias   |    token     | dictionary  |  lexemes
-----------+--------------+-------------+------------
 asciiword | Geist        | german_stem | {geist}
 asciiword | Stets        | german_stem | {stet}
 asciiword | Verneint     | german_stem | {verneint}
 asciiword | Recht        | german_stem | {recht}
 asciiword | entsteht     | german_stem | {entsteht}
 asciiword | wert         | german_stem | {wert}
 asciiword | zugrunde     | german_stem | {zugrund}
 asciiword | geht         | german_stem | {geht}
 asciiword | Drum         | german_stem | {drum}
 asciiword | besser       | german_stem | {bess}
 word      | wär          | german_stem | {war}
 asciiword | s            | german_stem | {s}
 word      | entstünde    | german_stem | {entstund}
 word      | Sünde        | german_stem | {sund}
 word      | Zerstörung   | german_stem | {zerstor}
 asciiword | kurz         | german_stem | {kurz}
 word      | Böse         | german_stem | {bos}
 asciiword | nennt        | german_stem | {nennt}
 asciiword | eigentliches | german_stem | {eigent}
 asciiword | Element      | german_stem | {element}
(20 rows)

   alias   | token  | dictionary  | lexemes
-----------+--------+-------------+---------
 asciiword | Ich    | german_stem | {}
 asciiword | Bin    | german_stem | {}
 asciiword | Der    | german_stem | {}
 asciiword | Der    | german_stem | {}
 asciiword | Und    | german_stem | {}
 asciiword | Das    | german_stem | {}
 asciiword | Mit    | german_stem | {}
 asciiword | denn   | german_stem | {}
 asciiword | alles  | german_stem | {}
 asciiword | was    | german_stem | {}
 asciiword | Ist    | german_stem | {}
 word      | daß    | german_stem | {}
 asciiword | es     | german_stem | {}
 word      | daß    | german_stem | {}
 asciiword | nichts | german_stem | {}
 asciiword | So     | german_stem | {}
 asciiword | ist    | german_stem | {}
 asciiword | denn   | german_stem | {}
 asciiword | alles  | german_stem | {}
 asciiword | was    | german_stem | {}
 asciiword | ihr    | german_stem | {}
 asciiword | das    | german_stem | {}
 asciiword | Mein   | german_stem | {}
(23 rows)

Not perfect, but more useful than I thought it would be without any work.
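
For anyone who wants to try this, here is a minimal sketch of how the two abbreviated queries could be spelled out. The select list is elided in the comments above, so the four columns chosen here are an assumption based on the column headers of the output. The `lexemes > 0` shorthand is written as a test on the array's size with `cardinality()`, which needs PostgreSQL 9.4 or later. The sample text is just the first line of the quote.

```sql
-- Sketch only: the column list and the cardinality() filter are assumptions,
-- not necessarily the exact queries behind the output above.

-- 2. Real words: the dictionary returned at least one lexeme.
select alias, token, dictionary, lexemes
from ts_debug('german', 'Ich Bin Der Geist, Der Stets Verneint!')
where cardinality(lexemes) > 0;

-- 3. Stop-words: the dictionary recognised the token but returned
--    an empty lexeme array.
select alias, token, dictionary, lexemes
from ts_debug('german', 'Ich Bin Der Geist, Der Stets Verneint!')
where cardinality(lexemes) = 0;
```

Tokens that no dictionary handles (spaces and punctuation) come back from ts_debug with lexemes set to NULL, so cardinality() is NULL for them and neither query returns them.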


In reply to Re^2: How to count the vocabulary of an author? by erix
in thread How to count the vocabulary of an author? by karlgoethebier
