PerlMonks  


Ok, I'll play 'nasty little boy' too (I remember!)

Of course I had to try the stemming that is built into PostgreSQL's full-text search (FTS). I hadn't used it for a while, so this is just playing with it. Below are the stemming results, split into real words and stop-words.

I think this FTS stuff uses Snowball, and I don't know how recent the vocabulary is. (UPDATE: I see regular Snowball-related updates (every few months) in the PostgreSQL git log, so I now think its Snowball stuff is reasonably up to date.)

-- Below are three chunks/resultsets:
--   1. Your text
--   2. Real words:
--        select .. from ts_debug('german', '$yourtxt')
--        where lexemes > 0
--   3. Stop-words:
--        select .. from ts_debug('german', '$yourtxt')
--        where lexemes = 0

                     txt
----------------------------------------------
 Ich Bin Der Geist, Der Stets Verneint!      +
 Und Das Mit Recht; denn alles, was entsteht,+
 Ist wert, daß es zugrunde geht;             +
 Drum besser wär's, daß nichts entstünde.    +
 So ist denn alles, was ihr Sünde,           +
 Zerstörung, kurz, das Böse nennt,           +
 Mein eigentliches Element.
(1 row)

   alias   |    token     | dictionary  |  lexemes
-----------+--------------+-------------+------------
 asciiword | Geist        | german_stem | {geist}
 asciiword | Stets        | german_stem | {stet}
 asciiword | Verneint     | german_stem | {verneint}
 asciiword | Recht        | german_stem | {recht}
 asciiword | entsteht     | german_stem | {entsteht}
 asciiword | wert         | german_stem | {wert}
 asciiword | zugrunde     | german_stem | {zugrund}
 asciiword | geht         | german_stem | {geht}
 asciiword | Drum         | german_stem | {drum}
 asciiword | besser       | german_stem | {bess}
 word      | wär          | german_stem | {war}
 asciiword | s            | german_stem | {s}
 word      | entstünde    | german_stem | {entstund}
 word      | Sünde        | german_stem | {sund}
 word      | Zerstörung   | german_stem | {zerstor}
 asciiword | kurz         | german_stem | {kurz}
 word      | Böse         | german_stem | {bos}
 asciiword | nennt        | german_stem | {nennt}
 asciiword | eigentliches | german_stem | {eigent}
 asciiword | Element      | german_stem | {element}
(20 rows)

   alias   | token  | dictionary  | lexemes
-----------+--------+-------------+---------
 asciiword | Ich    | german_stem | {}
 asciiword | Bin    | german_stem | {}
 asciiword | Der    | german_stem | {}
 asciiword | Der    | german_stem | {}
 asciiword | Und    | german_stem | {}
 asciiword | Das    | german_stem | {}
 asciiword | Mit    | german_stem | {}
 asciiword | denn   | german_stem | {}
 asciiword | alles  | german_stem | {}
 asciiword | was    | german_stem | {}
 asciiword | Ist    | german_stem | {}
 word      | daß    | german_stem | {}
 asciiword | es     | german_stem | {}
 word      | daß    | german_stem | {}
 asciiword | nichts | german_stem | {}
 asciiword | So     | german_stem | {}
 asciiword | ist    | german_stem | {}
 asciiword | denn   | german_stem | {}
 asciiword | alles  | german_stem | {}
 asciiword | was    | german_stem | {}
 asciiword | ihr    | german_stem | {}
 asciiword | das    | german_stem | {}
 asciiword | Mein   | german_stem | {}
(23 rows)
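
For anyone who wants to reproduce this: the `select ..` statements in the comments above are abbreviated, and `lexemes > 0` is shorthand rather than valid SQL. A minimal sketch of how the two queries could be written out, assuming the column list matches the output shown and using `cardinality()` to test for an empty lexeme array (my assumption, not necessarily what was actually run). The sample text is shortened to the first line.

```sql
-- Real words: tokens for which the dictionary returned at least one lexeme.
SELECT alias, token, dictionary, lexemes
FROM ts_debug('german', 'Ich Bin Der Geist, Der Stets Verneint!')
WHERE cardinality(lexemes) > 0;

-- Stop-words: tokens recognized by the dictionary but reduced to an
-- empty lexeme array.
SELECT alias, token, dictionary, lexemes
FROM ts_debug('german', 'Ich Bin Der Geist, Der Stets Verneint!')
WHERE cardinality(lexemes) = 0;
```

Blank and punctuation tokens are not handled by any dictionary, so their `lexemes` value is NULL and both filters skip them, which would explain why only words show up in either result set.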

Not perfect but more useful than I thought it would be without any work.


In reply to Re^2: How to count the vocabulary of an author? by erix
in thread How to count the vocabulary of an author? by karlgoethebier
