note
karlgoethebier
<p><b>Update 2:</b></p>
<p>Here is a first simple solution, in case anybody needs a starting point:</p>
<c>
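#!/usr/bin/env perl
use strict;
use warnings;
use feature qw(say);
use Data::Dump;
use Lingua::Stem qw(stem);
# Slurp mode: read everything after __DATA__ in one go
undef $/;
my $text = <DATA>;
say $text;
# Normalize: lowercase, then turn line breaks into single spaces
$text = lc $text;
$text =~ s/\n+/ /g;
say $text;
# Strip punctuation (this also removes the apostrophe in "wär's")
$text =~ s/[:;'!?.,]+//g;
say $text;
# split ' ' skips leading whitespace and treats runs of whitespace as one separator
my @words = split ' ', $text;
dd \@words;
# Use the German stemming rules
Lingua::Stem::set_locale('de');
say Lingua::Stem::get_locale;
# stem() returns a reference to the list of stems
my $stems = stem(@words);
dd $stems;
# Hash keys are unique, so this collapses repeated stems
my %vocabulary = map {$_ => 1} @$stems;
dd \%vocabulary;
# Number of distinct stems = size of the vocabulary
say scalar keys %vocabulary;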
__DATA__
Ich Bin Der Geist, Der Stets Verneint!
Und Das Mit Recht; denn alles, was entsteht,
Ist wert, daß es zugrunde geht;
Drum besser wär's, daß nichts entstünde.
So ist denn alles, was ihr Sünde,
Zerstörung, kurz, das Böse nennt,
Mein eigentliches Element.
</c>
<p>It isn't as easy as one might think: simply counting the words with [man://wc] doesn't give you the vocabulary. And [mod://Lingua::Stem] thinks that <em>Ist</em> and <em>ist</em> are different stems, for example. And how do you filter the real text out of sources that contain a preface, an index and so on? There is more of that kind.</p>
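<p>To illustrate the <em>Ist</em>/<em>ist</em> point and the difference to a plain word count, here is a minimal sketch that uses core Perl only (no stemming). The helper name <c>vocabulary_size</c> is made up for this example, and the sample string is just a shortened piece of the text above. It assumes Perl 5.16 or later for <c>fc</c>. Note that <c>fc</c> also folds <em>ß</em> to <em>ss</em>, which may or may not be what you want for German.</p>
<c>
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;                      # the string literals below contain UTF-8
use feature qw(say fc);

# Hypothetical helper: number of distinct word forms, ignoring case.
sub vocabulary_size {
    my ($text) = @_;
    my %seen;
    # \p{L}+ matches runs of letters, so punctuation never ends up in a key;
    # fc() case-folds each word, so "Ist" and "ist" share one key
    $seen{ fc $1 }++ while $text =~ /(\p{L}+)/g;
    return scalar keys %seen;
}

# Ten words in the sample, but "Ist" and "ist" count only once
say vocabulary_size("Ist wert, daß es zugrunde geht; so ist denn alles");   # prints 9
</c>
<p>Folding the case before the words go into the hash (or into the stemmer) is probably the cheapest way to get rid of that kind of duplicate. It does nothing about inflected forms, that is still the job of the stemmer.</p>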
<p>Some may ask why I waste my time on this issue. It has to do with politics. As this isn't a forum about politics, I'll skip the details.</p>
<p>I was a little bit inspired by what [href://https://en.m.wikipedia.org/wiki/Jill_Lepore| Jill Lepore] wrote, in essence, about facts in her splendid book [href://https://www.amazon.de/These-Truths-History-United-States/dp/0393635244 |These Truths: A History of the United States]: "Show me yours and I'll show you mine." Basically the same game that we played with our cousins when we were nasty little boys. Discussion later.</p>
<!-- Node text goes above. Div tags should contain sig only -->
<div class="pmsig"><div class="pmsig-1001958">
<p>«The Crux of the Biscuit is the Apostrophe»</p>
</div></div>
11133775
11133775