
> Possible attempt to separate words with commas at /home/choroba/1.pl line 17.

The warning is right: qw splits on whitespace only, so the commas become part of the words. Remove the commas from the qw list and "что," will become "что".

($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

Re^5: Problem getting Russian stopwords
by Aldebaran (Curate) on Sep 18, 2018 at 21:12 UTC

    Even so, this fails to remove the word. It occurs to me that the Russian list would have to be a lot longer, since Russian words change form with case and gender. For example, "одно от другого" means "one from the other". All of its constituent words are in the list in the nominative masculine form, but the neuter одно is not, and neither is the genitive другого.

    $ ./3.stopwords.pl 
    Боже даруй мне душевный покой
    Принять то что я не в силах изменить
    Мужество изменить то что могу
    И мудрость отличить одно от другого
    $ cat 3.stopwords.pl 
    #!/usr/bin/perl -w
    use 5.011;
    use utf8;
    binmode STDOUT, ":encoding(UTF-8)";
    use Lingua::StopWords qw( getStopWords );
    my $stopwords = getStopWords('ru');
    use Encode;
    
    # say join "|", map decode("KOI8-R", $_), keys %$stopwords;
    # say $/;
    
    my $sentence = "Боже, даруй мне душевный покой
    Принять то, что я не в силах изменить,
    Мужество изменить то, что могу,
    И мудрость отличить одно от другого.";
    
    $sentence =~ s/,//g;
    $sentence =~ s/\.//g;
    
    my @words = split / /, $sentence;
     
    say join ' ', grep { !$stopwords->{$_} } @words;
    __END__ 
    
    $ 
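
A minimal sketch of one likely cause, using only the core Encode module. It assumes the Russian stoplist comes back as KOI8-R bytes (which the commented-out decode("KOI8-R", ...) line above suggests), while the sentence is a decoded character string because of use utf8, so no hash lookup can ever match. The word list and the remove_stopwords helper are hypothetical stand-ins, not part of Lingua::StopWords. The sketch also pulls out words with \w+ rather than split / /, because splitting on a single space leaves "покой\nПринять" glued together as one word.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                       # string literals below are character strings
use Encode qw(encode decode);

# Hypothetical sample; stands in for a stoplist stored as KOI8-R bytes.
my @sample = qw(то что я не в от мне и);
my %koi8_stopwords;
$koi8_stopwords{ encode('KOI8-R', $_) } = 1 for @sample;

# Decode every key so the hash holds character strings, like the sentence.
my %stopwords;
$stopwords{ decode('KOI8-R', $_) } = 1 for keys %koi8_stopwords;

# Extract words with \w+ (drops punctuation and handles newlines),
# then keep only those whose lowercase form is not in the stoplist.
sub remove_stopwords {
    my ($text, $stop) = @_;
    my @words = $text =~ /(\w+)/g;
    return join ' ', grep { !$stop->{ lc $_ } } @words;
}

my $sentence = "Принять то, что я не в силах изменить,\nМужество изменить то, что могу";

binmode STDOUT, ':encoding(UTF-8)';
print remove_stopwords($sentence, \%koi8_stopwords), "\n";  # byte keys: nothing matches
print remove_stopwords($sentence, \%stopwords), "\n";       # decoded keys: stopwords removed
```

If I read the Lingua::StopWords documentation correctly, getStopWords('ru', 'UTF-8') returns an already decoded list and would make the manual decode step unnecessary; treat that as something to verify rather than a given. Either way, this only fixes the encoding mismatch. Inflected forms such as одно and другого would still pass through unless the list is extended or the words are stemmed first.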
    

    When I compare the module for German, which allegedly works, with the one for Russian, which seems not to, I notice that both contain 2 lists. In the German module, I can read the special characters (eszett and umlauts) in the first list, while the second is all diamonds with a question mark in the middle. In the Russian module, I can read 0 characters in the first list, and the second list is 100% diamonds with question marks in the middle. I have to wonder whether garden-variety Cyrillic in the first list is what it needs. Abridged listing of the modules: