Problem getting Russian stopwords

cormanaz has asked for the wisdom of the Perl Monks concerning the following question:

Good day, fellow Monks. I am trying to use Lingua::StopWords to get a set of Russian stopwords, joined as a string that I can use in a regexp. The problem is that when I run this code:

use Lingua::StopWords qw( getStopWords );
my $rustop = getrustop();
print $rustop;
sub getrustop {
    my $stopwords = getStopWords('ru');
    return join("|",keys %$stopwords);
}

the string returned is not Cyrillic but something else (part of the string):

дто|разве|может|больше|чтоб|только|есть|конечно|можно|ли|меня|ним|два|совсем|ее|были|сказал|если|будет|еще|по|со|об

What am I doing wrong?


Re: Problem getting Russian stopwords by Your Mother (Archbishop) on Sep 18, 2018 at 17:22 UTC
From the Lingua::StopWords docs (not a helpful default approach in my view, but it's an old package so…):

    getStopWords() expects 1-2 arguments. The first, which is required, is an ISO code representing a supported language. If the ISO code cannot be found, getStopWords returns undef. The second argument should be 'UTF-8' if you want the stopwords encoded in UTF-8. The UTF-8 flag will be turned on, so make sure you understand all the implications of that.

However, it's broken for Russian. :| Russian is returned in KOI8-R by default though, so you can use that. This approach should get you by:

use Lingua::StopWords qw( getStopWords );
use Encode;
binmode STDOUT, ":encoding(UTF-8)";
my $list = getStopWords("ru");
print join "|", map decode("KOI8-R", $_), keys %$list;
print $/;
__END__
нее|вы|них|такой|про|а|чего|его|над|надо|он|всегда|человек|нельзя|тем|тоже|мне…
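Since the original goal was a string usable in a regexp, the last step might be sketched as below. This is only an illustration under stated assumptions: to stay runnable without the module, %koi8_stopwords is a hand-built stand-in for what getStopWords("ru") returns (keys as KOI8-R byte strings), and the names $alternation, $stop_re and @hits are made up for the example.

```perl
use strict;
use warnings;
use utf8;                          # this source file contains UTF-8 literals
use Encode qw( encode decode );

# Stand-in for getStopWords("ru") (assumption): keys are KOI8-R byte strings.
my %koi8_stopwords = map { ( encode("KOI8-R", $_) => 1 ) } qw( что то не и в );

# Decode every key into a Perl character string before using it.
my @stopwords = map { decode("KOI8-R", $_) } keys %koi8_stopwords;

# Longest words first; quotemeta guards against regex metacharacters.
my $alternation = join "|",
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) || $a cmp $b } @stopwords;

# Lookarounds keep "и" from matching inside a longer word such as "силах".
my $stop_re = qr/(?<!\w)(?:$alternation)(?!\w)/i;

my $text = "Принять то, что я не в силах изменить";
my @hits = $text =~ /($stop_re)/g;

binmode STDOUT, ":encoding(UTF-8)";
print join(" ", @hits), "\n";
```

The pattern is built from decoded character strings, so it only matches text that has itself been decoded (for example read through an :encoding(UTF-8) layer or written as a literal under use utf8).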
Re^2: Problem getting Russian stopwords by cormanaz (Deacon) on Sep 18, 2018 at 17:28 UTC
That did it. Many thanks!
Re^3: Problem getting Russian stopwords by Aldebaran (Curate) on Sep 18, 2018 at 19:26 UTC
Can I bother you to post your solution? I'm not quite there yet:

$ ./2.stopwords.pl
Possible attempt to separate words with commas at ./2.stopwords.pl line 15.
два|тебя|даже|всегда|из|он|под|этот|человек|опять|там|ж|после|более|от|вы|ней|не|может|хорошо|и|ей|какая|разве|ты|свою|этом|больше|были|было|почти|что|я|со|другой|моя|какой|всю|при|него|сейчас|если|уже|эту|но|нибудь|впрочем|куда|для|зачем|много|конечно|был|в|три|когда|потому|по|у|этого|уж|мой|того|совсем|или|еще|вот|ним|перед|себе|можно|а|сказал|чтобы|всех|наконец|лучше|ведь|ни|за|тот|бы|тоже|к|до|говорил|надо|жизнь|над|вас|сегодня|они|ли|через|она|все|будет|так|чтоб|ничего|с|во|эти|где|этой|хоть|сказала|один|потом|как|чего|такой|ее|про|никогда|тут|здесь|теперь|быть|сам|без|об|же|им|на|них|ну|кажется|сказать|иногда|кто|нас|меня|есть|мне|раз|то|чуть|была|вдруг|вам|себя|только|да|нельзя|ему|чем|между|его|их|нее|нет|о|том|тем|тогда|всего|мы|будто

Боже, даруй мне душевный покой Принять то, что я не в силах изменить, Мужество изменить то, что могу, И мудрость отличить одно от другого.
$ cat 2.stopwords.pl
#!/usr/bin/perl -w
use 5.011;
use utf8;
binmode STDOUT, ":encoding(UTF-8)";
use Lingua::StopWords qw( getStopWords );
my $stopwords = getStopWords('ru');
use Encode;
say join "|", map decode("KOI8-R", $_), keys %$stopwords;
say $/;
my @words = qw(
Боже, даруй мне душевный покой
Принять то, что я не в силах изменить,
Мужество изменить то, что могу,
И мудрость отличить одно от другого.
);
say join ' ', grep { !$stopwords->{$_} } @words;
__END__
$

что and то are on the list but not "stopped." One has to use pre tags to see the Cyrillic...
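A likely explanation for the words slipping through: the script decodes the keys only for printing, so the hash passed to grep is still keyed by KOI8-R byte strings, while the words from the qw() list are character strings because of use utf8. The two never compare equal. Separately, qw() keeps the commas attached ("то," rather than "то"), which is what the warning on line 15 is about. A minimal sketch of one way around both issues follows; %koi8 is a hand-built stand-in for the hash behind getStopWords('ru') so the example runs without the module, and %stop and @kept are names invented here.

```perl
use strict;
use warnings;
use utf8;                          # this source file contains UTF-8 literals
use Encode qw( encode decode );

# Stand-in for %{ getStopWords('ru') } (assumption): KOI8-R byte-string keys.
my %koi8 = map { ( encode("KOI8-R", $_) => 1 ) } qw( что то не и в я от мне );

# Re-key the hash with decoded character strings so lookups can succeed.
my %stop = map { ( decode("KOI8-R", $_) => 1 ) } keys %koi8;

my $text = "Боже, даруй мне душевный покой Принять то, что я не в силах изменить";

# Pull out runs of word characters instead of using qw(), so punctuation
# never sticks to a word.
my @words = $text =~ /(\w+)/g;

# Lowercase before the lookup; the list is lowercase but text may not be.
my @kept = grep { !$stop{ lc $_ } } @words;

binmode STDOUT, ":encoding(UTF-8)";
print join(" ", @kept), "\n";
```

With the stand-in list above this prints "Боже даруй душевный покой Принять силах изменить". With the real module the same re-keying would apply to the hash it returns, on the assumption stated earlier in the thread that the Russian list comes back in KOI8-R.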
Re^4: Problem getting Russian stopwords by choroba (Cardinal) on Sep 18, 2018 at 19:41 UTC
Re^5: Problem getting Russian stopwords by Aldebaran (Curate) on Sep 18, 2018 at 21:12 UTC
Re^4: Problem getting Russian stopwords by Anonymous Monk on Sep 19, 2018 at 07:26 UTC
Re^5: Problem getting Russian stopwords by Your Mother (Archbishop) on Sep 19, 2018 at 08:08 UTC
Some notes below your chosen depth have not been shown here