Re: Problem getting Russian stopwords

From the Lingua::StopWords docs; not a helpful default approach in my view but it's an old package so…

getStopWords() expects 1-2 arguments. The first, which is required, is an ISO code representing a supported language. If the ISO code cannot be found, getStopWords returns undef.

The second argument should be 'UTF-8' if you want the stopwords encoded in UTF-8. The UTF-8 flag will be turned on, so make sure you understand all the implications of that.

However, it's broken for Russian. :| Russian is returned in KOI8-R by default, though, so you can decode from that yourself. This approach should get you by:

use Lingua::StopWords qw( getStopWords );
use Encode;

binmode STDOUT, ":encoding(UTF-8)";
my $list = getStopWords("ru");
print join "|", map decode("KOI8-R", $_), keys %$list;
print $/;
__END__
нее|вы|них|такой|про|а|чего|его|над|надо|он|всегда|человек|нельзя|тем|тоже|мне…

Re^2: Problem getting Russian stopwords by cormanaz (Deacon) on Sep 18, 2018 at 17:28 UTC
That did it. Many thanks!
Re^3: Problem getting Russian stopwords by Aldebaran (Curate) on Sep 18, 2018 at 19:26 UTC
Can I bother you to post your solution? I'm not quite there yet:

$ ./2.stopwords.pl
Possible attempt to separate words with commas at ./2.stopwords.pl line 15.
два|тебя|даже|всегда|из|он|под|этот|человек|опять|там|ж|после|более|от|вы|ней|не|может|хорошо|и|ей|какая|разве|ты|свою|этом|больше|были|было|почти|что|я|со|другой|моя|какой|всю|при|него|сейчас|если|уже|эту|но|нибудь|впрочем|куда|для|зачем|много|конечно|был|в|три|когда|потому|по|у|этого|уж|мой|того|совсем|или|еще|вот|ним|перед|себе|можно|а|сказал|чтобы|всех|наконец|лучше|ведь|ни|за|тот|бы|тоже|к|до|говорил|надо|жизнь|над|вас|сегодня|они|ли|через|она|все|будет|так|чтоб|ничего|с|во|эти|где|этой|хоть|сказала|один|потом|как|чего|такой|ее|про|никогда|тут|здесь|теперь|быть|сам|без|об|же|им|на|них|ну|кажется|сказать|иногда|кто|нас|меня|есть|мне|раз|то|чуть|была|вдруг|вам|себя|только|да|нельзя|ему|чем|между|его|их|нее|нет|о|том|тем|тогда|всего|мы|будто

Боже, даруй мне душевный покой Принять то, что я не в силах изменить, Мужество изменить то, что могу, И мудрость отличить одно от другого.
$ cat 2.stopwords.pl
#!/usr/bin/perl -w
use 5.011;
use utf8;
binmode STDOUT, ":encoding(UTF-8)";
use Lingua::StopWords qw( getStopWords );
my $stopwords = getStopWords('ru');
use Encode;
say join "|", map decode("KOI8-R", $_), keys %$stopwords;
say $/;
my @words = qw(
Боже, даруй мне душевный покой
Принять то, что я не в силах изменить,
Мужество изменить то, что могу,
И мудрость отличить одно от другого.
);
say join ' ', grep { !$stopwords->{$_} } @words;
__END__
$

что and то are on the list but not "stopped." One has to use pre tags to see the Cyrillic....
Re^4: Problem getting Russian stopwords by choroba (Cardinal) on Sep 18, 2018 at 19:41 UTC
> Possible attempt to separate words with commas at /home/choroba/1.pl line 17.

The warning is right. Remove the commas from the `qw` and "что," will become "что".
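To make the point about `qw` concrete, here is a minimal sketch. It assumes a tiny hand-made stop list (`%stop`, a hypothetical stand-in that is already decoded) and not the real output of getStopWords, and it silences the `qw` warning only so the demonstration prints cleanly.

```perl
use strict;
use warnings;
no warnings 'qw';    # silence "Possible attempt to separate words with commas" for this demo
use utf8;            # the Cyrillic literals below are UTF-8 source text

# Assumption: a two-word stand-in for an already decoded stop list.
my %stop = map { ( $_ => 1 ) } qw( то что );

# qw splits on whitespace only, so the comma stays attached to the first word.
my @with_commas    = qw( то, что );
my @without_commas = qw( то что );

# "то," is not a key of %stop, so it slips through the filter.
my @kept_with    = grep { !$stop{$_} } @with_commas;
my @kept_without = grep { !$stop{$_} } @without_commas;

binmode STDOUT, ":encoding(UTF-8)";
print scalar(@kept_with), " word(s) survive with commas, ",
      scalar(@kept_without), " without\n";
```

This prints "1 word(s) survive with commas, 0 without". In the script above the lookup would still miss after removing the commas as long as the hash keys stay KOI8-R bytes, so the decoding has to be fixed as well.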
Re^5: Problem getting Russian stopwords by Aldebaran (Curate) on Sep 18, 2018 at 21:12 UTC
Re^4: Problem getting Russian stopwords by Anonymous Monk on Sep 19, 2018 at 07:26 UTC
> map decode("KOI8-R", $_), keys %$stopwords;

The problem is that your stopwords are left undecoded in the hash. You should produce a new hash containing transformed keys instead of throwing the results of `decode` out:

my %stopwords;
undef @stopwords{ map decode("KOI8-R", $_), keys %{getStopWords('ru')} };

Also, the stop words are in lower case, which means that you should lowercase your text too before checking whether it's a stopword or not.

say join ' ', grep { ! exists $stopwords{lc $_} } @words;

You may want to `split` your text on `/\W+/` to get the words in one operation.

Успехов (good luck),
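Putting those three suggestions together (decode the keys, lowercase the text, split on `/\W+/`), a rough end-to-end sketch might look like the following. To keep it runnable without Lingua::StopWords installed, `$raw` is a hand-made stand-in for what this thread says `getStopWords('ru')` returns (a hash ref keyed by KOI8-R byte strings); the seven words in it are an assumption, not the real list.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                          # the Cyrillic literals below are UTF-8 source text
use Encode qw( decode encode );

# Assumption: stand-in for getStopWords('ru'), keyed by KOI8-R byte strings.
my $raw = { map { ( encode( "KOI8-R", $_ ) => 1 ) } qw( что то я не в и от ) };

# Decode every key once so the hash holds character strings.
my %stopwords = map { ( decode( "KOI8-R", $_ ) => 1 ) } keys %$raw;

my $text = "Принять то, что я не в силах изменить";

# Split on runs of non-word characters to drop punctuation, then lowercase
# each word before the lookup because the stop words are lower case.
my @kept = grep { length($_) && !exists $stopwords{ lc $_ } } split /\W+/, $text;

binmode STDOUT, ":encoding(UTF-8)";
print join( " ", @kept ), "\n";
```

With this stand-in list the script prints "Принять силах изменить". Swapping `$raw` for the real `getStopWords('ru')` result should behave the same way, provided the module does return KOI8-R keys as described above.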
Re^5: Problem getting Russian stopwords by Your Mother (Archbishop) on Sep 19, 2018 at 08:08 UTC
Re^6: Problem getting Russian stopwords by Anonymous Monk on Sep 19, 2018 at 20:02 UTC