
Even so, this fails to remove the word. It occurs to me that the Russian list would have to be a lot longer, because Russian words change form with case and gender. For example, "одно от другого" means "one from the other". All of its constituent words are on the list in the nominative masculine form, but the neuter одно and the genitive другого are not.
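To make the inflection point concrete, here is a minimal sketch. The three-word hash is a hypothetical stand-in for the module's list, assumed to hold only dictionary forms; it is not the real list.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical stand-in for the stoplist: dictionary forms only (assumption).
my %stopwords = map { ( $_ => 1 ) } qw( один от другой );

# The phrase as it occurs in the text: neuter, preposition, genitive.
my @phrase = qw( одно от другого );

# An exact-match lookup only recognises forms that are literally on the list.
for my $word (@phrase) {
    printf "%s: %s\n", $word, $stopwords{$word} ? 'stopword' : 'kept';
}

my @kept = grep { !$stopwords{$_} } @phrase;
print "kept: @kept\n";
```

Only the preposition is filtered here. An exact-match list would need every inflected form, or the words would need to be reduced to a base form before the lookup.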

$ ./3.stopwords.pl 
Боже даруй мне душевный покой
Принять то что я не в силах изменить
Мужество изменить то что могу
И мудрость отличить одно от другого
$ cat 3.stopwords.pl 
#!/usr/bin/perl -w
use 5.011;
use utf8;
binmode STDOUT, ":encoding(UTF-8)";
use Lingua::StopWords qw( getStopWords );
my $stopwords = getStopWords('ru');
use Encode;

# say join "|", map decode("KOI8-R", $_), keys %$stopwords;
# say $/;

my $sentence = "Боже, даруй мне душевный покой
Принять то, что я не в силах изменить,
Мужество изменить то, что могу,
И мудрость отличить одно от другого.";

$sentence =~ s/,//g;
$sentence =~ s/\.//g;

my @words = split / /, $sentence;
 
say join ' ', grep { !$stopwords->{$_} } @words;
__END__ 

$
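Two things in the script above look like they could each cause a miss on their own: the default list seems to be keyed by KOI8-R bytes (the commented-out decode line hints at that) while the sentence is a character string, and split / / leaves words glued together across the line breaks. A rough sketch of both adjustments follows. It uses a small hand-made hash as a stand-in for the module's default list, so the KOI8-R keys are an assumption, and lowercasing before the lookup is my own choice rather than anything the module does.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Encode qw( encode decode );
binmode STDOUT, ':encoding(UTF-8)';

# Stand-in for the default list: keys stored as KOI8-R byte strings (assumption).
my %koi8_list = map { ( encode('KOI8-R', $_) => 1 ) } qw( что я не в то и от мне );

# A character-string key does not match a byte-string key.
print "direct lookup of 'что': ", ( $koi8_list{'что'} ? 'hit' : 'miss' ), "\n";

# Decode every key once so that lookups compare characters with characters.
my %stopwords = map { ( decode('KOI8-R', $_) => 1 ) } keys %koi8_list;

my $sentence = "Принять то, что я не в силах изменить,\nМужество изменить то, что могу,";
$sentence =~ s/[,.]//g;

# split ' ' breaks on any run of whitespace, newlines included.
my @words = split ' ', $sentence;

# Lowercase for the lookup only, so capitalised words can still match.
my @kept = grep { !$stopwords{ lc $_ } } @words;
print join( ' ', @kept ), "\n";
```

The RU.pm source quoted further down also has a branch for a 'UTF-8' argument, which may be the simpler route; the sketch does not exercise it.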

As I look at the module for German, which allegedly works, and compare it to the Russian one, which seems not to, I notice that each contains two lists. In the German module, I can read the special characters (eszett and umlauts) in the first list, while the second is all diamonds with a question mark in the middle. In the Russian module, I can read none of the characters in the first list, and the second list is entirely diamonds with question marks in the middle. I have to wonder whether what it needs is garden-variety Cyrillic in the first list. Abridged listing of the modules:


$ pwd
/usr/local/share/perl/5.26.1/Lingua/StopWords
$ ls
DA.pm  EN.pm  FI.pm  HU.pm  NL.pm  PT.pm  SV.pm
DE.pm  ES.pm  FR.pm  IT.pm  NO.pm  RU.pm
$ cat DE.pm
package Lingua::StopWords::DE;

use strict;
use warnings;

use Exporter;
our @ISA = qw(Exporter);

our %EXPORT_TAGS = ( 'all' => [ qw( getStopWords ) ] );
our @EXPORT_OK = ( @{ $EXPORT_TAGS{'all'} } );
our $VERSION = 0.08;

sub getStopWords {
    if ( @_ and $_[0] eq 'UTF-8' ) {
        # adding U0 causes the result to be flagged as UTF-8
        my %stoplist = map { ( pack("U0a*", $_), 1 ) } qw( 

            ihn ihm es etwas euer eure eurem euren eurer eures für gegen
  
            jetzt kann kein keine keinem keinen keiner keines können
    
        );
        return \%stoplist;
    }
    else {
        my %stoplist = map { ( $_, 1 ) } qw( 

            ihn ihm es etwas euer eure eurem euren eurer eures f�r gegen
  
            wollen wollte w�rde w�rden zu zum zur zwar zwischen 
        );
        return \%stoplist;
    }
}

1;
$ cat RU.pm
package Lingua::StopWords::RU;

use strict;
use warnings;

use Exporter;
our @ISA = qw(Exporter);

our %EXPORT_TAGS = ( 'all' => [ qw( getStopWords ) ] );
our @EXPORT_OK = ( @{ $EXPORT_TAGS{'all'} } );
our $VERSION = 0.08;

sub getStopWords {
    if ( @_ and $_[0] eq 'UTF-8' ) {
        # adding U0 causes the result to be flagged as UTF-8
        my %stoplist = map { ( pack("U0a*", $_), 1 ) } qw( 
            и в во не что он на я с со как а то
    
            более всегда конечно всю между 
        );
        return \%stoplist;
    }
    else {
        my %stoplist = map { ( $_, 1 ) } qw( 
            � � �� �� ��� �� �� � � �� ��� � �� ��� ��� ��� ��� �� �� �� �
         
            ������� ��� ����� 
        );
        return \%stoplist;
    }
}

1;
$
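The diamonds in the second lists would be consistent with single-byte text (KOI8-R for Russian, Latin-1 for German) being displayed by a UTF-8 terminal. That is my reading, not something the listing proves. A small sketch of the effect, using only the core Encode module and one sample word:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Encode qw( encode decode );
binmode STDOUT, ':encoding(UTF-8)';

# One stopword, encoded the way the non-UTF-8 list is assumed to be stored.
my $koi8_bytes = encode('KOI8-R', 'что');

# Show the raw bytes: one byte per letter.
printf "KOI8-R bytes:   %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $koi8_bytes;

# Misreading those bytes as UTF-8 substitutes U+FFFD, the "diamond".
my $misread = decode('UTF-8', $koi8_bytes);
printf "read as UTF-8:  %s\n", $misread;

# Reading them with the matching encoding recovers the word.
my $recovered = decode('KOI8-R', $koi8_bytes);
printf "read as KOI8-R: %s\n", $recovered;
```

If that is what is happening, the second list is not damaged. It is just being viewed with the wrong encoding.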

In reply to Re^5: Problem getting Russian stopwords by Aldebaran
in thread Problem getting Russian stopwords by cormanaz
