in reply to Re^4: Problem getting Russian stopwords
in thread Problem getting Russian stopwords
Even so, this fails to remove the word. It occurs to me that the Russian list would have to be a lot longer, since Russian has cases and genders that change the form of the words on the list. For example, "одно от другого" means "one from the other". The underlying words are all on the list in their dictionary forms (the nominative masculine один and другой, plus the preposition от), but the neuter одно and the genitive другого are not.
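To make the exact-match problem concrete, here is a minimal sketch. The %stop hash is a hand-picked stand-in, not the module's real list, and %forms is a hypothetical table of a few inflected forms that exists only for this illustration.

```perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# Stand-in stop list holding only dictionary forms (an assumption for illustration).
my %stop = map { ( $_ => 1 ) } qw( один от другой );

my @phrase = qw( одно от другого );

# Exact lookup: only the uninflected preposition is filtered out.
my @kept = grep { !$stop{$_} } @phrase;
print "exact match keeps: @kept\n";

# Hypothetical workaround: list a few inflected forms next to each dictionary form.
my %forms = (
    'один'   => [qw( одна одно одни одного одной )],
    'другой' => [qw( другая другое другие другого )],
);
my %expanded = %stop;
for my $lemma ( keys %forms ) {
    $expanded{$_} = 1 for @{ $forms{$lemma} };
}

my @kept_expanded = grep { !$expanded{$_} } @phrase;
print "expanded list keeps: @kept_expanded\n";
```

Listing forms by hand obviously does not scale; it is only meant to show why a list of dictionary forms misses inflected words.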
$ ./3.stopwords.pl
Боже даруй мне душевный покой
Принять то что я не в силах изменить
Мужество изменить то что могу
И мудрость отличить одно от другого
$ cat 3.stopwords.pl
#!/usr/bin/perl -w
use 5.011;
use utf8;
binmode STDOUT, ":encoding(UTF-8)";
use Lingua::StopWords qw( getStopWords );
my $stopwords = getStopWords('ru');
use Encode;
# say join "|", map decode("KOI8-R", $_), keys %$stopwords;
# say $/;
my $sentence = "Боже, даруй мне душевный покой
Принять то, что я не в силах изменить,
Мужество изменить то, что могу,
И мудрость отличить одно от другого.";
$sentence =~ s/,//g;
$sentence =~ s/\.//g;
my @words = split / /, $sentence;
say join ' ', grep { !$stopwords->{$_} } @words;
__END__
$
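A side note on the script above: split / / breaks only on single spaces, so words on either side of a line break stay glued together, and the capital "И" cannot match a lowercase key. A minimal sketch of a more forgiving tokenizer follows; the %stop hash is a small stand-in made of character strings, not the module's real list.

```perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# Stand-in stop list of character strings (not the module's real list).
my %stop = map { ( $_ => 1 ) } qw( и то что я не в от );

my $sentence = "И мудрость отличить\nодно от другого.";

# split / / leaves "отличить\nодно" glued together across the line break.
my @by_space = split / /, $sentence;

# Splitting on runs of whitespace or punctuation yields clean tokens.
my @tokens = grep { length } split /[\s[:punct:]]+/, $sentence;

# Lowercase before the lookup so that "И" matches the key "и".
my @kept = grep { !$stop{ lc $_ } } @tokens;

print scalar(@by_space), " pieces with split / /\n";
print "kept: @kept\n";
```

This only addresses tokenizing and case; it does nothing about the encoding of the keys returned by the module.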
As I look at the module for German, which allegedly works, and compare it to the Russian one, which seems not to, I notice that both contain two lists. In the German module, I can read the special characters (eszett and umlauts) in the first list, while in the second they are all diamonds with a question mark in the middle. In the Russian module, I cannot read any characters in the first list, and the second list is 100% diamonds with question marks in the middle. I have to wonder whether garden-variety Cyrillic in the first list is what it needs. Abridged listing of the modules:
$ pwd
/usr/local/share/perl/5.26.1/Lingua/StopWords
$ ls
DA.pm EN.pm FI.pm HU.pm NL.pm PT.pm SV.pm
DE.pm ES.pm FR.pm IT.pm NO.pm RU.pm
$ cat DE.pm
package Lingua::StopWords::DE;
use strict;
use warnings;
use Exporter;
our @ISA = qw(Exporter);
our %EXPORT_TAGS = ( 'all' => [ qw( getStopWords ) ] );
our @EXPORT_OK = ( @{ $EXPORT_TAGS{'all'} } );
our $VERSION = 0.08;
sub getStopWords {
    if ( @_ and $_[0] eq 'UTF-8' ) {
        # adding U0 causes the result to be flagged as UTF-8
        my %stoplist = map { ( pack("U0a*", $_), 1 ) } qw(
            ihn ihm es etwas euer eure eurem euren eurer eures fЭr gegen
            jetzt kann kein keine keinem keinen keiner keines kЖnnen
        );
        return \%stoplist;
    }
    else {
        my %stoplist = map { ( $_, 1 ) } qw(
            ihn ihm es etwas euer eure eurem euren eurer eures f�r gegen
            wollen wollte w�rde w�rden zu zum zur zwar zwischen
        );
        return \%stoplist;
    }
}
1;
$ cat RU.pm
package Lingua::StopWords::RU;
use strict;
use warnings;
use Exporter;
our @ISA = qw(Exporter);
our %EXPORT_TAGS = ( 'all' => [ qw( getStopWords ) ] );
our @EXPORT_OK = ( @{ $EXPORT_TAGS{'all'} } );
our $VERSION = 0.08;
sub getStopWords {
    if ( @_ and $_[0] eq 'UTF-8' ) {
        # adding U0 causes the result to be flagged as UTF-8
        my %stoplist = map { ( pack("U0a*", $_), 1 ) } qw(
            и в во не что он на я с со как а то
            более всегда конечно всю между
        );
        return \%stoplist;
    }
    else {
        my %stoplist = map { ( $_, 1 ) } qw(
            � � �� �� ��� �� �� � � �� ��� � ��
            ��� ��� ��� ��� �� �� �� � ������� ��� �����
        );
        return \%stoplist;
    }
}
1;
$
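Judging from the two branches, the default list seems to hold raw bytes in a legacy encoding, while my script compares against character strings because of "use utf8". A minimal sketch of that mismatch, using only the core Encode module: %legacy is a stand-in for the else branch, and KOI8-R is an assumption taken from the commented-out decode() call in my script.

```perl
use strict;
use warnings;
use utf8;
use Encode qw( encode decode );

# Stand-in for the default (else) branch: keys stored as raw KOI8-R bytes.
# KOI8-R is an assumption, not something the listing itself confirms.
my %legacy;
$legacy{ encode( 'KOI8-R', $_ ) } = 1 for qw( и в не что от );

# A character-string word from a "use utf8" script never equals those bytes.
my $word    = 'что';
my $hit_raw = exists $legacy{$word} ? 1 : 0;

# Decoding each key into a character string makes the lookup succeed.
my %decoded     = map { ( decode( 'KOI8-R', $_ ) => 1 ) } keys %legacy;
my $hit_decoded = exists $decoded{$word} ? 1 : 0;

binmode STDOUT, ':encoding(UTF-8)';
print "raw keys match: $hit_raw, decoded keys match: $hit_decoded\n";
```

With the installed module, the first branch in the listing suggests that passing 'UTF-8' as a second argument, as in getStopWords('ru', 'UTF-8'), should select the character-string list directly, which would make the manual decoding unnecessary.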