zulqernain has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am trying to remove the stopwords from a text, e.g.: "Depending on the CD writing software you use, you will be looking for a .IMG file or a .ISO file. If you don't see the file and are browsing the correct folder, try one of". The text is saved in a file whose path I give on the command line. I wrote this code:
#!/usr/bin/perl
use warnings;
use strict;

my (@data,$word,@lessWords,%stopwords,$stop);
open( LIST, "stopwords.txt" ) or die "$!";
my @stopwords = <LIST>;
close LIST;
chomp @stopwords;
$/="";
while(<>) {
    push @data, $_;
}
foreach $word (@data) {
    foreach $stop (@stopwords) {
        next if $word eq $stop;
    }
    push(@lessWords, $word);
}
print "@lessWords\n";
But it is not removing the stopwords. Does anyone know why, and how I can fix it?

Re: can you please fix the error
by Joost (Canon) on Jun 01, 2005 at 16:47 UTC
      You were right: I was comparing the whole paragraph with a single word, which is why $word and $stop were never equal. $word holds a full paragraph. I have fixed that now (I hope), but it's still not working :(
      foreach $word (@data) {
          my @arr=split(/ /,$word);
          WORD: foreach $a (@arr) {
              STOP: foreach $stop (@stopwords) {
                  next WORD if $a eq $stop;
              }
          }
          push(@lessWords, $word);
      }
      print "@lessWords\n";
      Please help.
        You're still pushing $word onto @lessWords regardless of the match. Simple fix:
        foreach $word (@data) {
            my @arr=split(/ /,$word);
            WORD: foreach $a (@arr) {
                foreach $stop (@stopwords) {
                    next WORD if $a eq $stop;
                }
                push(@lessWords, $word);
            }
        }
        Using a hash instead of the @stopwords array would make this code simpler and faster, though. There are some examples elsewhere in this thread.

        Update: the above code is still buggy. See below for a much simpler version that should work.
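
        For reference, the remaining bug in the loop above is that it pushes $word (the whole paragraph) once for every word that is not a stopword, instead of pushing the word itself. A minimal sketch of the labelled-loop approach with that corrected follows. It is only an illustration, not the simpler version referred to above; the sample stopwords and text are made up, and a lexical $w replaces $a, which Perl reserves for sort.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data; the real script reads both from files.
my @stopwords = qw(the on you will be for a or);
my @data      = ("Depending on the CD writing software you use");

my @lessWords;
foreach my $paragraph (@data) {
    # split ' ' splits on any run of whitespace, including newlines.
    WORD: foreach my $w (split ' ', $paragraph) {
        foreach my $stop (@stopwords) {
            # A match abandons this WORD iteration, so the push is skipped.
            next WORD if $w eq $stop;
        }
        push @lessWords, $w;    # push the single word, not the paragraph
    }
}

print "@lessWords\n";
```

        With this sample data, "on", "the" and "you" are dropped and the other words are kept in their original order.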

Re: can you please fix the error
by davidrw (Prior) on Jun 01, 2005 at 16:43 UTC
    The next statement you have is essentially useless: it only moves on to the next iteration of the inner for-loop, which means that the push is always executed. This could be resolved by using a label so that the next applies to the outer for-loop, or by using a boolean flag, but consider a different approach: a hash to provide a dictionary (grep would be natural, but inefficient in this case):
    my %stopwords = map { $_ => undef } @stopwords;
    $/=" ";
    while(<>){
        chomp;
        push @lessWords, $_ unless exists $stopwords{$_};
    }

    Update: An after-thought rewrite:
    my %stopwords = map { $_ => undef } @stopwords;
    $/=" ";
    # keep only the words that are NOT in the stopword hash
    @lessWords = grep( !exists $stopwords{$_}, map {chomp; $_} <> );
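
    One caveat with setting $/ to a single space: a word followed by punctuation or a newline (such as "use," in the sample text) keeps that extra character and will not match its stopword, and the comparison is case-sensitive. A rough sketch of one way around this, assuming the stopword list is lowercase and that a "word" is a run of word characters or apostrophes; remove_stopwords is a hypothetical helper name and the stopwords here are made up:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the words of $text that are not stopwords.
sub remove_stopwords {
    my ($text, $stop) = @_;
    # Assumption: a word is a run of letters, digits, underscores or apostrophes.
    my @words = $text =~ /[\w']+/g;
    # Compare in lowercase so that "The" matches the stopword "the".
    return grep { !exists $stop->{ lc $_ } } @words;
}

# Made-up stopword list; the real one comes from stopwords.txt.
my %stopwords = map { $_ => undef } qw(the on you will be for a or);
my $text = "Depending on the CD writing software you use, you will be\n"
         . "looking for a .IMG file or a .ISO file.";

my @lessWords = remove_stopwords($text, \%stopwords);
print "@lessWords\n";
```

    Note that this drops punctuation from the output entirely, which may or may not be what you want.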
Re: can you please fix the error
by dbwiz (Curate) on Jun 01, 2005 at 16:44 UTC

    Not the most efficient method, but anyway, assuming that both your files contain one word per line, the mistake is here:

    WORD: foreach $word (@data) {
        STOP: foreach $stop (@stopwords) {
            next WORD if $word eq $stop; # <---
        }
        push(@lessWords, $word);
    }

    Whenever a stopword matches, you should skip to the next iteration of the outer loop.

    Consider using a hash instead.

    my %stops = map {$_,undef} @stopwords;
    for my $word (@data) {
        next if exists $stops{$word};
        push(@lessWords, $word);
    }
Re: can you please fix the error
by davidrw (Prior) on Jun 01, 2005 at 16:58 UTC
    My response above is actually very similar to a response by graff, Re: removing stop words, in another node you posted about this same general problem. Were the solutions in your previous thread insufficient? If so, let us know how, so that further advice can be given in the same context.
Re: can you please fix the error
by thundergnat (Deacon) on Jun 01, 2005 at 16:44 UTC

    My suspicion is that $word and $stop are never equal to each other, so the line

    next if $word eq $stop;

    is never true.

    You are loading @data in paragraph mode (may contain newlines) and @stopwords in line mode with chomp (will not contain newlines). Could your problem lie there? Without some test data, it is hard to say exactly what the problem is.
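
    To see the difference between the two modes, here is a small sketch that reads the same made-up text once in line mode and once in paragraph mode, using in-memory filehandles in place of real files:

```perl
use strict;
use warnings;

# Made-up sample text: two paragraphs separated by a blank line.
my $text = "the quick\nbrown fox\n\njumps over\n";

# Line mode (the default $/ of "\n"): one record per line.
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my @lines = <$fh>;
close $fh;
chomp @lines;

# Paragraph mode ($/ = ""): one record per blank-line-separated paragraph.
open $fh, '<', \$text or die "Cannot open in-memory file: $!";
my @paragraphs = do { local $/ = ""; <$fh> };
close $fh;

print scalar(@lines), " lines, ", scalar(@paragraphs), " paragraphs\n";
print "first paragraph eq 'the'? ",
      ($paragraphs[0] eq 'the' ? "yes" : "no"), "\n";
```

    Each paragraph record still contains its embedded newlines and every word in it, so a string comparison against a single chomped stopword cannot succeed.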