PerlMonks  

can you please fix the error

by zulqernain (Novice)
on Jun 01, 2005 at 16:34 UTC ( [id://462543]=perlquestion )

zulqernain has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am trying to remove the stopwords from a text, e.g. the text: "Depending on the CD writing software you use, you will be looking for a .IMG file or a .ISO file. If you don't see the file and are browsing the correct folder, try one of". The text is saved in a file whose path I give on the command line. I wrote this code:
#!/usr/bin/perl
use warnings;
use strict;
my (@data,$word,@lessWords,%stopwords,$stop);
open( LIST, "stopwords.txt" ) or die "$!";
my @stopwords = <LIST>;
close LIST;
chomp @stopwords;
$/="";
while(<>) {
    push @data, $_;
}
foreach $word (@data) {
    foreach $stop (@stopwords) {
        next if $word eq $stop;
    }
    push(@lessWords, $word);
}
print "@lessWords\n";
But it is not removing the stopwords. Does anyone know why, and how I can fix it?

Replies are listed 'Best First'.
Re: can you please fix the error
by Joost (Canon) on Jun 01, 2005 at 16:47 UTC
      You were right: I was comparing the whole paragraph with a word, and that's why $word and $stop were never equal; $word holds a full paragraph. Now I have fixed that (I hope), but it's still not working :(
      foreach $word (@data) {
          my @arr=split(/ /,$word);
          WORD: foreach $a (@arr) {
              STOP: foreach $stop (@stopwords) {
                  next WORD if $a eq $stop;
              }
          }
          push(@lessWords, $word);
      }
      print "@lessWords\n";
      plz help
        You're still pushing $word onto @lessWords regardless of the match. Simple fix:
        foreach $word (@data) {
            my @arr=split(/ /,$word);
            WORD: foreach $a (@arr) {
                foreach $stop (@stopwords) {
                    next WORD if $a eq $stop;
                }
                push(@lessWords, $word);
            }
        }
        Using a hash for the stopwords would make this code simpler and faster, though. There are some examples elsewhere in this thread.

        update: the above code is still buggy. See below for a much simpler version that should work.

Re: can you please fix the error
by davidrw (Prior) on Jun 01, 2005 at 16:43 UTC
    The next statement you have is essentially useless: it only skips to the next iteration of the inner for-loop, which means that the push is always executed. This could be resolved by using a label so that the next applies to the outer for-loop, or by using a boolean flag, but consider a different approach: a hash to provide a dictionary (grep over the stopword list would be natural, but inefficient in this case):
    my %stopwords = map { $_ => undef } @stopwords;
    $/=" ";
    while(<>){
        chomp;
        push @lessWords, $_ unless exists $stopwords{$_};
    }

    Update: An after-thought rewrite:
    my %stopwords = map { $_ => undef } @stopwords;
    $/=" ";
    # keep only the words that are NOT in the stopword hash
    @lessWords = grep( !exists $stopwords{$_}, map {chomp; $_} <> );
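For comparison, the boolean-flag alternative mentioned at the start of this reply could look roughly like the sketch below. The sample contents of @stopwords and @data, and the names $is_stop and @with_flag, are made up for illustration, and it assumes @data already holds one word per element.

```perl
use strict;
use warnings;

# Hypothetical sample data; in the real script these come from the files.
my @stopwords = qw(the a of);
my @data      = qw(the name of a file);

my @with_flag;
for my $word (@data) {
    # The flag records whether any stopword matched this word.
    my $is_stop = 0;
    for my $stop (@stopwords) {
        if ($word eq $stop) {
            $is_stop = 1;
            last;               # no need to check the remaining stopwords
        }
    }
    # The push now depends on the result of the inner loop.
    push @with_flag, $word unless $is_stop;
}

print "@with_flag\n";
```

The flag does the same job as a loop label; the hash lookup is still the shorter and faster choice once the stopword list grows.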
Re: can you please fix the error
by dbwiz (Curate) on Jun 01, 2005 at 16:44 UTC

    Not the most efficient method, but anyway, assuming that both your files contain one word per line, the mistake is here:

    WORD:
    foreach $word (@data) {
        STOP:
        foreach $stop (@stopwords) {
            next WORD if $word eq $stop; # <---
        }
        push(@lessWords, $word);
    }

    Whenever a stopword matches, you should skip to the next iteration of the outer loop.

    Consider using a hash instead.

    my %stops = map {$_,undef} @stopwords;
    for my $word (@data) {
        next if exists $stops{$word};
        push(@lessWords, $word);
    }
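The hash lookup above assumes one word per element. For running text like the sample in the question, a sketch along the following lines might be closer to what is needed. The helper name remove_stopwords, the short stopword list, and the choice to ignore case and surrounding punctuation when comparing are all assumptions made for illustration, not something from the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the words of $text that are not stopwords.
sub remove_stopwords {
    my ($text, @stopwords) = @_;

    # Build the lookup hash once; keys are lower-cased stopwords.
    my %stops = map { lc($_) => 1 } @stopwords;

    my @kept;
    for my $word (split ' ', $text) {
        # Compare a normalised copy (lower case, no leading or trailing
        # punctuation) but keep the original token in the output.
        (my $key = lc $word) =~ s/^\W+|\W+$//g;
        next if exists $stops{$key};
        push @kept, $word;
    }
    return @kept;
}

# Made-up stopword list; the real one would be read from stopwords.txt.
my @stopwords = qw(on the you use will be for a or if and are one of);
my $text = "Depending on the CD writing software you use, you will be looking for a .IMG file";

my @lessWords = remove_stopwords($text, @stopwords);
print "@lessWords\n";
```

Splitting on ' ' (a single-space string, not a regex) makes split treat any run of whitespace, including newlines, as one separator, so it does not matter whether the input was read by line or by paragraph.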
Re: can you please fix the error
by davidrw (Prior) on Jun 01, 2005 at 16:58 UTC
    My response above is actually very similar to a response by graff, Re: removing stop words, to another node you posted about this same general problem. Were the solutions in your previous thread insufficient? If so, let us know how, so that further advice can be given in the same context.
Re: can you please fix the error
by thundergnat (Deacon) on Jun 01, 2005 at 16:44 UTC

    My suspicion is that $word and $stop are never equal to each other, so the line

    next if $word eq $stop;

    is never true.

    You are loading @data in paragraph mode (each element may contain newlines and many words) and @stopwords in line mode with chomp (elements will not contain newlines). Could your problem lie there? Without some test data, it is hard to say exactly what the problem is.
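To see the difference between the two modes without any test files, something like the following sketch can be run. The sample text and the helper read_records are invented for the demonstration; it reads from an in-memory filehandle instead of the files named in the question.

```perl
use strict;
use warnings;

# Hypothetical sample text standing in for the input file.
my $text = "you will be looking for\na .IMG file\n\nIf you don't see the file\n";

# Read all records from a string using the given input record separator.
sub read_records {
    my ($input, $separator) = @_;
    local $/ = $separator;      # "" = paragraph mode, "\n" = line mode
    open my $fh, '<', \$input or die "Cannot open in-memory handle: $!";
    my @records = <$fh>;
    close $fh;
    return @records;
}

my @paragraphs = read_records($text, "");
my @lines      = read_records($text, "\n");

printf "paragraph mode: %d records\n", scalar @paragraphs;
printf "line mode:      %d records\n", scalar @lines;

# A whole paragraph never equals a single chomped stopword,
# so split each record into words before comparing.
my @words = map { split ' ', $_ } @paragraphs;
print "first words: @words[0 .. 2]\n";
```

Each paragraph-mode record holds several words plus newlines, which is why an eq test against a one-word stopword cannot succeed until the record is split.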
