Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Is there an easier way to remove duplicate words (e.g. "I love love this") from a text file?
my ($File, $Line, $Contents, @Lines, $i, $j);
$i = 0;
$j = 1;
$File = "P:\\K\\kmartin\\set1.txt";

# Open file and put it into one long line
open (FH, "<$File");
while (defined($Line=<FH>)) {
    chomp $Line;
    $Contents .= $Line;
}
close (FH);

# Puts words into an array
@Lines = split /\W+/, $Contents;

while (<@Lines>) {
    # compare contents on array cell
    if ($Lines[$i] eq $Lines[$j]) {
        print "Duplicate word: $Lines[$i]\n";
        $i++;
        $j++;
    }
    # If not equal, increment counters
    else {
        $i++;
        $j++;
    }
}

Replies are listed 'Best First'.
Re: Duplicate Words
by Anonymous Monk on Apr 20, 2001 at 02:17 UTC
    s/(homework\s*){2}/homework /;
      Hmm,
      What about:
      s/(\b\w+?\b)\s+\1/$1/g;
      This will match "This is a sentence sentence." as well as "This is a sentence sentence to.".
      bent
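
One caveat with the substitution above: there is no word boundary after `\1`, so "the theory" appears to match as "the" followed by the "the" inside "theory" and collapses to "theory". A minimal sketch with a trailing `\b`, which also squeezes runs of three or more; the sub name `squeeze_repeats` is made up for illustration:

```perl
use strict;
use warnings;

# Collapse a run of one repeated word into a single copy.
# The trailing \b keeps "the theory" from being read as "the the".
sub squeeze_repeats {
    my ($text) = @_;
    $text =~ s/\b(\w+)(?:\s+\1\b)+/$1/g;
    return $text;
}

print squeeze_repeats("I love love this"), "\n";   # prints "I love this"
print squeeze_repeats("the theory"), "\n";         # prints "the theory"
```

The comparison is case-sensitive, so "The the" is left alone unless `/i` is added.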
        If you want to remove dup words from anywhere in the string rather than just consecutive duplicates, try:

        s/\w+\s*/$words{$&}++?'':$&/ge;
        or to ignore case:

        s/\w+\s*/$words{"\L$&"}++?'':$&/ge;
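
The same idea can be written with explicit captures instead of `$&`, which perlvar warns could slow down every regex in the program on older Perls. This is only a sketch: `drop_repeats_anywhere`, `%seen` and the `$ignore_case` flag are names invented here, with `%seen` playing the role of `%words` above.

```perl
use strict;
use warnings;

# Keep the first occurrence of every word and drop later ones,
# wherever they appear in the string.
sub drop_repeats_anywhere {
    my ($text, $ignore_case) = @_;
    my %seen;
    $text =~ s{(\w+)(\s*)}{
        my $key = $ignore_case ? lc $1 : $1;
        $seen{$key}++ ? '' : "$1$2";    # drop the word and its trailing space
    }ge;
    return $text;
}

print drop_repeats_anywhere("The the cat sat on the mat", 1), "\n";
```

Like the one-liners above, a dropped word takes its trailing whitespace with it, so a kept word can be left with a trailing space at the end of the string.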

homework! (Re: Duplicate Words)
by tye (Sage) on Apr 20, 2001 at 02:17 UTC

    At least this homework assignment wasn't quite so obvious as the last time.

            - tye (but my friends call me "Tye")
Re: Duplicate Words
by jbert (Priest) on Apr 20, 2001 at 17:13 UTC
    OK. It is homework, but some general comments on coding:
    • You have code duplicated in both the 'if' and the 'else' part - this could and should be moved to the end of the loop.
    • You have two variables (i and j), one of which is always one more than the value of the other. Your code would be cleaner if you only used 'i' and replaced 'j' with '(i+1)'.
    • You are using a Windows system, which pretends to prefer '\' to separate folders. The '\' character is special in many places, including inside strings marked with double quotes ("). You can either use single quotes (') to mark your string (which means you don't have to double each \ to make it work) or (much better) you can take advantage of the fact that Windows is happy to use '/' as a file separator, i.e.
      $File = "P:/K/kmartin/set1.txt";
      should work fine, and has the advantage of working on Unix or Windows boxes.
    • Perl has lots of features to make code like this simpler. For example, reading an entire file into one variable doesn't require a loop (check the documentation for the '$/' variable in 'perlvar'), and you can loop through arrays without using 'i' or 'j' as index variables (check the documentation for 'foreach' and for 'shift', 'unshift', 'push' and 'pop').
    • Most importantly, Perl is really a nice language built around an excellent regular expression engine. For the kinds of text processing you want to do, check the 'perlre' documentation. It's the "right way" to do this kind of job.
    Good luck with the coding.
    PS. Where I refer to 'perlvar', 'perlre' etc., these are part of the standard documentation which comes with Perl. On Windows with ActiveState Perl, you can often find this in HTML format under Start button/Programs/ActivePerl, and on all systems you can type "perldoc perlvar" (or whatever) at a command prompt and get the information.
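
A rough sketch of what those suggestions could look like together: slurping with a localized `$/`, and a single index compared against its successor. The sample text and the in-memory filehandle are stand-ins (assumptions) so the code runs without set1.txt; with the real file you would open "P:/K/kmartin/set1.txt" instead. Note that slurping keeps the newlines, so the last word of one line is not glued to the first word of the next, which the chomp-and-append loop in the question would do.

```perl
use strict;
use warnings;

# Stand-in for the contents of set1.txt, read through an in-memory
# filehandle so the sketch runs without the original file.
my $sample = "I love love this\nand this this too\n";
open my $fh, '<', \$sample or die "Cannot open sample: $!";

# With $/ undefined inside this block, <$fh> returns the whole file at once.
my $contents = do { local $/; <$fh> };
close $fh;

# Skip the empty leading field that split produces if the text
# starts with a non-word character.
my @words = grep { length } split /\W+/, $contents;

# One index is enough: compare each word with the one after it.
my @duplicates;
for my $i (0 .. $#words - 1) {
    push @duplicates, $words[$i] if $words[$i] eq $words[$i + 1];
}
print "Duplicate word: $_\n" for @duplicates;
```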
Re: Duplicate Words
by orkysoft (Friar) on Apr 20, 2001 at 03:18 UTC

    while (<@Lines>) {

    Cool, I didn't know you could do it that way as well. Still I think foreach is a lot less confusing.

      Would not foreach load all the lines into memory first?
        They're already in memory. (I'm pretty sure)
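
For what it's worth, perlop says that anything inside angle brackets that is not a filehandle or a simple scalar is treated as a filename pattern to glob, so `while (<@Lines>)` most likely loops over a glob of the interpolated array rather than over the array itself. It happens to step through the words here, but `foreach` says what is meant, and since it aliases the elements of an array that already exists, it should not load anything extra. A small sketch with a made-up word list:

```perl
use strict;
use warnings;

# Hypothetical word list standing in for the result of the split.
my @Lines = qw(I love love this this thing);

# Walk the array directly, remembering the previous word;
# no index variables are needed.
my @found;
my $previous;
foreach my $word (@Lines) {
    push @found, $word if defined $previous && $word eq $previous;
    $previous = $word;
}
print "Duplicate word: $_\n" for @found;
```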
Re: Duplicate Words
by Anonymous Monk on Apr 20, 2001 at 17:08 UTC
    How about changing it to finish like this:
    @Lines = split /\W+/, $Contents; while (<@Lines>) { $Contents =~ s/$_//; }
    and save the $Contents as your returning string.

    This is my guess, I'm totally new to Perl.

    2001-04-20 Edit by Corion : Added CODE tags

Re: Duplicate Words
by Anonymous Monk on Apr 20, 2001 at 22:31 UTC
    The main thing that will help you here is to realize that you can use parentheses around the part of the regular expression that matches any word, and then use "\1" in the same regular expression to test for the immediate second occurrence of that word. Actual code of course is the assignment. If you tell us when it's due, maybe we'll post solutions (probably pithier than teacher's :-) afterwards.
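
For readers coming to this thread later, a hedged sketch of that idea used for detection rather than removal: a capture group for the word and `\1` for its immediate repeat, matched in a `/g` loop. The sub name `repeated_words` is invented here and this is not anyone's posted solution.

```perl
use strict;
use warnings;

# Return each word that is immediately followed by itself.
sub repeated_words {
    my ($text) = @_;
    my @hits;
    while ($text =~ /\b(\w+)\s+\1\b/g) {
        push @hits, $1;
    }
    return @hits;
}

print "Duplicate word: $_\n"
    for repeated_words("This is a sentence sentence to.");
```

Because `\s+` also matches newlines, running this over a whole slurped file would catch a word repeated across a line break as well.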
Re: Duplicate Words
by Anonymous Monk on Apr 20, 2001 at 14:06 UTC
    #!/usr/local/bin/perl
    $var="word1 word2 word word word3 someth word3";
    print $var."\n";
    @words=split(/ /,$var);
    $len=@words;
    $var="";
    # compare each word with the one after it
    for ($i=0; $i<=$len-1; $i++) {
        if ($words[$i] eq $words[$i+1]) {print $words[$i]," detected\n"}
        else {$var.=$words[$i]." "}
    }
    print $var."\n";
Re: Duplicate Words
by chorg (Monk) on Apr 20, 2001 at 18:40 UTC
    I believe that the new Camel has this problem as one of the examples - either chapter 2 or chapter 5...
    _______________________________________________
    "Intelligence is a tool used achieve goals, however goals are not always chosen wisely..."
Re: Duplicate Words
by Sprad (Hermit) on Apr 20, 2001 at 19:53 UTC
    One thing to consider is the fact that that particular course of action might not be correct in all cases. Note the completely proper and desired usage of "that that" in the previous sentence. But then you're getting into grammar checking, and that's probably beyond the scope of your class.

    ---
    I'm too sexy for my .sig.

      For example:
      Had had had "had had", had Had had had "had" Had would have been correct.
      The above was a comment about the grammar of an essay written by an author named Had.

      My memory concerns me - but I forget why !!!