removing repeats

dummy2 has asked for the wisdom of the Perl Monks concerning the following question:

Hey perl experts, I am trying to create a perl script on windows active perl, that prints a word only once for a group of adjacent repeats, and that prints the line number that these words appeared. I've created a script that shows just the words are adjacently repeated with their line numbers but it shows all the repeats of a particular group rather than just one of them. Here is the script that I have done. I'd really appreciate your help, as I am a new learner of perl. Thank you.


#!/usr/bin/perl

# rcwords.pl: print immediately adjacent repeated words once from inpu
+t,
# even if they are repeated more than once,
# and print out these words along with the line number(s) that they ap
+peared.

use English;
use diagnostics;

$prevword = '';
$n = 0;

while ($line = <>) {
  
    $n = $n + 1;

    $line =~ s/[[:punct:]]/ /g;

    $line = lc $line;

    @words = split /[[:space:]]+/, $line;

    foreach $word (@words) {

        if ($word ne '') {

            if ($word eq $prevword) {
                print " $n $word\n";
            }

            $prevword = $word;

        }
    }
}
[download]

20050301 Janitored by Corion: Fixed formatting

Comment on removing repeats Download Code

Replies are listed 'Best First'.
Re: removing repeats by dragonchild (Archbishop) on Mar 01, 2005 at 14:22 UTC
You're printing the word as you discover it's a repeat. You actually want to print the word once you find the next word that doesn't match the repeated word. So, you want to do something like: Look at each word. if( $word eq $prev_word) then $times_seen++ else If $times_seen > 1, print the $line_num and $word. $prev_word = $word. $times_seen = 1. Being right, does not endow the right to be rude; politeness costs nothing. Being unknowing, is not the same as being stupid. Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence. Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.	[reply]
Re^2: removing repeats by Roy Johnson (Monsignor) on Mar 01, 2005 at 15:16 UTC
Actually, with a flip-flop, you can print the 2nd occurrence. `foreach $word (@words) { if ($word ne '') { my $flip = ($word eq $prevword)..($word ne $prevword); if ($flip eq '1') { # Use eq to avoid warning when flip is + '' print " $n $word\n"; } $prevword = $word; } }` [download] Caution: Contents may have been coded under pressure.	[reply] [d/l]
Re^3: removing repeats by dummy2 (Initiate) on Mar 01, 2005 at 17:01 UTC
Hey Roy Johnson, Thank you for your reply. I tried it and it works.	[reply]
Re^3: removing repeats by dummy2 (Initiate) on Mar 01, 2005 at 17:24 UTC
Hey Roy Johnson, Thank you for your reply. I tried it and it works. Could you explain the logic to me though, please? My understanding of it is you're creating a new variable which you call $flip and saying what this flip is equal to, i.e it's equal to both word equal to previous word and word that isn't equal to previous word, and then if this flip is equal to 1, i.e the word that is the immediate next one along in an input of text. Is my understanding correct? I would really appreciate your help. Thank you,	[reply]
Re^4: removing repeats by Roy Johnson (Monsignor) on Mar 01, 2005 at 17:47 UTC
Re^2: removing repeats by dummy2 (Initiate) on Mar 01, 2005 at 16:49 UTC
Hi again, Thanks for your very prompt reply but it is the words that are adjacently repeated that I want to print. I can do this part, but what my script does is print out all these repeats instead of just showing the repeat once, so for example if a line contains the word hello eight times in a row, i only want this word to show once. here's my script. I will be very grateful for your help. Thank you!! #!/usr/bin/perl # rcwords.pl: print immediately adjacent repeated words once from inpu +t, # even if they are repeated more than once, # and print out these words along with the line number(s) that they ap +peared. use English; use diagnostics; $prevword = ''; $n = 0; while ($line = <>) { $n = $n + 1; $line =~ s/[[:punct:]]/ /g; $line = lc $line; @words = split /[[:space:]]+/, $line; foreach $word (@words) { if ($word ne '') { if ($word eq $prevword) { unless ($prevword eq $prevword) { print " $n $word\n"; } } $prevword = $word; } } } [download] Edit by BazB. Add code tags.	[reply] [d/l]
Re: removing repeats by holli (Abbot) on Mar 01, 2005 at 15:01 UTC
You can use a regex to remove the duplicates: `use strict; use warnings; my $strng = "hello hello the beatle said said to his fans"; my @words = $strng =~ /\G(\w+ ?)\1/g; print join "", @words; # hello the beatle said to his fans` [download] holli, /regexed monk/	[reply] [d/l]
Re: removing repeats by chas (Priest) on Mar 01, 2005 at 14:33 UTC
You might check out counting lines in perl for some ideas. chas	[reply]
Re: removing repeats by manav (Scribe) on Mar 01, 2005 at 14:30 UTC
`if ($word eq $prevword) {` [download] This will check for adjacent repeats, and print them. Shouldnt you be doing `if ($word ne $prevword) {` [download] instead?? Also there is no need to use $n seperately. The $. variable keeps track of the line numbers for you. Manav	[reply] [d/l] [select]
Re: removing repeats by ww (Archbishop) on Mar 01, 2005 at 17:01 UTC
OT, but this may be an example (trivial, but valid) of violating of the proposition "algorithm first; then code." Valid, that is, at least for cases where the script above may be used to test narrative documents for inadvertent repeats or typos (clearly, that's not the only use, tho the use of the overarching :punct: class seems to lean that way). The substitution on :punct: presents issues in a couple edge cases. Suppose, for a strained example, an individual named "Joe Williams" were the author of a tome on various Williams (eg, Wm. Gates, William of Ghent, Fred Williams), which tome is named "Williams' Williams." Less exotically, suppose a line with fragments of two sentences: "...blab, blah," Foo said to Boo. Boo sneered in reply..." Stripping the punctuation makes the book title satisfy the repeated_word_criteria even tho the original text did not. Similarly, the dialogue example (assuming the lack of a paragraph break as illustrated) would lead the script reporting "Boo" had been duped. Yet adjacent duplication from one line to another would NOT report duplication. In short (because the habits induced by the way [Pp]erl makes writing hasty or one-off scripts easy, I find I have to remind myself often): "algorithm first; then code."	[reply]