element22 has asked for the wisdom of the Perl Monks concerning the following question:

I have a file that contains hashtags (on a single line). Some of them may be repeated and non-adjacent, e.g. "#tag1 #tag2 #tag3 #tag1". I can detect duplicates with this:

$tags=~/(#\S+).+$1/;

But I fail to remove the duplicate with the line below. What am I doing wrong?

$tags=~s/(#\S+\s?)(.+$1)/$2/g;

Replies are listed 'Best First'.
Re: Regex: remove non-adjacent duplicate hashtags
by LanX (Saint) on Jul 23, 2022 at 13:36 UTC
    > What am I doing wrong?

    you have multiple issues

    • use \1 instead of $1 on the match side
    • /g moves your start pos where the last match stopped, so you either need to use a lookahead (?=) or \K to keep the position after the first group
    • your first capture group includes the trailing whitespace, so it won't match a tag that is followed by \n instead

    here is a demo with perl -dE0

    DB<70> $r = " #1 #2 #3 #1 #3 #2 #3 #2"
    DB<72> $_=$r; s/(#\S+\s?)(.+$1)(?{say "$_ ($1) - ($2) pos:",pos($_)})/$2/g ;say "fin: $_"
    #1 #2 #3 #1 #3 #2 #3 #2 (#1 ) - (#2 #3 #1 #3 #2 #3 #2) pos:24
    fin: #2 #3 #1 #3 #2 #3 #2

    you can use this as template to experiment and debug with (?{...})

    update

    like using \K

    DB<73> $_=$r; s/(#\S+\s?)\K(.+\1)(?{say "$_ ($1) - ($2) pos:",pos($_)})//g ;say
    #1 #2 #3 #1 #3 #2 #3 #2 (#1 ) - (#2 #3 #1 ) pos:13
    #1 #2 #3 #1 #3 #2 #3 #2 (#3 ) - (#2 #3 ) pos:22
    #1 #3 #2

    update

    like using (?=...)

    DB<81> sub DEB {say "$_ ($1) - ($2) pos:",pos($_)}
    DB<82> $_=$r; s/(#\S+\s?)(?=(.+\1))(?{DEB})//g ;say "fin: $_"
    #1 #2 #3 #1 #3 #2 #3 #2 (#1 ) - (#2 #3 #1 ) pos:4
    #1 #2 #3 #1 #3 #2 #3 #2 (#2 ) - (#3 #1 #3 #2 ) pos:7
    #1 #2 #3 #1 #3 #2 #3 #2 (#3 ) - (#1 #3 #2 #3 ) pos:10
    #1 #2 #3 #1 #3 #2 #3 #2 (#3 ) - (#2 #3 ) pos:16
    #1 #2 #3 #1 #3 #2 #3 #2 (#2) - ( #3 #2) pos:18
    fin: #1 #3 #2
    DB<83>

    update

    as you can see in the debug output, \K was not the best idea, but YMMV

    update

    you still need to fix the whitespace issue ...

    ... and IIRC there was a meta to keep the pos (\g or \G ?), please check perlre (I might misremember tho)

    Cheers Rolf
    (addicted to the Perl Programming Language :)

Re: Regex: remove non-adjacent duplicate hashtags (updated)
by haukex (Archbishop) on Jul 23, 2022 at 13:10 UTC

    You need to use \1 instead of $1, see perlre.

    You may also want to move the \s? out of the capture group.

    Update: Plus, what LanX said about using a lookahead!
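Putting those suggestions together (\1 instead of $1, whitespace kept outside the capture group, and a lookahead so that /g does not skip over text it has already consumed), a minimal sketch could look like the following. The sub name drop_earlier_dupes is made up for illustration, and the (?<!\S) / (?!\S) guards are an added assumption so that #tag1 is not treated as a duplicate of #tag10.

```perl
use strict;
use warnings;

# Hypothetical helper: delete every tag that occurs again later in the string,
# so that only the last occurrence of each tag survives.
sub drop_earlier_dupes {
    my ($tags) = @_;
    $tags =~ s/
        (?<!\S) (\#\S+) (?!\S)      # a whole whitespace-delimited tag, captured without whitespace
        \s*                         # whitespace after it, outside the capture group
        (?= (?:.*\s)? \1 (?!\S) )   # lookahead: the same whole tag appears again later
    //gx;
    return $tags;
}

print drop_earlier_dupes("#tag1 #tag2 #tag3 #tag1"), "\n";   # prints: #tag2 #tag3 #tag1
```

Because the lookahead consumes nothing, each /g iteration resumes right after the removed tag, which is the point LanX makes about the match position.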

      Thanks, it works. Here's my more elaborate version:

      $tags=~s/(#\S{2,}\b)(.*\s?)(\1)(\s?.*)/$3$2$4/g;

Re: Regex: remove non-adjacent duplicate hashtags
by tybalt89 (Monsignor) on Jul 23, 2022 at 18:54 UTC
    #!/usr/bin/perl

    use strict; # https://perlmonks.org/?node_id=11145665
    use warnings;

    $_ = "#tag1 #tag2 #tag3 #tag1\n";
    print " original: $_";
    s/(#\w{2,})\b\h*(?=.*\1\b)//g;
    print " keeping only the last one: $_";

    $_ = "#tag1 #tag2 #tag3 #tag1\n";
    1 while s/(#\w{2,})\b.*\K\1\b\h*//;
    print "keeping only the first one: $_";

    Outputs:

    original: #tag1 #tag2 #tag3 #tag1
    keeping only the last one: #tag2 #tag3 #tag1
    keeping only the first one: #tag1 #tag2 #tag3

Re: Regex: remove non-adjacent duplicate hashtags (updated)
by AnomalousMonk (Archbishop) on Jul 23, 2022 at 21:32 UTC
    $tags=~/(#\S+).+$1/;

    Why is it a bad idea to compile a capture variable, e.g., $1, into a regex as haukex and LanX have noted?

    Capture variables have the values they were assigned on execution of the most recent successful m// or s/// match, or undef if no successful match has ever been done.

    Win8 Strawberry 5.8.9.5 (32) Sat 07/23/2022 16:42:27
    C:\@Work\Perl\monks
    >perl
    use strict;
    use warnings;

    use Data::Dump qw(dd);

    dd 'A: $1', $1;

    my $rx = qr/(#\S+).+$1/;
    print "B: $rx \n";

    'OOPS' =~ /(OOPS)/;
    $rx = qr/(#\S+).+$1/;
    print "C: $rx \n";
    ^Z
    ("A: \$1", undef)
    Use of uninitialized value in concatenation (.) or string at - line 8.
    B: (?-xism:(#\S+).+)
    C: (?-xism:(#\S+).+OOPS)

    In this example, $1 is undefined at A because no match that assigned a defined value to it has ever been done.

    At B, a regex is compiled using the undefined $1. Compilation produces an "uninitialized value" warning, and the compiled regex has nothing where $1 was. This is because an undefined value is stringized as the empty string. This is a good example of why warnings (and strict!) should always be enabled. You make no mention of any warning message, so I assume you did not do this. I turn the arched eyebrow of scorn upon you.

    At C, the same regex is compiled again after $1 has been given a defined value by a match. The result in this case is a potentially very problematic bug. Note that no warning message is generated! Good luck with this one.

    The take-away: always use warnings and strict.
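For contrast, a backreference written as \1 is part of the pattern itself, so it does not depend on whatever an earlier match left in $1. A small sketch, assuming whole whitespace-delimited tags are wanted (the sub name has_repeat and the stricter pattern are illustrative, not from the thread):

```perl
use strict;
use warnings;

# \1 refers to group 1 of *this* pattern at match time;
# nothing is interpolated when the regex is compiled.
my $rx = qr/(?<!\S)(#\S+)\s(?:.*\s)?\1(?!\S)/;

# Hypothetical helper: true if some whole tag occurs again later in the string.
sub has_repeat {
    my ($tags) = @_;
    return $tags =~ $rx ? 1 : 0;
}

'OOPS' =~ /(OOPS)/;    # sets $1, but has no effect on $rx

print has_repeat("#tag1 #tag2 #tag3 #tag1"), "\n";    # prints 1
print has_repeat("#tag1 #tag2 #tag3"),       "\n";    # prints 0
```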


    Give a man a fish:  <%-{-{-{-<

Re: Regex: remove non-adjacent duplicate hashtags
by perlfan (Parson) on Jul 24, 2022 at 04:03 UTC
    I know you're asking for a regexp, but here's another way to do it that also gives you an intermediate count of how many dupes there are:
    my $s = "#tag1 #tag2 #tag3 #tag1";
    print qq{Original: $s\n};
    my ($x, $d);
    map { ++$x->{$_}; $d->{$_}=$x->{$_} if $x->{$_} > 1 } split /\s+/, $s;
    printf qq{Uniq: %s\n}, join(' ', sort keys %$x);
    printf qq{Dupes found for: %s\n}, join(', ', sort keys %$d);
    Output:
    Original: #tag1 #tag2 #tag3 #tag1
    Uniq: #tag1 #tag2 #tag3
    Dupes found for: #tag1
    I am sure this can be shrunk down quite a bit, but I tried to strike a balance between terseness and readability. You can inspect $x for counts.
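Since sort keys returns the tags in sorted order rather than input order, a variant that keeps the first occurrence of each tag in its original position may be useful. A sketch using a plain hash (the names dedupe_tags and %count are illustrative, not from the post above):

```perl
use strict;
use warnings;

# Hypothetical helper: returns the unique tags in first-seen order,
# plus a hashref recording how often each tag occurred.
sub dedupe_tags {
    my ($line) = @_;
    my %count;
    # grep lets a tag through only the first time it is seen;
    # the post-increment still counts every occurrence in %count.
    my @uniq = grep { !$count{$_}++ } split ' ', $line;
    return (\@uniq, \%count);
}

my ($uniq, $count) = dedupe_tags("#tag3 #tag1 #tag2 #tag3");
my @dupes = grep { $count->{$_} > 1 } @$uniq;

print "Uniq: @$uniq\n";                              # Uniq: #tag3 #tag1 #tag2
print "Dupes found for: ", join(', ', @dupes), "\n"; # Dupes found for: #tag3
```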

      To solve the presented problem this type of approach is the one I would use too but with List::Util::uniq for speed and clarity. Core modules with XS are ideal for this type of task.

      #!/usr/bin/env perl
      use strict;
      use warnings;

      use List::Util 'uniq';

      my $tags = '#tag1 #tag2 #tag3 #tag1';
      my $uniq = join ' ', uniq split / /, $tags;
      print "Orig: $tags\nUniq: $uniq\n";

      🦛

        Yes, I was just showing the different approach which seemed to solve the issue, i.e., getting the list of unique tags. I also wanted to show how to do that with hashrefs; but sure what you say is another option.