Regex: remove non-adjacent duplicate hashtags

element22 has asked for the wisdom of the Perl Monks concerning the following question:

I have a file that contains hashtags (on a single line). Some of them may be repeated and non-adjacent, e.g. "#tag1 #tag2 #tag3 #tag1". I can detect duplicates with this:

$tags=~/(#\S+).+$1/;

But I fail to remove the duplicate with the line below. What am I doing wrong?

$tags=~s/(#\S+\s?)(.+$1)/$2/g;


Replies are listed 'Best First'.
Re: Regex: remove non-adjacent duplicate hashtags by LanX (Saint) on Jul 23, 2022 at 13:36 UTC
> What am I doing wrong?

you have multiple issues:

- use `\1` instead of `$1` on the match side
- `/g` moves your start `pos` to where the last match stopped, so you either need to use a lookahead `(?=)` or `\K` to keep the position after the first group
- your `$1` will include whitespace at the end and won't match at `\n`

here a demo with `perl -dE0`

    DB<70> $r = " #1 #2 #3 #1 #3 #2 #3 #2"
    DB<72> $_=$r; s/(#\S+\s?)(.+$1)(?{say "$_ ($1) - ($2) pos:",pos($_)})/$2/g ;say "fin: $_"
     #1 #2 #3 #1 #3 #2 #3 #2 (#1 ) - (#2 #3 #1 #3 #2 #3 #2) pos:24
    fin: #2 #3 #1 #3 #2 #3 #2

you can use this as a template to experiment and debug with `(?{...})`

update: like this, using `\K`

    DB<73> $_=$r; s/(#\S+\s?)\K(.+\1)(?{say "$_ ($1) - ($2) pos:",pos($_)})//g ;say
     #1 #2 #3 #1 #3 #2 #3 #2 (#1 ) - (#2 #3 #1 ) pos:13
     #1 #2 #3 #1 #3 #2 #3 #2 (#3 ) - (#2 #3 ) pos:22
     #1 #3 #2

update: like this, using `(?=...)`

    DB<81> sub DEB {say "$_ ($1) - ($2) pos:",pos($_)}
    DB<82> $_=$r; s/(#\S+\s?)(?=(.+\1))(?{DEB})//g ;say "fin: $_"
     #1 #2 #3 #1 #3 #2 #3 #2 (#1 ) - (#2 #3 #1 ) pos:4
     #1 #2 #3 #1 #3 #2 #3 #2 (#2 ) - (#3 #1 #3 #2 ) pos:7
     #1 #2 #3 #1 #3 #2 #3 #2 (#3 ) - (#1 #3 #2 #3 ) pos:10
     #1 #2 #3 #1 #3 #2 #3 #2 (#3 ) - (#2 #3 ) pos:16
     #1 #2 #3 #1 #3 #2 #3 #2 (#2) - ( #3 #2) pos:18
    fin: #1 #3 #2
    DB<83>

update: as you can see in the debug output, `\K` was not the best idea, but YMMV

update: you still need to fix the whitespace issue ...

... and IIRC there was a meta to keep the pos (`\g` or `\G` ?), please check `perlre` (I might misremember tho)

Cheers Rolf
(addicted to the Perl Programming Language :) Wikisyntax for the Monastery
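Putting LanX's three points together (the backreference `\1`, a lookahead so that `/g` does not skip past the rest of the line, and whitespace kept out of the capture), one possible version looks like the sketch below. It assumes the tags are separated by whitespace on a single line and that keeping the last occurrence of each tag is acceptable. The sub name `drop_earlier_duplicates` is made up for illustration and does not come from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: remove every tag that occurs again later in the string,
# so that only the last occurrence of each tag survives.
sub drop_earlier_duplicates {
    my ($tags) = @_;
    # (?<!\S) and (?!\S) require whitespace or a string edge around the tag,
    # so "#tag1" is not mistaken for a duplicate of "#tag10".
    # \1 is a backreference resolved during the match; the lookahead consumes
    # nothing, so /g resumes directly after the tag that was removed.
    $tags =~ s/(?<!\S)(#\S+)\s+(?=(?:.*\s)?\1(?!\S))//g;
    return $tags;
}

print drop_earlier_duplicates("#1 #2 #3 #1 #3 #2 #3 #2"), "\n";   # prints "#1 #3 #2"
```

Because the separator `\s+` sits outside the capture group, the last tag on the line can still be recognized as the duplicate even when it is followed by a newline rather than a space.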
Re: Regex: remove non-adjacent duplicate hashtags (updated) by haukex (Archbishop) on Jul 23, 2022 at 13:10 UTC
You need to use `\1` instead of `$1`, see perlre. You may also want to move the `\s?` out of the capture group. Update: Plus, what LanX said about using a lookahead!
Re^2: Regex: remove non-adjacent duplicate hashtags by element22 (Novice) on Jul 23, 2022 at 13:24 UTC
Thanks, it works. Here's my more elaborate version: `$tags=~s/(#\S{2,}\b)(.*\s?)(\1)(\s?.*)/$3$2$4/g;`
Re: Regex: remove non-adjacent duplicate hashtags by tybalt89 (Monsignor) on Jul 23, 2022 at 18:54 UTC
    #!/usr/bin/perl
    use strict; # https://perlmonks.org/?node_id=11145665
    use warnings;

    $_ = "#tag1 #tag2 #tag3 #tag1\n";
    print " original: $_";
    s/(#\w{2,})\b\h*(?=.*\1\b)//g;
    print " keeping only the last one: $_";

    $_ = "#tag1 #tag2 #tag3 #tag1\n";
    1 while s/(#\w{2,})\b.*\K\1\b\h*//;
    print "keeping only the first one: $_";

Outputs:

     original: #tag1 #tag2 #tag3 #tag1
     keeping only the last one: #tag2 #tag3 #tag1
    keeping only the first one: #tag1 #tag2 #tag3
Re: Regex: remove non-adjacent duplicate hashtags (updated) by AnomalousMonk (Archbishop) on Jul 23, 2022 at 21:32 UTC
`$tags=~/(#\S+).+$1/;`

Why is it a bad idea to compile a capture variable, e.g., `$1`, into a regex as haukex and LanX have noted? Capture variables have the values they were assigned on execution of the most recent successful `m//` or `s///` match, or undef if no successful match has ever been done.

    Win8 Strawberry 5.8.9.5 (32) Sat 07/23/2022 16:42:27
    C:\@Work\Perl\monks
    >perl
    use strict;
    use warnings;

    use Data::Dump qw(dd);

    dd 'A: $1', $1;

    my $rx = qr/(#\S+).+$1/;
    print "B: $rx \n";

    'OOPS' =~ /(OOPS)/;

    $rx = qr/(#\S+).+$1/;
    print "C: $rx \n";
    ^Z
    ("A: \$1", undef)
    Use of uninitialized value in concatenation (.) or string at - line 8.
    B: (?-xism:(#\S+).+)
    C: (?-xism:(#\S+).+OOPS)

In this example, `$1` is undefined at A because no match that assigned a defined value to it has ever been done.

At B, a regex is compiled using the undefined `$1`. Compilation produces an "uninitialized value" warning, and the compiled regex has nothing where `$1` was. This is because an undefined value is stringized as the empty string. This is a good example of why warnings (and strict!) should always be enabled. You make no mention of any warning message, so I assume you did not do this. I turn the arched eyebrow of scorn upon you.

At C, the same regex is compiled again after `$1` has been given a defined value by a match. The result in this case is a potentially very problematic bug. Note that no warning message is generated! Good luck with this one.

The take-away: always use warnings and strict.

Give a man a fish: `<%-{-{-{-<`
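The difference can also be seen on the original tag string. The sketch below contrasts a pattern that interpolates a stale `$1` with one that uses the backreference `\1`; the two sub names are invented for this illustration, and the `'OOPS'` match simply mimics the situation at C above.

```perl
use strict;
use warnings;

# Hypothetical helper: match with $1 interpolated into the pattern.
sub match_with_capture_var {
    my ($tags) = @_;
    'OOPS' =~ /(OOPS)/;                      # unrelated match leaves $1 = "OOPS"
    return $tags =~ /(#\S+).+$1/ ? 1 : 0;    # pattern is really /(#\S+).+OOPS/
}

# Hypothetical helper: match with a backreference, returning the repeated tag.
sub match_with_backref {
    my ($tags) = @_;
    # \1 refers to whatever group 1 captured during this very match attempt.
    return $tags =~ /(#\S+).+\1/ ? $1 : undef;
}

my $tags = "#tag1 #tag2 #tag3 #tag1";
printf "with \$1: %d\n", match_with_capture_var($tags);               # with $1: 0
printf "with \\1: %s\n", match_with_backref($tags) // 'no match';     # with \1: #tag1
```

The first sub reports no duplicate even though one is present, and it does so silently, because `$1` is defined and no warning is triggered.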
Re: Regex: remove non-adjacent duplicate hashtags by perlfan (Parson) on Jul 24, 2022 at 04:03 UTC
I know you're asking for a regexp, but here's another way to do it that will give you an intermediate count of how many dupes:

    my $s = "#tag1 #tag2 #tag3 #tag1";
    print qq{Original: $s\n};
    my ($x, $d);
    map { ++$x->{$_}; $d->{$_}=$x->{$_} if $x->{$_} > 1 } split /\s+/, $s;
    printf qq{Uniq: %s\n}, join(' ', sort keys %$x);
    printf qq{Dupes found for: %s\n}, join(', ', sort keys %$d);

Output:

    Original: #tag1 #tag2 #tag3 #tag1
    Uniq: #tag1 #tag2 #tag3
    Dupes found for: #tag1

I am sure this can be shrunk down quite a bit, but I tried to strike a balance between terseness and readability. You can inspect `$x` for counts.
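One thing to keep in mind with the hash approach above is that `sort keys` returns the tags in sorted order, not in the order they appeared on the line. If the original order matters (an assumption, since the question does not say), a variation along these lines keeps the first-seen order while still counting; `unique_tags_with_counts` is a made-up name.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the tags in first-seen order plus a count per tag.
sub unique_tags_with_counts {
    my ($line) = @_;
    my (%count, @unique);
    for my $tag (split ' ', $line) {             # split ' ' ignores leading whitespace
        push @unique, $tag unless $count{$tag}++;   # keep only the first occurrence
    }
    return (\@unique, \%count);
}

my ($unique, $count) = unique_tags_with_counts("#tag3 #tag1 #tag2 #tag1");
print "Uniq: @$unique\n";                                               # Uniq: #tag3 #tag1 #tag2
print "Dupes found for: ",
    join(', ', grep { $count->{$_} > 1 } @$unique), "\n";               # Dupes found for: #tag1
```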
Re^2: Regex: remove non-adjacent duplicate hashtags by hippo (Archbishop) on Jul 24, 2022 at 09:20 UTC
To solve the presented problem this type of approach is the one I would use too but with List::Util::uniq for speed and clarity. Core modules with XS are ideal for this type of task.

    #!/usr/bin/env perl
    use strict;
    use warnings;

    use List::Util 'uniq';

    my $tags = '#tag1 #tag2 #tag3 #tag1';
    my $uniq = join ' ', uniq split / /, $tags;
    print "Orig: $tags\nUniq: $uniq\n";

🦛
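A small caveat for real input: `split / /` produces empty fields when tags are separated by more than one space, and it leaves a trailing newline attached to the last tag. Assuming the line comes straight from a file, a slightly more forgiving sketch uses the special `split ' '` form instead. The sub name `uniq_tags` is hypothetical, and `uniq` needs a reasonably recent List::Util (1.45 or later, as far as I know).

```perl
use strict;
use warnings;
use List::Util 'uniq';

# Hypothetical helper: dedupe whitespace-separated tags, keeping first occurrences.
sub uniq_tags {
    my ($line) = @_;
    # split ' ' skips leading whitespace and splits on any run of whitespace,
    # so extra spaces and the trailing newline do not create bogus entries.
    return join ' ', uniq split ' ', $line;
}

print uniq_tags("  #tag1   #tag2 #tag3 #tag1\n"), "\n";   # prints "#tag1 #tag2 #tag3"
```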
Re^3: Regex: remove non-adjacent duplicate hashtags by perlfan (Parson) on Jul 26, 2022 at 06:27 UTC
Yes, I was just showing a different approach that seemed to solve the issue, i.e., getting the list of unique tags. I also wanted to show how to do that with hashrefs; but sure, what you say is another option.
Re^4: Regex: remove non-adjacent duplicate hashtags by AnomalousMonk (Archbishop) on Jul 26, 2022 at 10:19 UTC
