s/// only replacing most of the time...

mdunnbass has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,

I have a pattern that I'm looking to do a search and replace on, but I am finding that the S&R is only occurring about 2/3 of the time. I looked around a bit, and found this in the perlop page:

Occasionally, you can't use just a /g to get all the changes to occur.
+ Here are two common cases:


    # put commas in the right places in an integer
    1 while s/(.*\d)(\d\d\d)/$1,$2/g;      # perl4
    1 while s/(\d)(\d\d\d)(?!\d)/$1,$2/g;  # perl5

    # expand tabs to 8-column spacing
    1 while s/\t+/' ' x (length($&)*8 - length($`)%8)/e;
[download]

Now, the PATTERN I've been searching for is a complicated one. I was having trouble setting it up initially, and asked for help here and here. The solution I eventually wound up using was a modification of the one provided by johngg, seen here. The pertinent bits of the code I'm using are pasted below (use strict and warnings are both in use, all variables are declared, etc. This is just a snippet)

# patterns to search with are stored in @pattern Array
# patterns are strings of uc DNA (i.e. TTTAGT, etc)
# but can be interrupted by lc or non-DNA characters (i.e., html tags)

# searching for a needle in a haystack:
my $notNeedle = q{[^ACGTYRMKSWBDHVN]*};

 while ($g < $num) {  
   $needle[$g] = $pattern[$g];
   @{$parts[$g]} = split m/(\[|\]|\|)/, $needle[$g];
   my @array = @{$parts[$g]}; my @results;
   while (@array) {
     my $string = shift @array;
     if ($string eq '[') {
       $string .= (join "", splice(@array, 0, 2));
       }
     if ($string eq '|') {
       my $prev = pop @results;
       $string = (join "", $prev, $string, splice(@array, 0, 1));
       }
     push @results, $string;
     }
   @{$parts[$g]} = @results;
   $haystackPatt[$g] =  q{(} . join($notNeedle, @{$parts[$g]}) . q{)};
   $rxHaystack[$g]   = qr{$haystackPatt[$g]};   
   $g++;
   }

$text_seqs[$i] =~ s{$rxHaystack[$j]}{<span class=$class[$a]>$1</span>}
+gemx;
[download]

The problem is, if I search for something where $pattern[$i] = "TTT|AAA", I will get only about 2 out of every three AAAs or TTTs flanked by the <span> tags.

Does anyone have any ideas why this isn't replacing all of the PATTERNS? Any help, ideas, or links pointing to possible explanations would be greatly appreciated.

Thanks
Matt

Comment on s/// only replacing most of the time... Select or Download Code

Replies are listed 'Best First'.
Re: s/// only replacing most of the time... by suaveant (Parson) on May 25, 2007 at 20:07 UTC
I'd just like to point out that your comma adding logic is somewhat flawed. The following regexp from Friedl puts commas in a number without recursion... `s/(?<=\d)(?=(?:\d\d\d)+$)/,/g` [download] It is only when you need to replace within what you substituted that you usually need recursive replacement... - Ant - Some of my best work - (1 2 3)	[reply] [d/l]
Re^2: s/// only replacing most of the time... by mdunnbass (Monk) on May 29, 2007 at 15:27 UTC
I appreciate the comment, but you are actually referring to the perlop perldoc. The bit with the commas was just an example of the only bit of that perldoc that had any mention of s/// not replacing all the time. Otherwise, it has nothing to do with my script. But I thank you for the effort and the consideration nonetheless. Matt	[reply]
Re: s/// only replacing most of the time... by Thelonius (Priest) on May 25, 2007 at 21:49 UTC
Can you give an example where it's not doing what you want? Specifically can you include info like this: `print STDERR "\$hystackPatt[$j] = $haystackPatt[$j]\n"; print STDERR "\$rxHaystack[$j] = $rxHaystack[$j]\n"; print STDERR "BEFORE: \$text_seqs[$i] = $text_seqs[$i]\n"; $text_seqs[$i] =~ s{$rxHaystack[$j]}{<span class=$class[$a]>$1</span>} +gmx; print STDERR "AFTER: \$text_seqs[$i] = $text_seqs[$i]\n";` [download] I had to delete the "e" from the "gemx" on the end of your substitution.	[reply] [d/l]
Re: s/// only replacing most of the time... by suaveant (Parson) on May 25, 2007 at 20:12 UTC
Have you considered that maybe you are doing this the hard way? At the very least HTML::Strip may make your life easier without having to make highly complex regexen. Update never mind... I just realized you are adding the span tags yourself, so you probably want them to stay. Have you considered just marking all the text positions you need and then going by from the right hand side of the string back putting in your tags. This is pretty simple unless you have overlaps, and even then you could do it easier. For instance, keep a list of tags and their positions in the original string, storing the start and end tags, then strip off everything in the original string until the next tag, print the string, than print any tags that belong at that position, and go to the next position... FAR easier than re-iterating over and over again... Here is an example `$_ = 'TTTDDDTTTTTTX'; my @patterns = qw(TTT DDD TD TTTTTT); my %tags; for my $pat (@patterns) { while(/($pat)/g) { print "$pat = $-[0] $+[0]\n"; push @{$tags{$-[0]}}, "<SPAN CLASS=$pat>"; push @{$tags{$+[0]}}, "</SPAN>"; } } my $currentpos = 0; print "$_\n"; for my $pos (sort { $a <=> $b } keys %tags) { print substr($_,0,($pos-$currentpos),''); print join('',@{$tags{$pos}}); $currentpos = $pos; } print $_,"\n";` [download] - Ant - Some of my best work - (1 2 3)	[reply] [d/l]
Re^2: s/// only replacing most of the time... by mdunnbass (Monk) on May 29, 2007 at 15:34 UTC
I have given that some though. I just haven't worked out in my head how to deal with overlaps. However, I do agree that the way I'm doing it now is really really overblown and silly. One thing I noticed though, was the output of your code: `TTT = 0 3 TTT = 6 9 TTT = 9 12 DDD = 3 6 TD = 2 4 TTTTTT = 6 12` [download] It shows the TTT bit matching to 6-9 and 9-12. Ideally, I would need that to match 6-9, 7-10, 8-11, and 9-12. No biggie though. Let me play around with it a bit, and see how it works. Thanks Matt	[reply] [d/l]
Re^3: s/// only replacing most of the time... by suaveant (Parson) on May 29, 2007 at 17:05 UTC
A little bit of smoke and mirrors, then... $_ = 'TTTDDDTTTTTTX'; my @patterns = qw(TTT DDD TD TTTTTT); my %tags; for my $pat (@patterns) { my $pat2 = substr($pat,0,1).'(?='.substr($pat,1,length($pat)).')'; + # T(?=TT) only REALLY matches the first char, but that still works while(/($pat2)/g) { print "$pat = $-[0] $+[0]\n"; push @{$tags{$-[0]}}, "<SPAN CLASS=$pat>"; push @{$tags{$-[0]+length($pat)}}, "</SPAN>"; # instead of usi +ng $+[0] we use $-[0]+the length of the string matched, since it is a + fixed string, no problems. } } my $currentpos = 0; print "$_\n"; for my $pos (sort { $a <=> $b } keys %tags) { print substr($_,0,($pos-$currentpos),''); print join('',@{$tags{$pos}}); $currentpos = $pos; } print $_,"\n"; [download] - Ant - Some of my best work - (1 2 3)	[reply] [d/l]
Re: s/// only replacing most of the time... by mdunnbass (Monk) on Jun 05, 2007 at 14:57 UTC
I figured out what the main problem is. Correct me if I'm wrong, but the s/// leaves off at the position just after the last replacement (`$+[0]`). So, if you have `$str = q{TTTTTTT}; $str =~ s/TTT/ttt/i;` [download] the results is `$str = tttTTTT` But, if you throw a `/g` on the end of that, you'll have: `$str = ttttttT` The problem I've been having is that rather than having s/// leave off at `$+[0]`, I was assuming it left off at `$-[0] + 1`. The difference being, in the example above, with the `/g` included, the results would be: `$str = tttTTTT # after the first replacement $str = ttttTTT # after the second replacement .... $str = ttttttt # after all the global matches are said and done` [download] This is what I had been hoping for. So, my question becomes, is there a way to set s/// to start looking for replacements at `$-[0] + 1` when in a global context, rather than at `$+[0]`	[reply] [d/l] [select]
Re^2: s/// only replacing most of the time... by blazar (Canon) on Jun 05, 2007 at 18:24 UTC
`$str = q{TTTTTTT}; $str = s/TTT/ttt/i;` [download] That should be `=~`. So, my question becomes, is there a way to set s/// to start looking for replacements at `$-[0] + 1` when in a global context, rather than at `$+[0]` In Perl 6 it will be very easy, a matter of specifying the `:ov` (`:overlap`) modifier. In 5's realm you can play with the pos function. There's been a very recent discussion about these topics.	[reply] [d/l] [select]
Re^2: s/// only replacing most of the time... by suaveant (Parson) on Jun 06, 2007 at 03:23 UTC
My last reply does that here... you use positive lookahead to match only one char in reality, then use the string length you are matching to get the right hand position. You could also do it by using \G and pos, but I wouldn't suggest it. The positive lookahead works well with these fixed length string matches. - Ant - Some of my best work - (1 2 3)	[reply]