in reply to Regex related question

This single regexp might be what you're after. It uses backreferences to make sure that you match the same thing three or more times, and then keeps only the first two occurrences. All this has to happen at the end of the string.

s/(\w){2}\1+\Z/$1$1/;

Update: Perlbotics solution above is it. ;) Bah, hate it when that happens!

To get a good explanation, run the following one-liner:

perl -MYAPE::Regex::Explain -E 'say YAPE::Regex::Explain->new(qr/( \w){2}\1+\Z/)->explain();'

Update2:

So since I botched it, and the best s/// construct was already posted, I figured I may as well have a little fun with my walk of shame. :)

The following is a substr approach that is more in keeping with how life was before every programming language developed Perl-envy and incorporated its own version of regular expressions (no, Perl didn't invent them, but was a big part of popularizing them). Have a look and enjoy knowing that you live in a Perlish world instead.

use strict; use warnings; use v5.12; my $string = "abcdefggggggggggg"; my $position = length( $string ) - 1; my $find = substr $string, $position, 1; $position-- while substr( $string, $position, 1 ) eq $find; substr( $string, $position + 3, length( $string ) - ( $position + 3 ), + '') if length( $string ) - $position > 3; say $string;

Dave

Replies are listed 'Best First'.
Re^2: Regex related question
by Marshall (Canon) on Aug 08, 2011 at 08:22 UTC
    I think there needs to be a condition so that the last substr is only run if needed. I came up with a similar coding.. If speed is of interest, then I would benchmark these substr approach vs the regex. I've found that sometimes the s/// can be slow, but the regex engine evolves all the time so benchmarking would be the only way to really know for the Perl that is being used.
    #!/usr/bin/perl -w use strict; my @strings = qw ( ACTGCTAGGGGGGG TCAGCTAGCNA ACTGSCGACAAAA GTCTGAGTTATTT); foreach my $str (@strings) { my $last_char = substr ($str,-1,1); my $cur_index = -1; while (substr ($str, --$cur_index,1) eq $last_char){} print "old: $str \n"; substr ($str,$cur_index+1,-$cur_index-3,"") if ($cur_index < 3); print "new: $str\n"; } __END__ old: ACTGCTAGGGGGGG new: ACTGCTAGG old: TCAGCTAGCNA new: TCAGCTAGCNA old: ACTGSCGACAAAA new: ACTGSCGACAA old: GTCTGAGTTATTT new: GTCTGAGTTATT

      I usually would say that the minor speed difference shouldn't matter. But all I know about genome mapping is that it's computationally intensive, so checking it out is probably a good idea.


      Dave

Re^2: Regex related question
by Hena (Friar) on Aug 08, 2011 at 07:26 UTC
    This doesn't quite work. It breaks if there is duplication in end. See for example input: GCTGTGTGTGT.

    Thanks for trying though :).

      :) That darn quantifier position is a pain in the rear. ;) The best way to fix it would be to use Perlbotics solution above.

      I posted a "just for fun" version as an update.


      Dave