in reply to Regex to match range of characters broken by dashes
Like choroba, I'm wondering: What's supposed to happen to the dash in the 4th position in the second string?
A-C-G--CTGGC
   ^ dash in 4th position
Assuming it should be replaced by $tag because it's between the quantified groups of bases, here's a multi-regex solution. (Warning: this needs Perl 5.10+ for the \K regex operator, but I can get around that fairly easily if needed.)
Of course, more test cases are highly encouraged!

c:\@Work\Perl>perl -wMstrict -le
"use 5.010;
 ;;
 use Test::More 'no_plan';
 use Test::NoWarnings;
 ;;
 my $tag = '___';
 ;;
 VECTOR:
 for my $ar_vector (
   [ qw(ATCGGATCTGGC  AT___CGGA___TCTGGC) ],
   [ qw(A-C-G--CTGGC  A-C___G--CTG___GC)  ],
   ) {
   if (! ref $ar_vector) {
     note $ar_vector;
     next VECTOR;
     }
 ;;
   my ($seq, $expected) = @$ar_vector;
   my $got = xform($seq);
   is $got, $expected, qq{'$seq' -> '$expected'};
   }
 ;;
 done_testing;
 ;;
 sub xform {
   my ($s) = @_;
 ;;
   my $u = qr{ [ATGC] -*? }xms;
 ;;
   $s =~ s{ $u{2} \K -* }{$tag}xms;
   $s =~ s{ $u{4} \K -* }{$tag}xms;
   return $s;
   }
"
ok 1 - 'ATCGGATCTGGC' -> 'AT___CGGA___TCTGGC'
ok 2 - 'A-C-G--CTGGC' -> 'A-C___G--CTG___GC'
1..2
ok 3 - no warnings
1..3
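To make the two-step logic easier to follow, here is a minimal sketch that records the string after each substitution. The helper name xform_steps is hypothetical (it is not part of the tested one-liner), and the sketch assumes $tag contains none of A, T, G, C or '-'. The second substitution appears to rely on that: a run of four bases cannot span the freshly inserted tag, so the match is pushed to the first four bases after it.

```perl
use strict;
use warnings;
use 5.010;    # for \K and say

my $tag = '___';                  # assumption: no base letters or dashes in the tag
my $u   = qr{ [ATGC] -*? }xms;    # one base, then as few dashes as possible

# Hypothetical helper: same two substitutions as xform(), but returns
# the original string plus the result after each step.
sub xform_steps {
    my ($s) = @_;
    my @steps = ($s);

    $s =~ s{ $u{2} \K -* }{$tag}xms;    # tag goes after the first 2 bases
    push @steps, $s;

    $s =~ s{ $u{4} \K -* }{$tag}xms;    # tag goes after the next 4 bases
    push @steps, $s;

    return @steps;
}

say join ' -> ', xform_steps($_) for qw(ATCGGATCTGGC A-C-G--CTGGC);
# ATCGGATCTGGC -> AT___CGGATCTGGC -> AT___CGGA___TCTGGC
# A-C-G--CTGGC -> A-C___G--CTGGC -> A-C___G--CTG___GC
```

In the second string, the dash in the 4th position is consumed by the trailing -* of the first substitution, which is why it is replaced by the tag rather than kept.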
Update: And yes, this does seem like an XY Problem.
Update 2: Here's the pre-5.10 (no \K) version of the code (tested):
$s =~ s{ ($bu{2}) -* }{$1$tag}xms;
$s =~ s{ ($bu{4}) -* }{$1$tag}xms;
And versions, also tested, consolidating the two substitutions in a for-loop:
$s =~ s{ (?:$bu){$_} \K -* } {$tag}xms for 2, 4; # 5.10+
$s =~ s{ ((?:$bu){$_}) -* } {$1$tag}xms for 2, 4; # pre-5.10
In all these variations, $bu is the same pattern that the one-liner above calls $u:
my $bu = qr{ [ATGC] -*? }xms;
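For anyone who wants to try the variations outside a Windows one-liner, here is a small sketch that wraps the two for-loop forms in subs and checks that they agree on the two test strings from this post. The sub names xform_k and xform_capture are made up for this illustration, and $tag is again assumed to be '___'.

```perl
use strict;
use warnings;
use 5.010;    # only the \K variant (and say) needs this

my $tag = '___';
my $bu  = qr{ [ATGC] -*? }xms;

# Hypothetical name: the 5.10+ variant using \K.
sub xform_k {
    my ($s) = @_;
    $s =~ s{ (?:$bu){$_} \K -* } {$tag}xms for 2, 4;
    return $s;
}

# Hypothetical name: the pre-5.10 variant using a capture group.
sub xform_capture {
    my ($s) = @_;
    $s =~ s{ ((?:$bu){$_}) -* } {$1$tag}xms for 2, 4;
    return $s;
}

for my $seq (qw(ATCGGATCTGGC A-C-G--CTGGC)) {
    my ($k, $c) = (xform_k($seq), xform_capture($seq));
    say "$seq -> $k ", ($k eq $c ? '(variants agree)' : "(capture variant gave $c)");
}
```

Both forms keep the matched bases and replace only the trailing dashes; \K does it by resetting the start of the match, the capture form by putting $1 back in front of the tag.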
Give a man a fish: <%-{-{-{-<