Q.and has asked for the wisdom of the Perl Monks concerning the following question:
I have not been able to make a regex that matches a discontinuous string and still follows the numeric range set in {}. See below, the first example gives correct behavior only when the string is continuous and the second demonstrates its flaws when the string is interrupted by dashes. Example one:
$seq = 'ATCGGATCTGGC';
$tag = '___';
$seq =~ s/[ATGC]{2}/$&$tag/;
$seq =~ s/$tag[ATGC]{4}/$&$tag/;
printing $seq will output:
AT___CGGA___TCTGGC
And that’s exactly what they should do, BUT, if $seq has dashes, the regexes are not appropriate.
In example 2 below, say that all is the same as above, except that
$seq = 'A-C-G--CTGGC';
Printing $seq now outputs:
A-C-G--CT___GGC
instead of the desired output where it effectively ignores dashes and only count letters:
A-C___G--CTG___GC
Any ideas on how I can write the regexes to match when gaps are included?
Re: Regex to match range of characters broken by dashes
by choroba (Cardinal) on Jul 15, 2016 at 22:20 UTC
|
Where did the dash on the 4th position in the second string go?
Assuming it should have stayed, there's a not-only-regex solution:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
my @expected = ('AT___CGGA___TCTGGC', 'A-C___-G--CTG___GC');
# ^
my $tag = '___';
for ('ATCGGATCTGGC',
'A-C-G--CTGGC',
) {
my $seq = $_;
my $pos = 0;
my $count_letters = 0;
for my $stop (2, 6) {
until ($count_letters == $stop) {
$seq =~ /([ATGC])/g;
$pos = pos $seq;
$count_letters++;
}
substr $seq, $pos, 0, $tag;
pos $seq = $pos + length $tag;
}
my $exp = shift @expected;
say "$seq\t$exp\t", $exp eq $seq;
}
($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord
}map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
| [reply] [d/l] [select] |
Re: Regex to match range of characters broken by dashes
by RonW (Parson) on Jul 15, 2016 at 22:17 UTC
|
$seq =~ s/-//g; # remove any dashes, first
$seq =~ s/[ATGC]{2}/$&$tag/;
$seq =~ s/$tag[ATGC]{4}/$&$tag/;
Update: As pointed out, in some circumstances, the desired output can have dashes.
So, add a look-ahead so that only single dashes are removed:
$seq =~ s/-(?!-)//g;
| [reply] [d/l] [select] |
|
| [reply] |
Re: Regex to match range of characters broken by dashes
by hippo (Bishop) on Jul 15, 2016 at 21:57 UTC
|
| [reply] |
Re: Regex to match range of characters broken by dashes
by AnomalousMonk (Archbishop) on Jul 16, 2016 at 06:30 UTC
|
Like choroba, I'm wondering: What's supposed to happen to the dash in the 4th position in the second string?
A-C-G--CTGGC
^ dash in 4th position
Assuming it should be replaced by $tag because it's between the quantified groups of bases, here's a multi-regex solution. (Warning: Needs Perl version 5.10+ for the \K regex operator — but I can get around that fairly easily if needed.)
c:\@Work\Perl>perl -wMstrict -le
"use 5.010;
;;
use Test::More 'no_plan';
use Test::NoWarnings;
;;
my $tag = '___';
;;
VECTOR:
for my $ar_vector (
[ qw(ATCGGATCTGGC AT___CGGA___TCTGGC) ],
[ qw(A-C-G--CTGGC A-C___G--CTG___GC) ],
) {
if (! ref $ar_vector) {
note $ar_vector;
next VECTOR;
}
;;
my ($seq, $expected) = @$ar_vector;
my $got = xform($seq);
is $got, $expected, qq{'$seq' -> '$expected'};
}
;;
done_testing;
;;
sub xform {
my ($s) = @_;
;;
my $u = qr{ [ATGC] -*? }xms;
;;
$s =~ s{ $u{2} \K -* }{$tag}xms;
$s =~ s{ $u{4} \K -* }{$tag}xms;
return $s;
}
"
ok 1 - 'ATCGGATCTGGC' -> 'AT___CGGA___TCTGGC'
ok 2 - 'A-C-G--CTGGC' -> 'A-C___G--CTG___GC'
1..2
ok 3 - no warnings
1..3
Of course, more test cases are highly encouraged!
Update: And yes, this does seem like an XY Problem.
Update 2: Here's the pre-5.10 (no \K) version of the code (tested):
$s =~ s{ ($bu{2}) -* }{$1$tag}xms;
$s =~ s{ ($bu{4}) -* }{$1$tag}xms;
And versions, also tested, consolidating the two substitutions in a for-loop:
$s =~ s{ (?:$bu){$_} \K -* } {$tag}xms for 2, 4; # 5.10+
$s =~ s{ ((?:$bu){$_}) -* } {$1$tag}xms for 2, 4; # pre-5.10
In all these variations,
my $bu = qr{ [ATGC] -*? }xms;
Give a man a fish: <%-{-{-{-<
| [reply] [d/l] [select] |
Re: Regex to match range of characters broken by dashes
by CountZero (Bishop) on Jul 16, 2016 at 11:00 UTC
|
use Modern::Perl qw/2015/;
my $seq = 'A-C-G--CTGGC';
my $tag = '___';
$seq =~ s/([ATGC]-*[ATGC])-*/$1$tag/;
$seq =~ s/($tag[ATGC]-*[ATGC]-*[ATGC]-*[ATGC])-*/$1$tag/;
say $seq;
CountZero A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James My blog: Imperial Deltronics
| [reply] [d/l] |
Re: Regex to match range of characters broken by dashes
by Anonymous Monk on Jul 15, 2016 at 22:08 UTC
|
#!/usr/bin/perl -l
# http://perlmonks.org/?node_id=1167857
use strict;
use warnings;
#my $seq = 'ATCGGATCTGGC';
my $seq = 'A-C-G--CTGGC';
my $tag = '___';
$seq =~ s/(?:-*[ATGC]-*){2}/$& =~ s#^-+|-+$##gr . $tag/e;
$seq =~ s/$tag(?:-*[ATGC]-*){4}/$& =~ s#^-+|-+$##gr . $tag/e;
print $seq;
| [reply] [d/l] |
Re: Regex to match range of characters broken by dashes
by Cristoforo (Curate) on Jul 17, 2016 at 19:20 UTC
|
Here is a solution using a dynamic regex construct. I've never used it before and learned more about it here.
A fly in the ointment was in the code for our $s (@stops). The code wouldn't work here with a 'my' declaration. 'our' was necessary.
This is fairly readable and should work for any number of groups (provided they don't exceed the count of fasta characters in a string of them. I didn't test for that to see how it behaved).
The dynamic regex form was necessary because the count of the quantifier changed for each iteration of the 'for' loop ($s-1).
The printout after the __END__ token shows the results of the run.
Update: Added a final substitution to remove dashes preceding and following the double underscore (as he desired in his post. Missed that.)
#!/usr/bin/perl
use strict;
use warnings;
use 5.014;
my @stops = (2,6); # group by 2 then 4 (6 == 2 + 4)
my $tag = '___';
for ('ATCGGATCTGGC', 'A-C-G--CTGGC') {
my $seq = $_;
for our $s (@stops) { # necessary to use 'our' instead of 'my'
$seq =~ s/
( # begin capture
(??{ # dynamic regex
"(?:[TAGC][^TAGC]*)" . # group to apply quantifi
+er to
"{" . ($s-1) . "}" . # quantifier
"[TAGC]" # end token
}) # end dynamic reference
) # end capture
/$1$tag/x; # end of substitution
}
$seq =~ s/__-+/__/g;
say $seq;
}
__END__
C:\Old_Data\perlp>perl dynamic_regex.pl
AT___CGGA___TCTGGC
A-C___G--CTG___GC
C:\Old_Data\perlp>
| [reply] [d/l] [select] |
|
A fly in the ointment was in the code for our $s (@stops). The code wouldn't work here with a 'my' declaration. 'our' was necessary.
This is a bug that was corrected in Perl version 5.18 IIRC. With this correction, lexical variables always work as expected in "(?{ code })" and "(??{ code })" regex constructs.
The dynamic regex form was necessary because the count of the quantifier changed for each iteration of the 'for' loop ($s-1).
I don't see the necessity here. Except for the fact that aliasing into the @stop array makes calculating the quantifier a bit tedious, it can all be written normally, given that the s/// match regex is, by default, re-compiled on each s/// execution:
c:\@Work\Perl\monks>perl -wMstrict -le
"my @stops = (2,6);
;;
my $tag = '___';
;;
for ('ATCGGATCTGGC', 'A-C-G--CTGGC') {
my $seq = $_;
printf qq{'$seq' -> };
;;
for our $s (@stops) {
local our $q = $s - 1;
$seq =~ s/
((?:[TAGC][^TAGC]*){$q} [TAGC])
/$1$tag/x;
}
print qq{'$seq'};
}
"
'ATCGGATCTGGC' -> 'AT___CGGA___TCTGGC'
'A-C-G--CTGGC' -> 'A-C___-G--CTG___GC'
And except for say, it works under Perl version 5.8.9. See also Re: Regex to match range of characters broken by dashes Update 2 for another for-loop example.
Update: I've based my code example on your original code, prior to adding the second s/// fixup.
Give a man a fish: <%-{-{-{-<
| [reply] [d/l] [select] |
|
Thanks for pointing out points in my solution that can be stated cleaner.
This is a bug that was corrected in Perl version 5.18 IIRC. With this correction, lexical variables always work as expected in "(?{ code })" and "(??{ code })" regex constructs.
I wasn't aware of that bug. And your local our $q = $s - 1; fixes that.
it can all be written normally, given that the s/// match regex is, by default, re-compiled on each s/// execution
That is a nice solution! The (??{. . .}) construct wasn't necessary.
| [reply] [d/l] [select] |
|
|
|