Regex to match range of characters broken by dashes

Q.and has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Regex to match range of characters broken by dashes by choroba (Cardinal) on Jul 15, 2016 at 22:20 UTC
Where did the dash on the 4th position in the second string go? Assuming it should have stayed, there's a not-only-regex solution: `#!/usr/bin/perl use warnings; use strict; use feature qw{ say }; my @expected = ('AT___CGGA___TCTGGC', 'A-C___-G--CTG___GC'); # ^ my $tag = '___'; for ('ATCGGATCTGGC', 'A-C-G--CTGGC', ) { my $seq = $_; my $pos = 0; my $count_letters = 0; for my $stop (2, 6) { until ($count_letters == $stop) { $seq =~ /([ATGC])/g; $pos = pos $seq; $count_letters++; } substr $seq, $pos, 0, $tag; pos $seq = $pos + length $tag; } my $exp = shift @expected; say "$seq\t$exp\t", $exp eq $seq; }` [download] ($q=q:Sq=~/;[c](.)(.)/;chr(-\|\|-\|5+lengthSq)`"S\|oS2"`map{chr \|+ord }map{substrSq`S_+\|`\|}3E\|-\|`7**2-3:)=~y+S\|`+$1,++print+eval$q,q,a, [download]	[reply] [d/l] [select]
Re: Regex to match range of characters broken by dashes by RonW (Parson) on Jul 15, 2016 at 22:17 UTC
... where it effectively ignores dashes.... Preprocess the sequence to get ride of the dashes: `$seq =~ s/-//g; # remove any dashes, first $seq =~ s/[ATGC]{2}/$&$tag/; $seq =~ s/$tag[ATGC]{4}/$&$tag/;` [download] Update: As pointed out, in some circumstances, the desired output can have dashes. So, add a look-ahead so that only single dashes are removed: `$seq =~ s/-(?!-)//g;` [download]	[reply] [d/l] [select]
Re^2: Regex to match range of characters broken by dashes by Anonymous Monk on Jul 15, 2016 at 22:21 UTC
But his desired output has dashes.	[reply]
Re: Regex to match range of characters broken by dashes by hippo (Bishop) on Jul 15, 2016 at 21:57 UTC
This has the suspicious look of an XY Problem to me. Would you care to describe the bigger picture?	[reply]
Re: Regex to match range of characters broken by dashes by AnomalousMonk (Archbishop) on Jul 16, 2016 at 06:30 UTC
Like choroba, I'm wondering: What's supposed to happen to the dash in the 4th position in the second string? `A-C-G--CTGGC` `^ dash in 4th position` Assuming it should be replaced by `$tag` because it's between the quantified groups of bases, here's a multi-regex solution. (Warning: Needs Perl version 5.10+ for the `\K` regex operator — but I can get around that fairly easily if needed.) c:\@Work\Perl>perl -wMstrict -le "use 5.010; ;; use Test::More 'no_plan'; use Test::NoWarnings; ;; my $tag = '___'; ;; VECTOR: for my $ar_vector ( [ qw(ATCGGATCTGGC AT___CGGA___TCTGGC) ], [ qw(A-C-G--CTGGC A-C___G--CTG___GC) ], ) { if (! ref $ar_vector) { note $ar_vector; next VECTOR; } ;; my ($seq, $expected) = @$ar_vector; my $got = xform($seq); is $got, $expected, qq{'$seq' -> '$expected'}; } ;; done_testing; ;; sub xform { my ($s) = @_; ;; my $u = qr{ [ATGC] -? }xms; ;; $s =~ s{ $u{2} \K - }{$tag}xms; $s =~ s{ $u{4} \K -* }{$tag}xms; return $s; } " ok 1 - 'ATCGGATCTGGC' -> 'AT___CGGA___TCTGGC' ok 2 - 'A-C-G--CTGGC' -> 'A-C___G--CTG___GC' 1..2 ok 3 - no warnings 1..3 [download] Of course, more test cases are highly encouraged! Update: And yes, this does seem like an XY Problem. Update 2: Here's the pre-5.10 (no `\K`) version of the code (tested): `$s =~ s{ ($bu{2}) -* }{$1$tag}xms;` `$s =~ s{ ($bu{4}) -* }{$1$tag}xms;` And versions, also tested, consolidating the two substitutions in a `for`-loop: `$s =~ s{ (?:$bu){$_} \K -* } {$tag}xms for 2, 4; # 5.10+` `$s =~ s{ ((?:$bu){$_}) -* } {$1$tag}xms for 2, 4; # pre-5.10` In all these variations, `my $bu = qr{ [ATGC] -*? }xms;` Give a man a fish: `<%-{-{-{-<`	[reply] [d/l] [select]
Re: Regex to match range of characters broken by dashes by CountZero (Bishop) on Jul 16, 2016 at 11:00 UTC
This seems to work: `use Modern::Perl qw/2015/; my $seq = 'A-C-G--CTGGC'; my $tag = '___'; $seq =~ s/([ATGC]-[ATGC])-/$1$tag/; $seq =~ s/($tag[ATGC]-[ATGC]-[ATGC]-[ATGC])-/$1$tag/; say $seq;` [download] CountZero A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James My blog: Imperial Deltronics	[reply] [d/l]
Re: Regex to match range of characters broken by dashes by Anonymous Monk on Jul 15, 2016 at 22:08 UTC
`#!/usr/bin/perl -l # http://perlmonks.org/?node_id=1167857 use strict; use warnings; #my $seq = 'ATCGGATCTGGC'; my $seq = 'A-C-G--CTGGC'; my $tag = '___'; $seq =~ s/(?:-[ATGC]-){2}/$& =~ s#^-+\|-+$##gr . $tag/e; $seq =~ s/$tag(?:-[ATGC]-){4}/$& =~ s#^-+\|-+$##gr . $tag/e; print $seq;` [download]	[reply] [d/l]
Re: Regex to match range of characters broken by dashes by Cristoforo (Curate) on Jul 17, 2016 at 19:20 UTC
Here is a solution using a dynamic regex construct. I've never used it before and learned more about it here. A fly in the ointment was in the code `for our $s (@stops)`. The code wouldn't work here with a 'my' declaration. 'our' was necessary. This is fairly readable and should work for any number of groups (provided they don't exceed the count of fasta characters in a string of them. I didn't test for that to see how it behaved). The dynamic regex form was necessary because the count of the quantifier changed for each iteration of the 'for' loop `($s-1)`. The printout after the `__END__` token shows the results of the run. Update: Added a final substitution to remove dashes preceding and following the double underscore (as he desired in his post. Missed that.) #!/usr/bin/perl use strict; use warnings; use 5.014; my @stops = (2,6); # group by 2 then 4 (6 == 2 + 4) my $tag = '___'; for ('ATCGGATCTGGC', 'A-C-G--CTGGC') { my $seq = $_; for our $s (@stops) { # necessary to use 'our' instead of 'my' $seq =~ s/ ( # begin capture (??{ # dynamic regex "(?:[TAGC][^TAGC]*)" . # group to apply quantifi +er to "{" . ($s-1) . "}" . # quantifier "[TAGC]" # end token }) # end dynamic reference ) # end capture /$1$tag/x; # end of substitution } $seq =~ s/__-+/__/g; say $seq; } __END__ C:\Old_Data\perlp>perl dynamic_regex.pl AT___CGGA___TCTGGC A-C___G--CTG___GC C:\Old_Data\perlp> [download]	[reply] [d/l] [select]
Re^2: Regex to match range of characters broken by dashes by AnomalousMonk (Archbishop) on Jul 17, 2016 at 23:47 UTC
A fly in the ointment was in the code `for our $s (@stops)`. The code wouldn't work here with a 'my' declaration. 'our' was necessary. This is a bug that was corrected in Perl version 5.18 IIRC. With this correction, lexical variables always work as expected in `"(?{ code })"` and `"(??{ code })"` regex constructs. The dynamic regex form was necessary because the count of the quantifier changed for each iteration of the 'for' loop `($s-1)`. I don't see the necessity here. Except for the fact that aliasing into the `@stop` array makes calculating the quantifier a bit tedious, it can all be written normally, given that the `s///` match regex is, by default, re-compiled on each `s///` execution: `c:\@Work\Perl\monks>perl -wMstrict -le "my @stops = (2,6); ;; my $tag = '___'; ;; for ('ATCGGATCTGGC', 'A-C-G--CTGGC') { my $seq = $_; printf qq{'$seq' -> }; ;; for our $s (@stops) { local our $q = $s - 1; $seq =~ s/ ((?:[TAGC][^TAGC]){$q} [TAGC]) /$1$tag/x; } print qq{'$seq'}; } " 'ATCGGATCTGGC' -> 'AT___CGGA___TCTGGC' 'A-C-G--CTGGC' -> 'A-C___-G--CTG___GC'` [download] And except for say, it works under Perl version 5.8.9. See also Re: Regex to match range of characters broken by dashes Update 2 for another `for`-loop example. Update:* I've based my code example on your original code, prior to adding the second `s///` fixup. Give a man a fish: `<%-{-{-{-<`	[reply] [d/l] [select]
Re^3: Regex to match range of characters broken by dashes by Cristoforo (Curate) on Jul 18, 2016 at 01:26 UTC
Thanks for pointing out points in my solution that can be stated cleaner. This is a bug that was corrected in Perl version 5.18 IIRC. With this correction, lexical variables always work as expected in "(?{ code })" and "(??{ code })" regex constructs. I wasn't aware of that bug. And your `local our $q = $s - 1;` fixes that. it can all be written normally, given that the s/// match regex is, by default, re-compiled on each s/// execution That is a nice solution! The `(??{. . .})` construct wasn't necessary.	[reply] [d/l] [select]
Re^4: Regex to match range of characters broken by dashes by AnomalousMonk (Archbishop) on Jul 18, 2016 at 04:07 UTC


We don't bite newbies here... much
	PerlMonks