Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

Regex to match range of characters broken by dashes

by Q.and (Novice)
on Jul 15, 2016 at 21:44 UTC ( [id://1167857]=perlquestion: print w/replies, xml ) Need Help??

Q.and has asked for the wisdom of the Perl Monks concerning the following question:

I have not been able to make a regex that matches a discontinuous string and still follows the numeric range set in {}. See below, the first example gives correct behavior only when the string is continuous and the second demonstrates its flaws when the string is interrupted by dashes. Example one:

$seq = 'ATCGGATCTGGC'; $tag = '___'; $seq =~ s/[ATGC]{2}/$&$tag/; $seq =~ s/$tag[ATGC]{4}/$&$tag/;

printing $seq will output:

AT___CGGA___TCTGGC

And that’s exactly what they should do, BUT, if $seq has dashes, the regexes are not appropriate. In example 2 below, say that all is the same as above, except that

$seq = 'A-C-G--CTGGC';

Printing $seq now outputs:

A-C-G--CT___GGC

instead of the desired output where it effectively ignores dashes and only count letters:

A-C___G--CTG___GC

Any ideas on how I can write the regexes to match when gaps are included?

Replies are listed 'Best First'.
Re: Regex to match range of characters broken by dashes
by choroba (Cardinal) on Jul 15, 2016 at 22:20 UTC
    Where did the dash on the 4th position in the second string go?

    Assuming it should have stayed, there's a not-only-regex solution:

    #!/usr/bin/perl use warnings; use strict; use feature qw{ say }; my @expected = ('AT___CGGA___TCTGGC', 'A-C___-G--CTG___GC'); # ^ my $tag = '___'; for ('ATCGGATCTGGC', 'A-C-G--CTGGC', ) { my $seq = $_; my $pos = 0; my $count_letters = 0; for my $stop (2, 6) { until ($count_letters == $stop) { $seq =~ /([ATGC])/g; $pos = pos $seq; $count_letters++; } substr $seq, $pos, 0, $tag; pos $seq = $pos + length $tag; } my $exp = shift @expected; say "$seq\t$exp\t", $exp eq $seq; }

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re: Regex to match range of characters broken by dashes
by RonW (Parson) on Jul 15, 2016 at 22:17 UTC
    ... where it effectively ignores dashes....

    Preprocess the sequence to get ride of the dashes:

    $seq =~ s/-//g; # remove any dashes, first $seq =~ s/[ATGC]{2}/$&$tag/; $seq =~ s/$tag[ATGC]{4}/$&$tag/;

    Update: As pointed out, in some circumstances, the desired output can have dashes.

    So, add a look-ahead so that only single dashes are removed:

    $seq =~ s/-(?!-)//g;

      But his desired output has dashes.

Re: Regex to match range of characters broken by dashes
by hippo (Bishop) on Jul 15, 2016 at 21:57 UTC

    This has the suspicious look of an XY Problem to me. Would you care to describe the bigger picture?

Re: Regex to match range of characters broken by dashes
by AnomalousMonk (Archbishop) on Jul 16, 2016 at 06:30 UTC

    Like choroba, I'm wondering: What's supposed to happen to the dash in the 4th position in the second string?
        A-C-G--CTGGC
           ^ dash in 4th position

    Assuming it should be replaced by  $tag because it's between the quantified groups of bases, here's a multi-regex solution. (Warning: Needs Perl version 5.10+ for the  \K regex operator — but I can get around that fairly easily if needed.)

    c:\@Work\Perl>perl -wMstrict -le "use 5.010; ;; use Test::More 'no_plan'; use Test::NoWarnings; ;; my $tag = '___'; ;; VECTOR: for my $ar_vector ( [ qw(ATCGGATCTGGC AT___CGGA___TCTGGC) ], [ qw(A-C-G--CTGGC A-C___G--CTG___GC) ], ) { if (! ref $ar_vector) { note $ar_vector; next VECTOR; } ;; my ($seq, $expected) = @$ar_vector; my $got = xform($seq); is $got, $expected, qq{'$seq' -> '$expected'}; } ;; done_testing; ;; sub xform { my ($s) = @_; ;; my $u = qr{ [ATGC] -*? }xms; ;; $s =~ s{ $u{2} \K -* }{$tag}xms; $s =~ s{ $u{4} \K -* }{$tag}xms; return $s; } " ok 1 - 'ATCGGATCTGGC' -> 'AT___CGGA___TCTGGC' ok 2 - 'A-C-G--CTGGC' -> 'A-C___G--CTG___GC' 1..2 ok 3 - no warnings 1..3
    Of course, more test cases are highly encouraged!

    Update: And yes, this does seem like an XY Problem.

    Update 2: Here's the pre-5.10 (no \K) version of the code (tested):
        $s =~ s{ ($bu{2}) -* }{$1$tag}xms;
        $s =~ s{ ($bu{4}) -* }{$1$tag}xms;
    And versions, also tested, consolidating the two substitutions in a for-loop:
        $s =~ s{  (?:$bu){$_} \K -* }   {$tag}xms for 2, 4;  # 5.10+
        $s =~ s{ ((?:$bu){$_})   -* } {$1$tag}xms for 2, 4;  # pre-5.10
    In all these variations,
        my $bu = qr{ [ATGC] -*? }xms;


    Give a man a fish:  <%-{-{-{-<

Re: Regex to match range of characters broken by dashes
by CountZero (Bishop) on Jul 16, 2016 at 11:00 UTC
    This seems to work:
    use Modern::Perl qw/2015/; my $seq = 'A-C-G--CTGGC'; my $tag = '___'; $seq =~ s/([ATGC]-*[ATGC])-*/$1$tag/; $seq =~ s/($tag[ATGC]-*[ATGC]-*[ATGC]-*[ATGC])-*/$1$tag/; say $seq;

    CountZero

    A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
Re: Regex to match range of characters broken by dashes
by Anonymous Monk on Jul 15, 2016 at 22:08 UTC
    #!/usr/bin/perl -l # http://perlmonks.org/?node_id=1167857 use strict; use warnings; #my $seq = 'ATCGGATCTGGC'; my $seq = 'A-C-G--CTGGC'; my $tag = '___'; $seq =~ s/(?:-*[ATGC]-*){2}/$& =~ s#^-+|-+$##gr . $tag/e; $seq =~ s/$tag(?:-*[ATGC]-*){4}/$& =~ s#^-+|-+$##gr . $tag/e; print $seq;
Re: Regex to match range of characters broken by dashes
by Cristoforo (Curate) on Jul 17, 2016 at 19:20 UTC
    Here is a solution using a dynamic regex construct. I've never used it before and learned more about it here.

    A fly in the ointment was in the code for our $s (@stops). The code wouldn't work here with a 'my' declaration. 'our' was necessary.

    This is fairly readable and should work for any number of groups (provided they don't exceed the count of fasta characters in a string of them. I didn't test for that to see how it behaved).

    The dynamic regex form was necessary because the count of the quantifier changed for each iteration of the 'for' loop ($s-1).

    The printout after the __END__ token shows the results of the run.

    Update: Added a final substitution to remove dashes preceding and following the double underscore (as he desired in his post. Missed that.)

    #!/usr/bin/perl use strict; use warnings; use 5.014; my @stops = (2,6); # group by 2 then 4 (6 == 2 + 4) my $tag = '___'; for ('ATCGGATCTGGC', 'A-C-G--CTGGC') { my $seq = $_; for our $s (@stops) { # necessary to use 'our' instead of 'my' $seq =~ s/ ( # begin capture (??{ # dynamic regex "(?:[TAGC][^TAGC]*)" . # group to apply quantifi +er to "{" . ($s-1) . "}" . # quantifier "[TAGC]" # end token }) # end dynamic reference ) # end capture /$1$tag/x; # end of substitution } $seq =~ s/__-+/__/g; say $seq; } __END__ C:\Old_Data\perlp>perl dynamic_regex.pl AT___CGGA___TCTGGC A-C___G--CTG___GC C:\Old_Data\perlp>
      A fly in the ointment was in the code for our $s (@stops). The code wouldn't work here with a 'my' declaration. 'our' was necessary.

      This is a bug that was corrected in Perl version 5.18 IIRC. With this correction, lexical variables always work as expected in  "(?{ code })" and  "(??{ code })" regex constructs.

      The dynamic regex form was necessary because the count of the quantifier changed for each iteration of the 'for' loop ($s-1).

      I don't see the necessity here. Except for the fact that aliasing into the  @stop array makes calculating the quantifier a bit tedious, it can all be written normally, given that the  s/// match regex is, by default, re-compiled on each  s/// execution:

      c:\@Work\Perl\monks>perl -wMstrict -le "my @stops = (2,6); ;; my $tag = '___'; ;; for ('ATCGGATCTGGC', 'A-C-G--CTGGC') { my $seq = $_; printf qq{'$seq' -> }; ;; for our $s (@stops) { local our $q = $s - 1; $seq =~ s/ ((?:[TAGC][^TAGC]*){$q} [TAGC]) /$1$tag/x; } print qq{'$seq'}; } " 'ATCGGATCTGGC' -> 'AT___CGGA___TCTGGC' 'A-C-G--CTGGC' -> 'A-C___-G--CTG___GC'
      And except for say, it works under Perl version 5.8.9. See also Re: Regex to match range of characters broken by dashes Update 2 for another for-loop example.

      Update: I've based my code example on your original code, prior to adding the second  s/// fixup.


      Give a man a fish:  <%-{-{-{-<

        Thanks for pointing out points in my solution that can be stated cleaner.

        This is a bug that was corrected in Perl version 5.18 IIRC. With this correction, lexical variables always work as expected in "(?{ code })" and "(??{ code })" regex constructs.

        I wasn't aware of that bug. And your local our $q = $s - 1; fixes that.

        it can all be written normally, given that the s/// match regex is, by default, re-compiled on each s/// execution

        That is a nice solution! The (??{. . .}) construct wasn't necessary.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1167857]
Approved by Paladin
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others scrutinizing the Monastery: (7)
As of 2024-04-25 11:02 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found