Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery
 
PerlMonks  

Regular expression for a comma separated string

by naderra (Novice)
on Nov 23, 2014 at 23:22 UTC ( [id://1108194]=perlquestion: print w/replies, xml ) Need Help??

naderra has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I would like to define a regular expression that matches strings consisting of two sub-strings separated by a single comma. Each sub-string may not be empty, and consists exclusively and without repetition of the characters 'A', 'G', 'C' and 'T'. Thus, the pattern should match strings such as:

A,G
AG,CT
TC,CA <- correct, 1st and 2nd sub-strings may have characters in common
( as long as these are not repeated within the sub-string ).
GAT,CGA
CGAT,TG <- correct, sub-strings may be of different length.
etc ...

and should not match:

,G <- missing 1st sub-string
ACGT <- missing comma
X,A <- incorrect character X
AA,G <- repetition of character A in 1st substring
AC,GGC <- repetition of character G in 2nd sub-string
ATGA,TGG <- repetition in both sub-strings
ATCXG,AAC <- incorrect character X and repetition in 2nd
sub-string
etc ...

So far I have:

/^(?=[ACGT]{1,4},[ACGT]{1,4}$)(?!.*(.).*\1.*,)(?!,.*(.).*\1).*$/

/^(?=[ACGT]{1,4},[ACGT]{1,4}$)(?!.*(.).*\g{1}.*,)(?!,.*(.).*\g{1}).*$/

also tried joining the capture groups with:

/^(?=[ACGT]{1,4},[ACGT]{1,4}$)(?!.*(.).*\g{1}.*,.*(.).*\g{2}).*$/

Now, (?=[ACGT]{1,4},[ACGT]{1,4}$) seems to match the "two sub-strings separated by a single comma" and "consists exclusively of the characters 'A', 'G', 'C' and 'T'" through out the string;

(?!.*(.).*\1.*,)seems to match "without repetition" up to the comma.

However, (?!,.*(.).*\1)appears not to be ensuring that it doesn't match a repeated character after the comma.

I'd greatly appreciate replies with clues and/or patterns that help with the desired matching.

Using perl v5.18.2

Thanks in advance

Robert

Replies are listed 'Best First'.
Re: Regular expression for a comma separated string
by Loops (Curate) on Nov 23, 2014 at 23:43 UTC

    Hi,

    Hopefully this does the trick for you, using negative lookahead to make sure each character doesn't repeat before the comma (or end of string):

    my @tst = qw( A,G AG,CT TC,CA GAT,CGA CGAT,TG ,G ACGT X,A AA,G AC,GGC ATGA,TGG ATCXG,AAC ); for (@tst) { my $side = qr/(?:([ACGT])(?![^,]*\g{-1}))+/; print $_ . (/^$side,$side$/ ? ' good' : ' bad') . $/; }
    Prints:
    A,G good AG,CT good TC,CA good GAT,CGA good CGAT,TG good ,G bad ACGT bad X,A bad AA,G bad AC,GGC bad ATGA,TGG bad ATCXG,AAC bad

    Oops, more test coverage showed an issue. Needed to use relative group instead of "\1" as originally posted which had a problem with "AC,AC". Fixed.

      Loops, thank you kindly for your solution.

      For testing, the following code generated all, up to 4 in length, possible permutations with repetition of ACGT; the inclusion of incorrect characters can be added later by hand.

      test program:

      and then:

      $ ./generate_ACGT_01.pl > y $ cat y | ./test_ACGT_01.pl > x $ less x

      Robert

Re: Regular expression for a comma separated string
by wjw (Priest) on Nov 23, 2014 at 23:45 UTC

    What I generally do with such problems is break them into individual requirements in my code, then later attempt to put it all in one statement if possible/desired(I usually don't desire it).

    So it becomes a check and pass unless fail sequence of statements in my code. It is a lot easier for me to trouble-shoot that way. Those who are great at RegEx's may snicker at this approach, but hey, it works and I find it a lot less frustrating..

    Hope that is helpful...

    ...the majority is always wrong, and always the last to know about it...

    Insanity: Doing the same thing over and over again and expecting different results...

    A solution is nothing more than a clearly stated problem...otherwise, the problem is not a problem, it is simply an inconvenient fact

Re: Regular expression for a comma separated string
by AnomalousMonk (Archbishop) on Nov 24, 2014 at 15:20 UTC

    A two-step approach like that suggested by Anonymonk above may be better, but here's an all-in-one. (Tested under 5.8.9 and 5.14.4.)

    Output:

    Update: The regex definition (tested under 5.14.4)
        my $no_repeat = qr{ (?! ($agct) $agct* \g-1) }xms;
    may be a bit less awkward than the one given above, but it uses the  \gn construct not introduced until Perl version 5.10, which you may not care about, but more importantly IMHO, it also introduces a capturing group into the  qr// object, which sometimes proves a headache in further regex construction.

Re: Regular expression for a comma separated string
by Anonymous Monk on Nov 23, 2014 at 23:43 UTC
    "Without repetition" sounds difficult. Why does it have to be a single monstrous regular expression? Is it just for fun?
      It doesn't have to be. I didn't mention earlier, I did try to break it down but with no success.
      /^(?=[ACGT]{1,4},[ACGT]{1,4}$).*$/ && /^(?!.*(.).*\1.*,).*$/ && /^(?!,.*(.).*\1).*$/
        Well then, I'd do it like that
        my @lst = qw( A,G AG,CT TC,CA GAT,CGA CGAT,TG ,G ACGT X,A AA,G AC,GGC ATGA,TGG ATCXG,AAC ATA,TG GTA,YC); for (@lst) { my $good = do { m/^ [ATCG]+ , [ATCG]+ $/x and not grep m/(.) .* \1/x, split ','; }; print $_, $good ? ' good' : ' bad', "\n"; }
        Output:
        A,G good AG,CT good TC,CA good GAT,CGA good CGAT,TG good ,G bad ACGT bad X,A bad AA,G bad AC,GGC bad ATGA,TGG bad ATCXG,AAC bad ATA,TG bad GTA,YC bad
Re: Regular expression for a comma separated string
by igoryonya (Pilgrim) on Nov 24, 2014 at 07:07 UTC
    I didn't test it, but see if it works:
    ([AGCT])(?!\1)([AGCT])(?!\1|\2)([AGCT])(?!\1|\2|\3)[AGCT],([AGCT])(?!\ +4)([AGCT])(?!\4|\5)([AGCT])(?!\4|\5|\6)[AGCT]
    No, it would only do for 4 char strings only on each side.
    What about:
    my $str = 'some_str'; my @matches = ($str =~ /^([AGCT]{1,4}),([AGCT]{1,4})$/); if(2 == scalar @matches){ for my $match (@matches){ my %letters = (); map { $letters{$_} } ($match =~ /(.)/g); exit 0 if(scalar keys %letters != length $match); } }else{ exit 0; } print $str;
    :) also didn't test, maybe doesn't work.
Re: Regular expression for a comma separated string
by CountZero (Bishop) on Nov 25, 2014 at 10:14 UTC
    Why not let Perl write the regex for you?
    use Modern::Perl '2014'; use Algorithm::Permute; my @elements = qw/ A C T G /; my @permutations; Algorithm::Permute::permute { push @permutations, ( join '(?:', @eleme +nts ) . ')?)?)?'} @elements; my $regex = join '|', @permutations; $regex = qr/^(?:$regex)$/; say "Testing"; for my $length ( 1 .. 4 ) { my $p = new Algorithm::Permute( [ qw/ A C T G A C T G A C T G A C T G X +/, ' ' ], $length ); while ( my @res = $p->next ) { my $test = join( '', @res ); print "$test\t"; say $test=~ /$regex/ ? 'Accept' : 'Reject'; } }
    Sample output:
    Testing A Accept C Accept T Accept G Accept (...) X Reject Reject CA Accept AC Accept TC Accept CT Accept (...) GA Accept AA Reject CC Reject (...) AG Accept XC Reject CX Reject XT Reject (...) CTGA Accept TCGA Accept TGCA Accept TGAC Accept CAGA Reject ACGA Reject (...)
    The final regex for your pattern will then be /^$regex\s*,\s*$regex$/

    CountZero

    A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1108194]
Approved by Loops
Front-paged by CountZero
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others taking refuge in the Monastery: (5)
As of 2024-04-19 07:12 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found