Regular expression for a comma separated string

naderra has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I would like to define a regular expression that matches strings consisting of two sub-strings separated by a single comma. Each sub-string may not be empty, and consists exclusively and without repetition of the characters 'A', 'G', 'C' and 'T'. Thus, the pattern should match strings such as:

A,G
AG,CT
TC,CA <- correct, 1st and 2nd sub-strings may have characters in common
( as long as these are not repeated within the sub-string ).
GAT,CGA
CGAT,TG <- correct, sub-strings may be of different length.
etc ...

and should not match:

,G <- missing 1st sub-string
ACGT <- missing comma
X,A <- incorrect character X
AA,G <- repetition of character A in 1st substring
AC,GGC <- repetition of character G in 2nd sub-string
ATGA,TGG <- repetition in both sub-strings
ATCXG,AAC <- incorrect character X and repetition in 2nd
sub-string
etc ...

So far I have:

/^(?=[ACGT]{1,4},[ACGT]{1,4}$)(?!.*(.).*\1.*,)(?!,.*(.).*\1).*$/

/^(?=[ACGT]{1,4},[ACGT]{1,4}$)(?!.*(.).*\g{1}.*,)(?!,.*(.).*\g{1}).*$/

also tried joining the capture groups with:

/^(?=[ACGT]{1,4},[ACGT]{1,4}$)(?!.*(.).*\g{1}.*,.*(.).*\g{2}).*$/

Now, (?=[ACGT]{1,4},[ACGT]{1,4}$) seems to match the "two sub-strings separated by a single comma" and "consists exclusively of the characters 'A', 'G', 'C' and 'T'" through out the string;

(?!.*(.).*\1.*,)seems to match "without repetition" up to the comma.

However, (?!,.*(.).*\1)appears not to be ensuring that it doesn't match a repeated character after the comma.

I'd greatly appreciate replies with clues and/or patterns that help with the desired matching.

Using perl v5.18.2

Thanks in advance

Robert

Comment on Regular expression for a comma separated string Select or Download Code

Replies are listed 'Best First'.
Re: Regular expression for a comma separated string by Loops (Curate) on Nov 23, 2014 at 23:43 UTC
Hi, Hopefully this does the trick for you, using negative lookahead to make sure each character doesn't repeat before the comma (or end of string): `my @tst = qw( A,G AG,CT TC,CA GAT,CGA CGAT,TG ,G ACGT X,A AA,G AC,GGC ATGA,TGG ATCXG,AAC ); for (@tst) { my $side = qr/(?:([ACGT])(?![^,]*\g{-1}))+/; print $_ . (/^$side,$side$/ ? ' good' : ' bad') . $/; }` [download] Prints: `A,G good AG,CT good TC,CA good GAT,CGA good CGAT,TG good ,G bad ACGT bad X,A bad AA,G bad AC,GGC bad ATGA,TGG bad ATCXG,AAC bad` [download] Oops, more test coverage showed an issue. Needed to use relative group instead of "\1" as originally posted which had a problem with "AC,AC". Fixed.	[reply] [d/l] [select]
Re^2: Regular expression for a comma separated string by naderra (Novice) on Nov 25, 2014 at 14:25 UTC
Loops, thank you kindly for your solution. For testing, the following code generated all, up to 4 in length, possible permutations with repetition of ACGT; the inclusion of incorrect characters can be added later by hand. Read more... Generate test data (1178 Bytes) test program: Read more... Test program (682 Bytes) and then: `$ ./generate_ACGT_01.pl > y $ cat y \| ./test_ACGT_01.pl > x $ less x` [download] Robert	[reply] [d/l] [select]
Re: Regular expression for a comma separated string by wjw (Priest) on Nov 23, 2014 at 23:45 UTC
What I generally do with such problems is break them into individual requirements in my code, then later attempt to put it all in one statement if possible/desired(I usually don't desire it). So it becomes a check and pass unless fail sequence of statements in my code. It is a lot easier for me to trouble-shoot that way. Those who are great at RegEx's may snicker at this approach, but hey, it works and I find it a lot less frustrating.. Hope that is helpful... ...the majority is always wrong, and always the last to know about it... Insanity: Doing the same thing over and over again and expecting different results... A solution is nothing more than a clearly stated problem...otherwise, the problem is not a problem, it is simply an inconvenient fact wjw	[reply]
Re: Regular expression for a comma separated string by AnomalousMonk (Archbishop) on Nov 24, 2014 at 15:20 UTC
A two-step approach like that suggested by `Anonymonk` above may be better, but here's an all-in-one. (Tested under 5.8.9 and 5.14.4.) Read more... (3 kB) Output: Read more... (1177 Bytes) Update: The regex definition (tested under 5.14.4) `my $no_repeat = qr{ (?! ($agct) $agct* \g-1) }xms;` may be a bit less awkward than the one given above, but it uses the `\g`n construct not introduced until Perl version 5.10, which you may not care about, but more importantly IMHO, it also introduces a capturing group into the `qr//` object, which sometimes proves a headache in further regex construction.	[reply] [d/l] [select]
Re: Regular expression for a comma separated string by Anonymous Monk on Nov 23, 2014 at 23:43 UTC
"Without repetition" sounds difficult. Why does it have to be a single monstrous regular expression? Is it just for fun?	[reply]
Re^2: Regular expression for a comma separated string by naderra (Novice) on Nov 24, 2014 at 00:10 UTC
It doesn't have to be. I didn't mention earlier, I did try to break it down but with no success. `/^(?=[ACGT]{1,4},[ACGT]{1,4}$).$/ && /^(?!.(.).\1.,).$/ && /^(?!,.(.).\1).$/` [download]	[reply] [d/l]
Re^3: Regular expression for a comma separated string by Anonymous Monk on Nov 24, 2014 at 00:43 UTC
Well then, I'd do it like that `my @lst = qw( A,G AG,CT TC,CA GAT,CGA CGAT,TG ,G ACGT X,A AA,G AC,GGC ATGA,TGG ATCXG,AAC ATA,TG GTA,YC); for (@lst) { my $good = do { m/^ [ATCG]+ , [ATCG]+ $/x and not grep m/(.) .* \1/x, split ','; }; print $_, $good ? ' good' : ' bad', "\n"; }` [download] Output: `A,G good AG,CT good TC,CA good GAT,CGA good CGAT,TG good ,G bad ACGT bad X,A bad AA,G bad AC,GGC bad ATGA,TGG bad ATCXG,AAC bad ATA,TG bad GTA,YC bad` [download]	[reply] [d/l] [select]
Re: Regular expression for a comma separated string by igoryonya (Pilgrim) on Nov 24, 2014 at 07:07 UTC
I didn't test it, but see if it works: `([AGCT])(?!\1)([AGCT])(?!\1\|\2)([AGCT])(?!\1\|\2\|\3)[AGCT],([AGCT])(?!\ +4)([AGCT])(?!\4\|\5)([AGCT])(?!\4\|\5\|\6)[AGCT]` [download] No, it would only do for 4 char strings only on each side. What about: `my $str = 'some_str'; my @matches = ($str =~ /^([AGCT]{1,4}),([AGCT]{1,4})$/); if(2 == scalar @matches){ for my $match (@matches){ my %letters = (); map { $letters{$_} } ($match =~ /(.)/g); exit 0 if(scalar keys %letters != length $match); } }else{ exit 0; } print $str;` [download] :) also didn't test, maybe doesn't work.	[reply] [d/l] [select]
Re: Regular expression for a comma separated string by CountZero (Bishop) on Nov 25, 2014 at 10:14 UTC
Why not let Perl write the regex for you? use Modern::Perl '2014'; use Algorithm::Permute; my @elements = qw/ A C T G /; my @permutations; Algorithm::Permute::permute { push @permutations, ( join '(?:', @eleme +nts ) . ')?)?)?'} @elements; my $regex = join '\|', @permutations; $regex = qr/^(?:$regex)$/; say "Testing"; for my $length ( 1 .. 4 ) { my $p = new Algorithm::Permute( [ qw/ A C T G A C T G A C T G A C T G X +/, ' ' ], $length ); while ( my @res = $p->next ) { my $test = join( '', @res ); print "$test\t"; say $test=~ /$regex/ ? 'Accept' : 'Reject'; } } [download] Sample output: `Testing A Accept C Accept T Accept G Accept (...) X Reject Reject CA Accept AC Accept TC Accept CT Accept (...) GA Accept AA Reject CC Reject (...) AG Accept XC Reject CX Reject XT Reject (...) CTGA Accept TCGA Accept TGCA Accept TGAC Accept CAGA Reject ACGA Reject (...)` [download] The final regex for your pattern will then be `/^$regex\s,\s$regex$/` CountZero A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James My blog: Imperial Deltronics	[reply] [d/l] [select]


Welcome to the Monastery
	PerlMonks