hailholyghost has asked for the wisdom of the Perl Monks concerning the following question:

Hello perlmonks,

I am attempting to pattern match lines like this:

"Klhl21    NM_001033352    chr4    +    152008890    152017677    152008942    152015628    4    152008890,152012299,152014306,152015334,    152009963,152012705,152014379,152017677,"

I don't know how to write a regex with "=~ m/" that copes with an unknown number of commas and gets the second two integer groups into memory. There can be a lot of these integers, so writing the regex out by hand is impractical. There must be a smarter way to match the last two integer groups. So far, what I have is:

if (/^(.*)\s+(.*)\s+chr(.*)\s+([+-])\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/) { # indexed by RefSeqID

This reads all the data except the last two integer groups into %D.

How can I pattern match the last two integer groups, with unknown numbers of alternating "(\d+),(\d+),(\d+)"?

Thanks,
-DEC

Replies are listed 'Best First'.
Re: Alternating Integers and Commas in Regex
by hdb (Monsignor) on Jan 20, 2015 at 14:24 UTC

    You could use a regex like ([\d,]+) and then use split /,/ on the result in a second step.

      Thank you so much, hdb! With ([\d,]+) added to the regex as the tenth capture group, the match $10 can be split like this: @{$D{$2}{exon_starts}} = split(/,/,$10);
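
    For later readers, here is a minimal sketch of how hdb's suggestion might be combined with the regex from the question. It is an illustration under stated assumptions, not the original poster's actual code: the (.*) groups are swapped for (\S+) to avoid backtracking surprises, and the exon_ends key is a hypothetical name chosen to mirror exon_starts.

```perl
use strict;
use warnings;

# Sample line from the question (whitespace-separated fields).
my $line = 'Klhl21    NM_001033352    chr4    +    152008890    152017677    '
         . '152008942    152015628    4    '
         . '152008890,152012299,152014306,152015334,    '
         . '152009963,152012705,152014379,152017677,';

my %D;    # keyed by RefSeq ID, as in the question

# Nine groups as in the question, plus two ([\d,]+) groups for the comma lists.
if ( $line =~ /^(\S+)\s+(\S+)\s+chr(\S+)\s+([+-])\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+([\d,]+)\s+([\d,]+)\s*$/ ) {
    # Copy the captures into lexicals before doing any further pattern work.
    my ( $id, $starts, $ends ) = ( $2, $10, $11 );

    # split discards trailing empty fields, so the final comma is harmless.
    $D{$id}{exon_starts} = [ split /,/, $starts ];
    $D{$id}{exon_ends}   = [ split /,/, $ends ];    # hypothetical key name
}

print "@{ $D{NM_001033352}{exon_starts} }\n";
# prints: 152008890 152012299 152014306 152015334
```

    The regex only has to recognise "a run of digits and commas"; the number of integers in each list never appears in the pattern, which is the point of doing the split as a second step.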
Re: Alternating Integers and Commas in Regex
by johngg (Canon) on Jan 20, 2015 at 14:41 UTC

    Perhaps a two-stage split rather than a regex?

    $ perl -Mstrict -Mwarnings -E '
        my $str = q{Klhl21 NM_001033352 chr4 + 152008890 152017677 152008942 152015628 4 152008890,152012299,152014306,152015334, 152009963,152012705,152014379,152017677,};
        say for map { split m{,} } split m{\s+}, $str;'
    Klhl21
    NM_001033352
    chr4
    +
    152008890
    152017677
    152008942
    152015628
    4
    152008890
    152012299
    152014306
    152015334
    152009963
    152012705
    152014379
    152017677
    $

    I hope this is helpful.

    Cheers,

    JohnGG

Re: Alternating Integers and Commas in Regex
by davido (Cardinal) on Jan 20, 2015 at 17:06 UTC

    I'm sorry, I can't reconcile "the second two integer groups" with "the last two integer groups". Which is it that you want? And by "two integer groups" do you mean "groups of two integers"?

    It seems like you want the final "152014379,152017677", in which case you could just say:

    my ( $last, $two ) = $string =~ m/(\d+),(\d+),?$/;   # ,? allows for the trailing comma in your sample

    Or are you trying to get pairs?

    my @pairs;
    while( $string =~ m/(\d+),(\d+)/g ) {
        push @pairs, [$1,$2];
    }

    Dave

Re: Alternating Integers and Commas in Regex
by locked_user sundialsvc4 (Abbot) on Jan 20, 2015 at 18:17 UTC

    Your approach can also depend in part on how confident you are that the entire file always conforms to the same predictable format, e.g. that a line always contains eleven whitespace-separated fields (as your sample line does), that there are always exactly two groups of interest, that they are always the last two (#10 and #11), and that the comma groups always end with a trailing comma. If you can indeed be confident that the data coming in to the program is always good and clean, then go straight for the simple approach, as previously described.

    On the other hand, I like to write programs like this so they’re a bit skeptical. They check the number of fields: there should be exactly eleven. They apply a regex pattern to fields #10 and #11 to verify that the data really does consist of alternating runs of one-or-more digits and commas, with a trailing comma. And they die if anything is out of the ordinary, so that the program is not only retrieving the wanted data from the file but also sniffing it for funny smells. If the program runs to completion, that is reason to believe the file did not contain “garbage in,” and thus probably no “garbage out.” Unless you take the time to look, you don’t know, and so a little bit of “I’m from Missouri” can really help in debugging the system of which this one program is a part.
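
    A skeptical parser along those lines might look like the sketch below. It is only an illustration: parse_line is a hypothetical helper name, the field count of eleven is taken from the sample line in the question, and treating the ninth field as the exon count is an assumption based on that one sample (it is 4, and each list holds four integers).

```perl
use strict;
use warnings;

# Hypothetical helper: parse one line and die on anything unexpected.
sub parse_line {
    my ($line) = @_;

    # split ' ' is the awk-style split: it ignores leading/trailing whitespace.
    my @f = split ' ', $line;
    die "expected 11 fields, got " . scalar(@f) . "\n" unless @f == 11;

    # Zero-based indices 9 and 10 are the two comma-terminated lists.
    for my $i ( 9, 10 ) {
        die "field $i is not a comma-terminated integer list: $f[$i]\n"
            unless $f[$i] =~ /^(?:\d+,)+$/;
    }

    my @starts = split /,/, $f[9];
    my @ends   = split /,/, $f[10];

    # Assumption: field 8 (zero-based) is the number of entries in each list.
    die "count field '$f[8]' does not match the list lengths\n"
        unless $f[8] =~ /^\d+$/ && @starts == $f[8] && @ends == $f[8];

    return { id => $f[1], exon_starts => \@starts, exon_ends => \@ends };
}

# A well-formed line (the sample from the question) is accepted.
my $rec = parse_line(
    'Klhl21 NM_001033352 chr4 + 152008890 152017677 152008942 152015628 4 '
  . '152008890,152012299,152014306,152015334, 152009963,152012705,152014379,152017677,'
);
print scalar( @{ $rec->{exon_ends} } ), "\n";    # prints: 4

# A line whose first list lacks the trailing comma is rejected.
my $ok = eval { parse_line('Klhl21 NM_001033352 chr4 + 1 2 3 4 2 10,20 30,40,'); 1 };
print $ok ? "accepted\n" : "rejected: $@";
```

    Whether to die, warn, or skip the offending line is a design choice; dying is the strictest option and matches the "sniff for funny smells" attitude described above.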