Re^2: Splitting only on internal pattern, not at start or end of string

Re^3: Splitting only on internal pattern, not at start or end of string by robby_dobby (Hermit) on Jan 16, 2014 at 10:18 UTC
Crap! What was I thinking? Yes, split is not the right solution for this situation. OP, apologies - please take johngg's solution. A global match is a better solution than mine. Update: added link to solution I was referring to
Re^4: Splitting only on internal pattern, not at start or end of string by hdb (Monsignor) on Jan 16, 2014 at 10:24 UTC
If you want to use split, you need lookbehind and lookahead assertions in the regex to keep the letters A, G, C and T out of the match: `my @info = split /(?<=[ATGC])N+(?=[ATGC])/, $line;` (Waiting for AnomalousMonk's expert answer...)
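For anyone who wants to try the lookaround split, here is a minimal sketch. The sub name `split_internal_n` and the sample string are made up for illustration and are not from the thread. Because both assertions require a base next to the run of N, leading and trailing N runs never match, so they stay attached to the first and last fields.

```perl
use strict;
use warnings;

# Hypothetical helper: split only on internal runs of N.
# The lookbehind and lookahead require a base (A, T, G or C) on each side,
# so N runs at the start or end of the string are never used as separators.
sub split_internal_n {
    my ($line) = @_;
    return split /(?<=[ATGC])N+(?=[ATGC])/, $line;
}

# Made-up sample sequence with leading, internal and trailing N runs.
my @info = split_internal_n('NNACGTNNNGGCCNTTANN');
print join(',', @info), "\n";    # NNACGT,GGCC,TTANN
```

Whether keeping the outer N runs on the first and last fields is what the OP wants is a separate question; this only shows what the regex does.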
Re^5: Splitting only on internal pattern, not at start or end of string by AnomalousMonk (Archbishop) on Jan 16, 2014 at 23:07 UTC
"Waiting for AnomalousMonk's expert answer..."
"ex" == "formerly"
"spurt" == "a drip under pressure"
"expert" == "ex" + "spurt"
"expert" == "formerly a drip under pressure"
I was thinking of something along the lines of johngg's extractive approach: `@ra = $string =~ m{ [^Nn]+ }xmsg` I shied away from `[ACGT]+` because the presence of 'N' suggests that sequence characters other than these may be present (codon sequences? protein sequences? I'm not a bio-guy). However, the problem with `[^Nn]+` is that it assumes the input sequences are correct: any junk other than 'N' or 'n' that happens to be present will also be extracted. Also, I share the confusion of others about what should happen to leading and trailing "NNN..." sub-sequences.
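To make the trade-off between the two character classes concrete, here is a small sketch. The sub names `extract_non_n` and `extract_acgt` and the sample string (with a deliberate `?` as junk) are invented for illustration. Note that with the extractive approach the leading and trailing N runs are simply dropped.

```perl
use strict;
use warnings;

# Hypothetical helper: extract every run of characters that are not N or n.
# Anything else, including junk, is kept as part of a run.
sub extract_non_n {
    my ($string) = @_;
    my @runs = $string =~ m{ [^Nn]+ }xmsg;
    return @runs;
}

# Hypothetical stricter variant: keep only runs of A, C, G and T.
# This assumes a DNA-only alphabet, which the thread does not confirm.
sub extract_acgt {
    my ($string) = @_;
    my @runs = $string =~ m{ [ACGT]+ }xmsg;
    return @runs;
}

# Made-up sample: '?' stands in for junk, and one separator is a lowercase n.
my $string = 'NNACGTNNNGG?CCnTTANN';
print join(',', extract_non_n($string)), "\n";    # ACGT,GG?CC,TTA
print join(',', extract_acgt($string)),  "\n";    # ACGT,GG,CC,TTA
```

The first form passes the `?` through inside a fragment; the second silently breaks the fragment in two at the junk character. Neither is obviously right without knowing what the input alphabet is supposed to be.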
Re^5: Splitting only on internal pattern, not at start or end of string by robby_dobby (Hermit) on Jan 16, 2014 at 10:25 UTC
But that's exactly what I wanted to avoid. It's overkill for this kind of thing. :-)