Re^4: Splitting only on internal pattern, not at start or end of string

If you want to use split, add lookbehind and lookahead assertions to the regex. They require an A, T, G or C on each side of the run of Ns while keeping those letters out of the match, so split does not consume them:

my @info = split /(?<=[ATGC])N+(?=[ATGC])/, $line;
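To see what that does with leading and trailing Ns, here is a minimal sketch. The sample value of $line is made up for illustration and is not from the original question:

```perl
use strict;
use warnings;

# Hypothetical sample line with leading, internal and trailing runs of N.
my $line = 'NNACGTNNNNGGCCNTTANN';

# A run of N is a separator only if it has A, T, G or C on both sides.
# The lookarounds check the neighbours without making them part of the match.
my @info = split /(?<=[ATGC])N+(?=[ATGC])/, $line;

print join(' | ', @info), "\n";   # prints: NNACGT | GGCC | TTANN
```

The leading and trailing runs stay attached to the first and last fields, because they have no base on one side and so never satisfy both assertions.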

(Waiting for AnomalousMonk's expert answer...)


Replies are listed 'Best First'.
Re^5: Splitting only on internal pattern, not at start or end of string by AnomalousMonk (Archbishop) on Jan 16, 2014 at 23:07 UTC
Waiting for `AnomalousMonk`'s expert answer...

"ex" == "formerly"
"spurt" == "a drip under pressure"
"expert" == "ex" + "spurt"
"expert" == "formerly a drip under pressure"

I was thinking of something along the lines of johngg's extractive approach: `@ra = $string =~ m{ [^Nn]+ }xmsg`

I shied away from `[ACGT]+` because the presence of 'N' suggests that sequence characters other than these may be present (codon sequences? protein sequences? I'm not a bio-guy). However, the problem with `[^Nn]+` is that it assumes the input sequences are correct: any junk other than 'N' or 'n' that happens to be present will also be extracted.

Also, I share the confusion of others about what should happen to leading and trailing "NNN..." sub-sequences.
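A short sketch of that trade-off follows. Both input strings are invented for illustration, and the `[ACGT]+` variant is shown only for comparison, not as a recommendation:

```perl
use strict;
use warnings;

# Hypothetical clean input: only bases plus runs of N/n.
my $string = 'NNACGTNNNNGGCCnTTANN';

# Extract every maximal run of characters that are not N or n.
my @ra = $string =~ m{ [^Nn]+ }xmsg;
print join(' | ', @ra), "\n";       # prints: ACGT | GGCC | TTA

# Hypothetical dirty input: '-' and digits are junk, not sequence data.
my $dirty = 'AC-GTNN12NNGG';

# [^Nn]+ happily extracts the junk along with the bases.
my @junk = $dirty =~ m{ [^Nn]+ }xmsg;
print join(' | ', @junk), "\n";     # prints: AC-GT | 12 | GG

# [ACGT]+ skips the junk, but would also skip any other valid sequence letters.
my @strict = $dirty =~ m{ [ACGT]+ }xmsg;
print join(' | ', @strict), "\n";   # prints: AC | GT | GG
```

Unlike the split version, extraction simply drops leading and trailing runs of N, which may or may not be what is wanted.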
Re^5: Splitting only on internal pattern, not at start or end of string by robby_dobby (Hermit) on Jan 16, 2014 at 10:25 UTC
But that's exactly what I wanted to avoid. It's overkill for this kind of thing. :-)