in reply to Re^2: Splitting only on internal pattern, not at start or end of string
in thread Splitting only on internal pattern, not at start or end of string

Crap! What was I thinking? Yes, split is not the right solution for this situation.

OP, apologies - please take johngg's solution. A global match is a better solution than mine.

Update: added link to solution I was referring to

Re^4: Splitting only on internal pattern, not at start or end of string
by hdb (Monsignor) on Jan 16, 2014 at 10:24 UTC

    If you want to use split, you need to apply lookbehind and lookahead assertions in the regex, to keep the letters A, G, C and T out of the match:

    my @info = split /(?<=[ATGC])N+(?=[ATGC])/, $line;
    (Waiting for AnomalousMonk's expert answer...)
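To see what the lookaround split does at the edges of a string, here is a minimal sketch. The sample value of `$line` is made up for illustration and is not from the original question; it has a leading run, an internal run, and a trailing run of N.

```perl
use strict;
use warnings;

# Hypothetical sample line (an assumption, not data from the thread):
# leading 'NNN', internal 'NNNN', trailing 'NN'.
my $line = 'NNNACGTNNNNGGCCNN';

# Split only where a run of N has a base immediately before and after it.
# The lookbehind and lookahead are zero-width, so the bases themselves
# are not consumed by the separator.
my @info = split /(?<=[ATGC])N+(?=[ATGC])/, $line;

print join('|', @info), "\n";   # prints: NNNACGT|GGCCNN
```

Only the internal run acts as a separator. The leading and trailing runs have no base on one side, so the assertions fail there and those N characters stay attached to the first and last pieces.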
      Waiting for AnomalousMonk's expert answer...

      "ex" == "formerly"     "spurt" == "a drip under pressure"
      "expert" == "ex" + "spurt"
      "expert" == "formerly a drip under pressure"

      I was thinking of something along the lines of johngg's extractive approach:
          @ra = $string =~ m{ [^Nn]+ }xmsg;
      I shied away from [ACGT]+ because the presence of 'N' suggests that sequence characters other than these may also be present (codon sequences? protein sequences? I'm not a bio-guy). However, the problem with [^Nn]+ is that it assumes the input sequences are correct: any junk other than 'N' or 'n' that happens to be present will also be extracted. Also, I share the confusion of others about what should happen to leading and trailing "NNN..." sub-sequences.
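The trade-off between the two character classes can be seen in a small sketch. The input string, the stray 'X' standing in for junk, and the array names `@permissive` and `@strict` are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical input (an assumption): the 'X' plays the role of junk data.
my $string = 'NNACGTnnGGXCCNN';

# Extract every run of characters that are not N or n.
my @permissive = $string =~ m{ [^Nn]+ }xmsg;

# Extract only runs made of the four bases.
my @strict = $string =~ m{ [ACGT]+ }xmsg;

print "permissive: @permissive\n";   # prints: permissive: ACGT GGXCC
print "strict:     @strict\n";       # prints: strict:     ACGT GG CC
```

The permissive class passes the junk through inside a piece, while the strict class silently breaks the piece in two at the junk character. Both versions discard leading and trailing runs of N, which may or may not be what is wanted.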

      But that's exactly what I wanted to avoid. It's overkill for this kind of thing. :-)