in reply to Re: Splitting only on internal pattern, not at start or end of string

Did you realize that you lose one letter on each side of the Ns in your sequence?


Re^3: Splitting only on internal pattern, not at start or end of string
by robby_dobby (Hermit) on Jan 16, 2014 at 10:18 UTC

    Crap! What was I thinking? Yes, split is not the right solution for this situation.

    OP, apologies - please take johngg's solution. A global match is a better solution than mine.

    Update: added link to solution I was referring to

      If you want to use split, you need to apply lookbehind and lookahead assertions in the regex, to keep the letters A, G, C and T out of the match:

      my @info = split /(?<=[ATGC])N+(?=[ATGC])/, $line;
      (Waiting for AnomalousMonk's expert answer...)
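
For what it's worth, here is a minimal sketch of how that split behaves on a whole line. The input string is made up for illustration and does not come from the OP's data; it has runs of N at the start, in the middle and at the end.

```perl
use strict;
use warnings;

# Hypothetical input line, invented for this example.
my $line = 'NNACGTNNNTTGANCCN';

# Split only on runs of N that sit between two bases. The lookbehind
# and lookahead are zero-width, so the flanking letters are not
# consumed and stay in the returned fields.
my @info = split /(?<=[ATGC])N+(?=[ATGC])/, $line;

print join('|', @info), "\n";   # prints: NNACGT|TTGA|CCN
```

Note that the leading and trailing Ns are never split on, because one of the two assertions fails there; they simply stay attached to the first and last fields.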
        Waiting for AnomalousMonk's expert answer...

        "ex" == "formerly"     "spurt" == "a drip under pressure"
        "expert" == "ex" + "spurt"
        "expert" == "formerly a drip under pressure"

        I was thinking of something along the lines of johngg's extractive approach:
            my @ra = $string =~ m{ [^Nn]+ }xmsg;
        I shied away from [ACGT]+ because the presence of 'N' suggests that sequence characters other than these four may also be present (codon sequences? protein sequences? I'm not a bio-guy). However, the problem with [^Nn]+ is that it assumes the input sequences are correct: any junk other than 'N' or 'n' that happens to be present will also be extracted. Also, I share the confusion of others about what should happen to leading and trailing "NNN..." sub-sequences.
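
To make that trade-off concrete, here is a small sketch comparing the two character classes. The sample string is invented, and the '-' in it merely stands in for whatever junk might turn up in real input.

```perl
use strict;
use warnings;

# Hypothetical input; the '-' plays the role of unexpected junk.
my $string = 'NNACGTnnnTTGA-NCCN';

# Permissive: grab every run of characters that is not N or n.
# The junk character is extracted along with the bases.
my @loose = $string =~ m{ [^Nn]+ }xmsg;

# Strict: grab only runs of A, C, G and T. Junk is silently skipped,
# but so would any other legitimate sequence letter be.
my @strict = $string =~ m{ [ACGT]+ }xmsg;

print join('|', @loose),  "\n";   # prints: ACGT|TTGA-|CC
print join('|', @strict), "\n";   # prints: ACGT|TTGA|CC
```

Either way, leading and trailing runs of N are simply dropped, which may or may not be what the OP wants.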

        But that's exactly what I wanted to avoid. It's overkill for this kind of thing. :-)