Splitting only on internal pattern, not at start or end of string

BiologySwede has asked for the wisdom of the Perl Monks concerning the following question:

Re: Splitting only on internal pattern, not at start or end of string by hdb (Monsignor) on Jan 16, 2014 at 10:19 UTC
Instead of splitting you could use a regex like `[AGCT]+` to pick the bits you need with optional Ns at the beginning or end of the string (modifying robby_dobby's example from above): `use strict; use warnings; while (my $line = <DATA>) { my @info = $line =~ /((?:^N+)?[ATGC]+(?:N+$)?)/g; print join(", ", @info), "\n"; } __DATA__ NNNAAAATATGACAAAGGGGTTNNNNNNNNNNNNNNGATGTCTGGTCAATAGGAT CGCAGCCATTAACATCTCAACAAGCCAAAAATTCCTTCTCAGAAATTCGGNNN AAAATATGACAAAGGGGTTNNNNNNNNNNNNNNGATGTCTGGTCAATAGGAT`
Re: Splitting only on internal pattern, not at start or end of string by oiskuu (Hermit) on Jan 16, 2014 at 10:27 UTC
The pattern for split isn't terribly complicated either: `my @contigs = split /(?<=[^N])N++\B/, $sequence{$key};` Update: `/[^N]\KN++\B/` with a \K will also work, as long as you don't capture the gaps.
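To make the behaviour of these two patterns concrete, here is a minimal sketch that runs oiskuu's split and hdb's global match on a short made-up sequence. The sub names `contigs_by_split` and `contigs_by_match` are hypothetical, introduced only for illustration; the regexes are the ones quoted in the posts. Possessive quantifiers such as `N++` need Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: split only on internal runs of N.
# The lookbehind demands a non-N before the run, and \B after the
# possessive N++ demands a word character after it, so runs at the
# start or end of the string never match.
sub contigs_by_split {
    my ($seq) = @_;
    return split /(?<=[^N])N++\B/, $seq;
}

# Hypothetical helper: extract runs of bases, keeping any Ns that sit
# at the very start or very end of the string.
sub contigs_by_match {
    my ($seq) = @_;
    return $seq =~ /((?:^N+)?[ATGC]+(?:N+$)?)/g;
}

my $seq = 'NNACGTNNNACGTNACGTNN';    # made-up test sequence
print join(', ', contigs_by_split($seq)), "\n";    # NNACGT, ACGT, ACGTNN
print join(', ', contigs_by_match($seq)), "\n";    # NNACGT, ACGT, ACGTNN
```

Both variants keep the flanking Ns attached to the first and last contig, which is only what you want if those Ns are meant to be retained.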
Re: Splitting only on internal pattern, not at start or end of string by johngg (Canon) on Jan 16, 2014 at 13:09 UTC
All of the split solutions using look-arounds that have been posted so far have problems coping with Ns at the beginning or end of the string. If you want to use split I think the simplest approach would be to combine it with grep and length, splitting on one or more Ns without any look-arounds. `$ perl -E ' > $seq = q{NNACGTNNNACGTNACGTNN}; > say for grep length, split m{N+}, $seq;' ACGT ACGT ACGT $` I hope this is of interest. Cheers, JohnGG
Re^2: Splitting only on internal pattern, not at start or end of string by Anonymous Monk on Jan 16, 2014 at 14:27 UTC
Thanks everyone, this is all very helpful, and very much a great learning experience for me. I see now that indeed the first solution will remove characters I want to keep, so I will update my script as necessary.
Re^2: Splitting only on internal pattern, not at start or end of string by Anonymous Monk on Jan 16, 2014 at 15:13 UTC
One wonders whether BiologySwede intended to keep the leading/trailing N's or not.
Re: Splitting only on internal pattern, not at start or end of string by kcott (Archbishop) on Jan 16, 2014 at 15:53 UTC
G'day BiologySwede, Welcome to the monastery. There are some issues with what you've posted: You didn't state whether any sequences could contain no `N`s. I've assumed the sequences might not have `N`s. The code I've provided below can be shortened if that's not the case; however, it will work with either case as written. You provided an example of a problematic sequence but didn't say whether (1) you didn't want the initial zero-length string that your split would produce, or (2) you actually wanted to retain the leading (and/or trailing) `N`s. I've assumed (1). The following code eliminates the need for an interim `%sequence` hash, requires no regex for split and reduces your code substantially (all processing occurs in a single statement). Also note that I've added some additional test data. `#!/usr/bin/env perl -l use strict; use warnings; /^[^>]/ && do { y/N/ /; print join "\n" => split } while <DATA>; __DATA__ >fasta1 NNNAGTCTGCAAANAATTTGCGGCTCACAAT >fasta2 CGCAGCCATTAACATCTCAACAAGCCAAAAATTCCTTCTCAGAAATTCGGNNN >mytest1 NNNACGTNNTGCANN >mytest2 ACGTNNCGTANNNNNGTACNTACG >mytest3 TGCA` Output: `AGTCTGCAAA AATTTGCGGCTCACAAT CGCAGCCATTAACATCTCAACAAGCCAAAAATTCCTTCTCAGAAATTCGG ACGT TGCA ACGT CGTA GTAC TACG TGCA` Here are some additional tips regarding the code you posted: Hashes have no inherent ordering. "`keys %sequence`" will probably return a different order to that in the original fasta file. I don't know if that's important to you. Get into the habit of using the 3-argument form of open with a lexical filehandle. It's easy to forget to check for I/O errors (as you did with "`open (OUTFILE,">fasta_report.txt");`"). Consider using the autodie pragma: it's a lot less work for you and removes the possibility of forgetting the I/O checks. -- Ken
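The three tips above (hash ordering, 3-argument open with a lexical filehandle, and autodie) can be sketched together as follows. This is only an illustration under stated assumptions: the sub name `read_fasta` and the sample data are made up, and the input is read from an in-memory string rather than a real file so the sketch stays self-contained. An output file would be opened the same way, e.g. `open my $out, '>', 'fasta_report.txt';`.

```perl
use strict;
use warnings;
use autodie;    # open and close now throw an exception on failure

# Hypothetical helper: read FASTA records from a filehandle.
# Returns the IDs in file order plus a hash of ID => sequence, because
# "keys %hash" alone would not preserve the order of the file.
sub read_fasta {
    my ($fh) = @_;
    my (@ids, %seq);
    while (my $line = <$fh>) {
        chomp $line;
        next unless length $line;
        if ($line =~ /^>(\S+)/) {
            push @ids, $1;                  # remember order of appearance
            $seq{$1} = '';
        }
        elsif (@ids) {
            $seq{ $ids[-1] } .= $line;      # append to the current record
        }
    }
    return (\@ids, \%seq);
}

# Made-up sample data, read through an in-memory filehandle.
my $fasta = ">fasta1\nNNNAGTCTGCAAAN\nAATTTGCGG\n>fasta2\nCGCAGCCNNN\n";
open my $in, '<', \$fasta;    # 3-argument open, lexical filehandle
my ($ids, $seq) = read_fasta($in);
close $in;

print "$_: $seq->{$_}\n" for @$ids;    # records come out in file order
```

With autodie in effect there is no need to write `or die ...` after each open; a failed open raises an exception by itself.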
Re: Splitting only on internal pattern, not at start or end of string by robby_dobby (Hermit) on Jan 16, 2014 at 09:27 UTC
Hello, Since you already know that your fasta string can only contain characters A,T,G,C other than N, the simplest way is to just use that bit of information. :-) Your regex is fine, but the problem is that it can match 'N' anywhere in the string. Here's how we can use our little tidbit to advantage. Change your regex to: `[ATGC]N+[ATGC]`. Here's some sample code demonstrating it: `use strict; use warnings; while (my $line = <DATA>) { chomp $line; # The below regex tells perl to look for # any of A,T,G,C followed by a string of # one or more Ns, followed by A,T,G,C. my @info = split /[ATGC]N+[ATGC]/, $line; print join(", ", @info), "\n"; } __DATA__ NNNAAAATATGACAAAGGGGTTNNNNNNNNNNNNNNGATGTCTGGTCAATAGGAT CGCAGCCATTAACATCTCAACAAGCCAAAAATTCCTTCTCAGAAATTCGGNNN AAAATATGACAAAGGGGTTNNNNNNNNNNNNNNGATGTCTGGTCAATAGGAT` Update: My split solution has a problem in that it loses one of `[ATGC]` on either side of the internal pattern. Please use this solution by hdb or johngg's extractive matching. If you prefer to use lookaround assertions, here's one by oiskuu.
Re^2: Splitting only on internal pattern, not at start or end of string by hdb (Monsignor) on Jan 16, 2014 at 10:10 UTC
Did you realize that you lose one letter each side of the Ns from your sequence?
Re^3: Splitting only on internal pattern, not at start or end of string by robby_dobby (Hermit) on Jan 16, 2014 at 10:18 UTC
Crap! What was I thinking? Yes, split is not the right solution for this situation. OP, apologies - please take johngg's solution. A global match is a better solution than mine. Update: added link to solution I was referring to
Re^4: Splitting only on internal pattern, not at start or end of string by hdb (Monsignor) on Jan 16, 2014 at 10:24 UTC
Re^5: Splitting only on internal pattern, not at start or end of string by AnomalousMonk (Archbishop) on Jan 16, 2014 at 23:07 UTC
Re^5: Splitting only on internal pattern, not at start or end of string by robby_dobby (Hermit) on Jan 16, 2014 at 10:25 UTC
Re^2: Splitting only on internal pattern, not at start or end of string by BiologySwede (Initiate) on Jan 16, 2014 at 09:51 UTC
That is totally awesome, many thanks!
Re^3: Splitting only on internal pattern, not at start or end of string by johngg (Canon) on Jan 16, 2014 at 10:08 UTC
Unfortunately, also totally wrong as it will consume an A, C, G or T adjacent to the Ns. Rather than split do a global match for one or more of A, C, G or T. `$ perl -E ' $seq = q{NNACGTNNNACGTNACGTNN}; say for split m{[ACGT]N+[ACGT]}, $seq; say q{-} x 10; say for $seq =~ m{[ACGT]+}g;' NNACG CG CGTNN ---------- ACGT ACGT ACGT $` I hope this is helpful. Update: Corrected wording, s/more than one of/one or more of/ Cheers, JohnGG
Re: Splitting only on internal pattern, not at start or end of string by choroba (Cardinal) on Jan 16, 2014 at 15:32 UTC
The OP wrote: "The problem appears when I also have Ns at the beginning or end of the sequence". So, remove the N's at the beginning and at the end, and use your old algorithm: `$sequence =~ s/^N+|N+$//g;`
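As a rough sketch of this trim-then-split idea, assuming the "old algorithm" is a plain split on runs of N (the sub name `internal_contigs` is hypothetical and the test sequence is made up):

```perl
use strict;
use warnings;

# Hypothetical helper: drop flanking Ns, then split on internal runs of N.
sub internal_contigs {
    my ($seq) = @_;
    $seq =~ s/^N+|N+$//g;       # strip leading and trailing runs of N
    return split /N+/, $seq;    # no empty leading field can remain
}

print "$_\n" for internal_contigs('NNACGTNNNACGTNACGTNN');
# prints ACGT three times, one per line
```

Unlike the look-around patterns, this variant discards the flanking Ns, so it fits only if they are not meant to be kept.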
Re^2: Splitting only on internal pattern, not at start or end of string by hdb (Monsignor) on Jan 16, 2014 at 15:37 UTC
This is the one issue in this thread: the OP has not specified if he wants to keep leading and trailing Ns or whether he wants to have them removed. Some answers assume the former, others the latter....