Re: Splitting only on internal pattern, not at start or end of string
by hdb (Monsignor) on Jan 16, 2014 at 10:19 UTC
|
Instead of splitting you could use a regex like [AGCT]+ to pick the bits you need with optional Ns at the beginning or end of the string (modifying robby_dobby's example from above):
use strict;
use warnings;
while (my $line = <DATA>) {
my @info = $line =~ /((?:^N+)?[ATGC]+(?:N+$)?)/g;
print join(", ", @info), "\n";
}
__DATA__
NNNAAAATATGACAAAGGGGTTNNNNNNNNNNNNNNGATGTCTGGTCAATAGGAT
CGCAGCCATTAACATCTCAACAAGCCAAAAATTCCTTCTCAGAAATTCGGNNN
AAAATATGACAAAGGGGTTNNNNNNNNNNNNNNGATGTCTGGTCAATAGGAT
| [reply] [d/l] [select] |
Re: Splitting only on internal pattern, not at start or end of string
by oiskuu (Hermit) on Jan 16, 2014 at 10:27 UTC
|
my @contigs = split /(?<=[^N])N++\B/, $sequence{$key};
Update: /[^N]\KN++\B/ with a \K will also work, as long as you don't capture the gaps.
| [reply] [d/l] [select] |
Re: Splitting only on internal pattern, not at start or end of string
by johngg (Canon) on Jan 16, 2014 at 13:09 UTC
|
All of the split solutions using look-arounds that have been posted so far have problems coping with Ns at the beginning or end of the string. If you want to use split I think the simplest approach would be to combine it with grep and length, splitting on one or more Ns without any look-arounds.
$ perl -E '
> $seq = q{NNACGTNNNACGTNACGTNN};
> say for grep length, split m{N+}, $seq;'
ACGT
ACGT
ACGT
$
I hope this is of interest.
| [reply] [d/l] |
|
|
Thanks everyone, this is all very helpful, and very much a great learning experience for me. I see now that indeed the first solution will remove characters I want to keep, so I will update my script as necessary.
| [reply] |
|
|
| [reply] |
Re: Splitting only on internal pattern, not at start or end of string
by kcott (Archbishop) on Jan 16, 2014 at 15:53 UTC
|
G'day BiologySwede,
Welcome to the monastery.
There's some issues with what you've posted:
-
You didn't state whether any sequences could contain no Ns.
I've assumed the sequences might not have Ns.
The code I've provided below can be shortened if that's not the case; however, it will work with either case as written.
-
You provided an example of a problematic sequence but didn't say whether:
-
you didn't want the initial zero-length string that your split would produce, or
-
you actually wanted to retain the leading (and/or trailing) Ns.
I've assumed (1).
The following code eliminates the need for an interim %sequence hash, requires no regex for split and reduces your code substantially (all processing occurs in a single statement). Also note that I've added some additional test data.
#!/usr/bin/env perl -l
use strict;
use warnings;
/^[^>]/ && do { y/N/ /; print join "\n" => split } while <DATA>;
__DATA__
>fasta1
NNNAGTCTGCAAANAATTTGCGGCTCACAAT
>fasta2
CGCAGCCATTAACATCTCAACAAGCCAAAAATTCCTTCTCAGAAATTCGGNNN
>mytest1
NNNACGTNNTGCANN
>mytest2
ACGTNNCGTANNNNNGTACNTACG
>mytest3
TGCA
Output:
AGTCTGCAAA
AATTTGCGGCTCACAAT
CGCAGCCATTAACATCTCAACAAGCCAAAAATTCCTTCTCAGAAATTCGG
ACGT
TGCA
ACGT
CGTA
GTAC
TACG
TGCA
Here's some additional tips regarding the code you posted:
-
Hashes have no inherent ordering. "keys %sequence" will probably return a different order to that in the original fasta file. I don't know if that's important to you.
-
Get into the habit of using the 3-argument form of open with a lexical filehandle.
-
It's easy to forget to check for I/O errors (as you did with "open (OUTFILE,">fasta_report.txt");").
Consider using the autodie pragma: it's a lot less work for you and removes the possibility forgetting the I/O checks.
| [reply] [d/l] [select] |
Re: Splitting only on internal pattern, not at start or end of string
by robby_dobby (Hermit) on Jan 16, 2014 at 09:27 UTC
|
Hello,
Since you already know that your fasta string can only contain characters A,T,G,C other than N, the simplest way is to just use that bit of information. :-)
Your regex is fine, but the problem is that it can match 'N' anywhere in the string. Here's how we can use our little tidbit to advantage. Change your regex to: [ATGC]N+[ATGC].
Here's some sample code demonstrating it:
use strict;
use warnings;
while (my $line = <DATA>) {
chomp $line;
# The below regex tells perl to look for
# any of A,T,G,C followed by a string of
# one or more Ns, followed by A,T,G,C.
my @info = split /[ATGC]N+[ATGC]/, $line;
print join(", ", @info), "\n";
}
__DATA__
NNNAAAATATGACAAAGGGGTTNNNNNNNNNNNNNNGATGTCTGGTCAATAGGAT
CGCAGCCATTAACATCTCAACAAGCCAAAAATTCCTTCTCAGAAATTCGGNNN
AAAATATGACAAAGGGGTTNNNNNNNNNNNNNNGATGTCTGGTCAATAGGAT
Update: My split solution has a problem in that it loses one of [ATGC] on either side of the internal pattern.
Please use this solution by hdb or johngg's extractive matching. If you prefer to use lookaround assertions, here's one by oiskuu.
| [reply] [d/l] [select] |
|
|
| [reply] |
|
|
Crap! What was I thinking? Yes, split is not the right solution for this situation.
OP, apologies - please take johngg's solution. A global match is a better solution than mine.
Update: added link to solution I was referring to
| [reply] |
|
|
|
|
|
|
|
|
That is totally awesome, many thanks!
| [reply] |
|
|
Unfortunately, also totally wrong as it will consume an A, C, G or T adjacent to the Ns. Rather than split do a global match for one or more of A, C, G or T.
$ perl -E '
$seq = q{NNACGTNNNACGTNACGTNN};
say for split m{[ACGT]N+[ACGT]}, $seq;
say q{-} x 10;
say for $seq =~ m{[ACGT]+}g;'
NNACG
CG
CGTNN
----------
ACGT
ACGT
ACGT
$
I hope this is helpful.
Update: Corrected wording, s/more than one of/one or more of/
| [reply] [d/l] |
Re: Splitting only on internal pattern, not at start or end of string
by choroba (Cardinal) on Jan 16, 2014 at 15:32 UTC
|
The problem appears when I also have Ns at the beginning or end of the sequence
So, remove the N's at the beginning and at the end, and use your old algorithm:
$sequence =~ s/^N+|N+$//g;
| [reply] [d/l] |
|
|
| [reply] |