BiologySwede has asked for the wisdom of the Perl Monks concerning the following question:

Dear all,

I am working with biological sequence data and I have a need to split sequences up at regions of one or more Ns (the other characters in the sequence are A,C,G, and T). The format is called fasta, with a name of the sequence starting with a > and the sequence on the next row. Here is an example:

>name1
AAAATATGACAAAGGGGTTNNNNNNNNNNNNNNGATGTCTGGTCAATAGGAT

This sequence would correctly be split into AAAATATGACAAAGGGGTT and GATGTCTGGTCAATAGGAT, and this I can manage (example code below).

The problem appears when I also have Ns at the beginning or end of the sequence. I do not want to split at these positions, only when I have Ns internally. This sequence for example would cause me problems:

>name1
NNNAAAATATGACAAAGGGGTTNNNNNNNNNNNNNNGATGTCTGGTCAATAGGAT

Could someone please help me get the regex right for the split function?

In my code I read a fasta-file (with multiple entries) in to a hash and then loop over the hash to produce the separate entries. See my code below

#!/usr/bin/perl use warnings; use strict; my $infile=$ARGV[0]; my $header; my %sequence=(); open FASTA, $infile or die "Couldn't open fasta-file"; open (OUTFILE,">fasta_report.txt"); # Populate a hash with the fasta-data # fasta example: #>fasta1 #NNNAGTCTGCAAANAATTTGCGGCTCACAAT #>fasta2 #CGCAGCCATTAACATCTCAACAAGCCAAAAATTCCTTCTCAGAAATTCGGNNN while (<FASTA>) { chomp; if (/^>(.*)$/){ $header=$1; } elsif (/^(\S+)$/){ $sequence{$header} .= $1 if $header; } } close FASTA; #Go through the hash and split at Ns foreach my $key (keys %sequence){ my @contigs = split (/N+/, $sequence{$key}); foreach my $element (@contigs){ print OUTFILE "$element\n"; } } close OUTFILE;

I very much appreciate any help you can give me. Thanks!

Replies are listed 'Best First'.
Re: Splitting only on internal pattern, not at start or end of string
by hdb (Monsignor) on Jan 16, 2014 at 10:19 UTC

    Instead of splitting you could use a regex like [AGCT]+ to pick the bits you need with optional Ns at the beginning or end of the string (modifying robby_dobby's example from above):

    use strict; use warnings; while (my $line = <DATA>) { my @info = $line =~ /((?:^N+)?[ATGC]+(?:N+$)?)/g; print join(", ", @info), "\n"; } __DATA__ NNNAAAATATGACAAAGGGGTTNNNNNNNNNNNNNNGATGTCTGGTCAATAGGAT CGCAGCCATTAACATCTCAACAAGCCAAAAATTCCTTCTCAGAAATTCGGNNN AAAATATGACAAAGGGGTTNNNNNNNNNNNNNNGATGTCTGGTCAATAGGAT
Re: Splitting only on internal pattern, not at start or end of string
by oiskuu (Hermit) on Jan 16, 2014 at 10:27 UTC

    The pattern for split isn't terribly complicated either:

    my @contigs = split /(?<=[^N])N++\B/, $sequence{$key};
    Update: /[^N]\KN++\B/ with a \K will also work, as long as you don't capture the gaps.

Re: Splitting only on internal pattern, not at start or end of string
by johngg (Canon) on Jan 16, 2014 at 13:09 UTC

    All of the split solutions using look-arounds that have been posted so far have problems coping with Ns at the beginning or end of the string. If you want to use split I think the simplest approach would be to combine it with grep and length, splitting on one or more Ns without any look-arounds.

    $ perl -E ' > $seq = q{NNACGTNNNACGTNACGTNN}; > say for grep length, split m{N+}, $seq;' ACGT ACGT ACGT $

    I hope this is of interest.

    Cheers,

    JohnGG

      Thanks everyone, this is all very helpful, and very much a great learning experience for me. I see now that indeed the first solution will remove characters I want to keep, so I will update my script as necessary.

      One wonders if BiologySwede had not intended to not keep the leading/trailing N's, or not?

Re: Splitting only on internal pattern, not at start or end of string
by kcott (Archbishop) on Jan 16, 2014 at 15:53 UTC

    G'day BiologySwede,

    Welcome to the monastery.

    There's some issues with what you've posted:

    • You didn't state whether any sequences could contain no Ns. I've assumed the sequences might not have Ns. The code I've provided below can be shortened if that's not the case; however, it will work with either case as written.
    • You provided an example of a problematic sequence but didn't say whether:
      1. you didn't want the initial zero-length string that your split would produce, or
      2. you actually wanted to retain the leading (and/or trailing) Ns.
      I've assumed (1).

    The following code eliminates the need for an interim %sequence hash, requires no regex for split and reduces your code substantially (all processing occurs in a single statement). Also note that I've added some additional test data.

    #!/usr/bin/env perl -l use strict; use warnings; /^[^>]/ && do { y/N/ /; print join "\n" => split } while <DATA>; __DATA__ >fasta1 NNNAGTCTGCAAANAATTTGCGGCTCACAAT >fasta2 CGCAGCCATTAACATCTCAACAAGCCAAAAATTCCTTCTCAGAAATTCGGNNN >mytest1 NNNACGTNNTGCANN >mytest2 ACGTNNCGTANNNNNGTACNTACG >mytest3 TGCA

    Output:

    AGTCTGCAAA AATTTGCGGCTCACAAT CGCAGCCATTAACATCTCAACAAGCCAAAAATTCCTTCTCAGAAATTCGG ACGT TGCA ACGT CGTA GTAC TACG TGCA

    Here's some additional tips regarding the code you posted:

    • Hashes have no inherent ordering. "keys %sequence" will probably return a different order to that in the original fasta file. I don't know if that's important to you.
    • Get into the habit of using the 3-argument form of open with a lexical filehandle.
    • It's easy to forget to check for I/O errors (as you did with "open (OUTFILE,">fasta_report.txt");"). Consider using the autodie pragma: it's a lot less work for you and removes the possibility forgetting the I/O checks.

    -- Ken

Re: Splitting only on internal pattern, not at start or end of string
by robby_dobby (Hermit) on Jan 16, 2014 at 09:27 UTC

    Hello,

    Since you already know that your fasta string can only contain characters A,T,G,C other than N, the simplest way is to just use that bit of information. :-)

    Your regex is fine, but the problem is that it can match 'N' anywhere in the string. Here's how we can use our little tidbit to advantage. Change your regex to: [ATGC]N+[ATGC].

    Here's some sample code demonstrating it:

    use strict; use warnings; while (my $line = <DATA>) { chomp $line; # The below regex tells perl to look for # any of A,T,G,C followed by a string of # one or more Ns, followed by A,T,G,C. my @info = split /[ATGC]N+[ATGC]/, $line; print join(", ", @info), "\n"; } __DATA__ NNNAAAATATGACAAAGGGGTTNNNNNNNNNNNNNNGATGTCTGGTCAATAGGAT CGCAGCCATTAACATCTCAACAAGCCAAAAATTCCTTCTCAGAAATTCGGNNN AAAATATGACAAAGGGGTTNNNNNNNNNNNNNNGATGTCTGGTCAATAGGAT

    Update: My split solution has a problem in that it loses one of [ATGC] on either side of the internal pattern. Please use this solution by hdb or johngg's extractive matching. If you prefer to use lookaround assertions, here's one by oiskuu.

      Did you realize that you lose one letter each side of the Ns from your sequence?

        Crap! What was I thinking? Yes, split is not the right solution for this situation.

        OP, apologies - please take johngg's solution. A global match is a better solution than mine.

        Update: added link to solution I was referring to
      That is totally awesome, many thanks!

        Unfortunately, also totally wrong as it will consume an A, C, G or T adjacent to the Ns. Rather than split do a global match for one or more of A, C, G or T.

        $ perl -E ' $seq = q{NNACGTNNNACGTNACGTNN}; say for split m{[ACGT]N+[ACGT]}, $seq; say q{-} x 10; say for $seq =~ m{[ACGT]+}g;' NNACG CG CGTNN ---------- ACGT ACGT ACGT $

        I hope this is helpful.

        Update: Corrected wording, s/more than one of/one or more of/

        Cheers,

        JohnGG

Re: Splitting only on internal pattern, not at start or end of string
by choroba (Cardinal) on Jan 16, 2014 at 15:32 UTC
    The problem appears when I also have Ns at the beginning or end of the sequence
    So, remove the N's at the beginning and at the end, and use your old algorithm:
    $sequence =~ s/^N+|N+$//g;
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

      This is the one issue in this thread: the OP has not specified if he wants to keep leading and trailing Ns or whether he wants to have them removed. Some answers assume the former, others the latter....