
Dear all,

I am working with biological sequence data and need to split sequences at runs of one or more Ns (the other characters in the sequence are A, C, G, and T). The format is called fasta: the name of the sequence is on a line starting with a >, and the sequence itself is on the next line. Here is an example:

>name1
AAAATATGACAAAGGGGTTNNNNNNNNNNNNNNGATGTCTGGTCAATAGGAT

This sequence should be split into AAAATATGACAAAGGGGTT and GATGTCTGGTCAATAGGAT, and this much I can manage (example code below).

The problem appears when I also have Ns at the beginning or end of the sequence. I do not want to split at these positions, only when I have Ns internally. This sequence for example would cause me problems:

>name1
NNNAAAATATGACAAAGGGGTTNNNNNNNNNNNNNNGATGTCTGGTCAATAGGAT
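
To make the problem concrete, here is a minimal sketch of what a plain split on /N+/ does with that sequence. The variable names are only for illustration. The leading run of Ns counts as a separator too, so the first field comes back empty:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Example sequence from above: leading Ns plus one internal run of Ns
my $seq = 'NNNAAAATATGACAAAGGGGTTNNNNNNNNNNNNNNGATGTCTGGTCAATAGGAT';

# A plain split on runs of N also matches the leading run, and Perl then
# returns an empty string as the first field
my @contigs = split /N+/, $seq;

# Print each field in brackets so the empty one is visible
print "[$_]\n" for @contigs;
```

This prints [], [AAAATATGACAAAGGGGTT] and [GATGTCTGGTCAATAGGAT], one per line. A trailing run of Ns does not show up the same way, because split drops trailing empty fields by default.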

Could someone please help me get the regex right for the split function?
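
One direction that might work is a pair of lookarounds, so that a run of Ns only counts as a separator when it has a non-N character on both sides. This is a rough sketch, not a tested solution for real data: the helper name split_internal_ns is made up for illustration, and it assumes that Ns at the start or end should stay attached to the neighbouring piece.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch): split only on runs
# of N that have a non-N character directly before and after them
sub split_internal_ns {
    my ($seq) = @_;
    # (?<=[^N]) requires a non-N before the run, (?=[^N]) a non-N after it,
    # so runs touching the start or end of the string never match
    return split /(?<=[^N])N+(?=[^N])/, $seq;
}

my @contigs = split_internal_ns('NNNAAAATATGACAAAGGGGTTNNNNNNNNNNNNNNGATGTCTGGTCAATAGGAT');
print "$_\n" for @contigs;
```

With the example sequence this gives NNNAAAATATGACAAAGGGGTT and GATGTCTGGTCAATAGGAT. If the flanking Ns should be discarded instead of kept, an alternative would be to strip them first with s/^N+|N+$//g and then use the plain split on /N+/.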

In my code I read a fasta file (with multiple entries) into a hash and then loop over the hash to produce the separate entries. See my code below:

 
#!/usr/bin/perl

use warnings;
use strict;

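my $infile = $ARGV[0];
die "Usage: $0 <fasta-file>\n" unless defined $infile;
my $header;
my %sequence;

# Three-argument open with lexical filehandles; report the OS error on failure
open my $fasta, '<', $infile or die "Couldn't open fasta-file '$infile': $!";
open my $out, '>', 'fasta_report.txt' or die "Couldn't open fasta_report.txt: $!";

# Populate a hash with the fasta-data, keyed by header
# fasta example:
#>fasta1
#NNNAGTCTGCAAANAATTTGCGGCTCACAAT
#>fasta2
#CGCAGCCATTAACATCTCAACAAGCCAAAAATTCCTTCTCAGAAATTCGGNNN

while (<$fasta>) {
  chomp;
  if (/^>(.*)$/) {
    $header = $1;
  }
  elsif (/^(\S+)$/) {
    # Append, so a sequence wrapped over several lines is joined together
    $sequence{$header} .= $1 if defined $header;
  }
}
close $fasta;

# Go through the hash and split at Ns
# (this split is the part I need help with: it also splits on leading Ns)
foreach my $key (keys %sequence) {
  my @contigs = split /N+/, $sequence{$key};
  foreach my $element (@contigs) {
    print {$out} "$element\n";
  }
}
close $out or die "Couldn't close fasta_report.txt: $!";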

I very much appreciate any help you can give me. Thanks!

In reply to Splitting only on internal pattern, not at start or end of string by BiologySwede
