Re: remove part of string (DNA)

I would recommend using BioPerl for reading sequences. Even a simple format like FASTA has a few caveats, and you make your life easier in the long run if you use those libraries.
This is the canonical way of reading sequences from a FASTA file:

use Bio::SeqIO;
my $seqio_obj = Bio::SeqIO->new(-file => "sequence.fasta", -format => "fasta");

while (my $seq_obj = $seqio_obj->next_seq){
  print "Sequence id: ".$seq_obj->display_id."\n";
  print "Sequence: ".$seq_obj->seq."\n";
}

An important question is whether you need both primers/adapters (forward and reverse) to match (presumably on either side of a sequenced insert). If so, you will also need to reverse-complement one of the primers, or check both primers in both orientations, unless you are always sequencing from the forward primer. To match and remove a string you can use s///. Here is a Perl one-liner that illustrates how you can test whether something matches and remove it at the same time:

perl -e '$s="get_rid_of_some_string";print "remaining string: $s\n" if $s=~s/^get_rid_//;'

Just run the above line in a terminal to see it in action. It will reduce "get_rid_of_some_string" to "of_some_string" if "get_rid_" can be found at (and removed from) the beginning.
To remove anything before the pattern as well (e.g. any sequence before a primer), you can do it like this:

perl -e '$s="get_rid_of_some_string";print "remaining string: $s\n" if $s=~s/^.*rid_//;'

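This will have the same result as above, but the pattern was just "rid_". The regular expression says: starting at the beginning of the string, match any number of characters (possibly none) followed by "rid_", and replace all of that with nothing (i.e. just strip it off). Note that .* is greedy: if "rid_" occurs more than once, everything up to and including the last occurrence is removed; use .*? instead if you want to stop at the first one. Hope this helps to get you started.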
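To tie this back to the question of primers on both sides, here is a minimal sketch of how the pieces could fit together. The primer sequences and the read are made up for illustration, and revcomp and trim_insert are hypothetical helper names, not part of BioPerl. It assumes plain A/C/G/T/N only (no IUPAC ambiguity codes) and exact primer matches.

```perl
use strict;
use warnings;

# Reverse-complement a DNA string (A/C/G/T/N only in this sketch).
sub revcomp {
    my ($dna) = @_;
    my $rc = scalar reverse $dna;
    $rc =~ tr/ACGTNacgtn/TGCANtgcan/;
    return $rc;
}

# Try to cut out the insert between the forward primer and the
# reverse-complemented reverse primer. If that fails, try again on the
# reverse complement of the read (i.e. the other orientation).
# Returns the insert, or undef if both primers are not found.
sub trim_insert {
    my ($read, $fwd, $rev) = @_;
    my $rev_rc = revcomp($rev);
    for my $seq ($read, revcomp($read)) {
        my $copy = $seq;
        # Strip the forward primer and anything before it ...
        next unless $copy =~ s/^.*?\Q$fwd\E//;
        # ... then the other primer and anything after it.
        next unless $copy =~ s/\Q$rev_rc\E.*$//;
        return $copy;
    }
    return undef;
}

# Made-up example data: 2 bases of junk, forward primer, insert,
# reverse-complemented reverse primer, 2 bases of junk.
my $fwd_primer = 'ACGTAC';
my $rev_primer = 'TTGGCC';
my $read       = 'TTACGTACAAAACCCCGGCCAAGG';

my $insert = trim_insert($read, $fwd_primer, $rev_primer);
print "insert: ", (defined $insert ? $insert : 'no match'), "\n";
# prints "insert: AAAACCCC"
```

The \Q...\E around the primers just makes sure they are treated as literal text. For real data you would probably also want to allow for mismatches in the primers, which this sketch does not do.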

Re^2: remove part of string (DNA) by tospo (Hermit) on Apr 12, 2011 at 16:43 UTC
Oh, and I forgot to mention: I wouldn't worry too much about memory and efficiency of the algorithm unless you plan to implement this on a web server or something where it will run all the time. It's more likely to be a one-off and, especially as a beginner, you can easily spend much more time optimising it than you will ever save. If it takes a bit longer, use the time to brew a nice cup of coffee, or go home and find the shiny new results waiting on your disk for you in the morning... :-)
Re^3: remove part of string (DNA) by Furor (Novice) on Apr 13, 2011 at 08:00 UTC
Thanks to all for the replies! I'll check them out under close scrutiny. I'm sure it'll give me a boost. (to be continued ;) )
Re^4: remove part of string (DNA) by mrguy123 (Hermit) on Apr 13, 2011 at 11:10 UTC
Hi Furor, good luck with your research!
As a fellow biologist working with Perl, I highly recommend keeping a few ready-made modules for reading and writing FASTA files. It makes everything so much easier. You can find some great stuff on CPAN, but if you only need simple things it is also good exercise to write it yourself. Here is a short module that reads a FASTA file into a hash, which might come in handy:

package fasta_utils;
use Exporter 'import';
our @EXPORT_OK = qw(fasta2hash);
use strict;
use warnings;
use lib '/cs/prt/mrguy/lib';

## Read the sequences of a FASTA file into a hash (id => sequence)
sub fasta2hash {
    my ($file) = @_;
    print "file = $file\n";
    my $seq_ref = {};
    open my $in, '<', $file or die "Cannot open $file: $!";
    my ($sequence, $id);
    while (my $line = <$in>) {
        # Skip blank lines
        next if $line =~ /^\s*$/;
        if ($line =~ /^>(.*?)$/) {
            # Header line: store the previous record, then start a new one
            if (defined $id) {
                $seq_ref->{$id} = $sequence;
            }
            $id = $1;
            $sequence = "";
        }
        else {
            # Sequence line: strip all whitespace and append
            chomp $line;
            $line =~ s/\s//g;
            $sequence .= $line;
        }
    }
    close $in;
    ## Put in the last sequence
    if (defined $id) {
        $seq_ref->{$id} = $sequence;
    }
    return $seq_ref;
}
1;

Hope this helps in the future
Mr Guy