Furor has asked for the wisdom of the Perl Monks concerning the following question:
So, what I want the script to do is read in a file (FASTA format), compare the beginning of every sequence to some
specified strings (called primers) and, when there's a match, remove the matching string from the sequence.
The trimmed sequences should be stored in a new file and, perhaps as a control, the non-matching
sequences in another file. This leads to another question: do you have to store the processed data in an array
before writing it to an output file, or can this be done directly?
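On the direct-writing question, an intermediate array should not be needed: each record can be printed to its output handle as soon as it has been processed. A minimal sketch, assuming every record is a ">" header line followed by one or more sequence lines. The sub name `trim_fasta` and the in-memory filehandles in the demo are illustrative only and not part of the posted script:

```perl
use strict;
use warnings;

# Hypothetical helper: reads FASTA records from $in one at a time and prints
# each one to $out (primer found at the start and removed) or to $rejected
# (no primer at the start). Nothing is collected in an array.
sub trim_fasta {
    my ($in, $out, $rejected, $primer) = @_;

    local $/ = "\n>";                  # read one whole record per iteration
    while (my $record = <$in>) {
        chomp $record;                 # drop the trailing "\n>" separator
        $record =~ tr/\r//d;           # tolerate Windows line endings
        $record =~ s/^>//;             # the first record still has its ">"
        my ($header, @lines) = split /\n/, $record;
        next unless defined $header;   # skip empty records
        my $seq = join '', @lines;     # a sequence may span several lines

        if ($seq =~ s/^\Q$primer\E//) {
            print {$out} ">$header\n$seq\n";
        }
        else {
            print {$rejected} ">$header\n$seq\n";
        }
    }
    return;
}

# Demo with in-memory filehandles; a real script would open files instead.
my $fasta = ">seq1\nAGAGTTTGATCCTGGCTCAGTTTT\n>seq2\nCCCCGGGG\n";
my ($trimmed, $unmatched) = ('', '');
open my $in,  '<', \$fasta     or die "Cannot open input: $!";
open my $out, '>', \$trimmed   or die "Cannot open output: $!";
open my $rej, '>', \$unmatched or die "Cannot open rejects: $!";
trim_fasta($in, $out, $rej, 'AGAGTTTGATCCTGGCTCAG');
close $_ for $in, $out, $rej;
```

Because the header and its sequence are read as one record, they stay together in whichever file they end up in.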
Normally, the primers to be compared start at the beginning of the sequence but, to guard against possible errors, it
might be useful to also delete anything that comes before the primer.
(Btw, each sequence is preceded by a unique identifier line (indicated by ">"), so the two should always remain
together.)
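Deleting everything up to and including the primer could be sketched with `index` and `substr`. The helper name `trim_through_primer` and the optional `$max_offset` limit (how far into the sequence the primer may start) are assumptions added for illustration, not something from the posted script:

```perl
use strict;
use warnings;

# Hypothetical helper: removes the primer and everything before it.
# Returns the trimmed sequence, or undef when the primer is absent or
# starts later than the optional $max_offset.
sub trim_through_primer {
    my ($seq, $primer, $max_offset) = @_;
    my $pos = index($seq, $primer);    # leftmost occurrence, -1 if absent
    return undef if $pos < 0;
    return undef if defined $max_offset && $pos > $max_offset;
    return substr($seq, $pos + length($primer));
}

# Keeping the header and sequence in one structure keeps them together.
my %record = (
    header => '>seq1',
    seq    => 'NNAGAGTTTGATCCTGGCTCAGACGT',
);
my $trimmed_seq = trim_through_primer($record{seq}, 'AGAGTTTGATCCTGGCTCAG', 5);
$record{seq} = $trimmed_seq if defined $trimmed_seq;
```

Limiting the offset is a design choice: without it, a primer-like stretch deep inside a read would cause most of that read to be thrown away.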
Also, I was wondering whether it makes a difference in speed if the script checks the entire file (a couple of thousand rows) separately for each primer, or goes row by row and checks every primer against each row (so it only goes through the file once). Does this make sense? :o)
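For the single-pass variant, one option is to combine all barcode+primer strings into one precompiled pattern and test each sequence against it once. This is only a sketch: it uses the first two forward barcodes and the V1 primer from the posted script, the name `trim_any` is hypothetical, and whether it is measurably faster on a file of a few thousand rows has not been benchmarked here:

```perl
use strict;
use warnings;

# First two forward barcodes and the V1 primer from the posted script.
my @forward = ('AGCCTAAGCT', 'TCAAGTTAGC');
my $V1      = 'AGAGTTTGATCCTGGCTCAG';

# One alternation covering every barcode+primer combination, compiled once.
my $alternatives = join '|', map { quotemeta($_ . $V1) } @forward;
my $any_primer   = qr/^($alternatives)/;

# Hypothetical helper: returns (trimmed sequence, matched prefix), or an
# empty list when no barcode+primer combination starts the sequence.
sub trim_any {
    my ($seq) = @_;
    return unless $seq =~ $any_primer;
    my $prefix = $1;
    return (substr($seq, length $prefix), $prefix);
}

my ($rest, $prefix) = trim_any('TCAAGTTAGC' . $V1 . 'GGCC');
my @no_match        = trim_any('GGCC');
```

Returning the matched prefix as well makes it possible to tell which barcode a sequence carried, should the output need to be split per barcode.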
Don't mind the comments too much :)
Thanks in advance!
#! C:/Perl/bin
use strict;
use warnings;
use File::Path;

# This script processes a fasta file containing DNA sequences

# Part 1: declare variables, constants, ...

# forward (F) barcodes
my @forward = ("AGCCTAAGCT", "TCAAGTTAGC", "AGCCTGGCAT", "ACGGTCCATG",
               "ACTTGCCGAT", "ACGGTGGATC", "ATCCGCCTAG", "ATGGCGGTAC");

# reverse (R) barcodes
my @reverse = ("AGCTTAGGCT", "TAGCCTAAGC", "AGCTTGCCAT", "ACGTTCAATG",
               "ACTGGCGGAT", "ACGTTGAATC", "ATCGGCAAGT", "ATGCCGTTAC");

# primers used for Variable Region 1 (V1) and Variable Region 3 (V3) of 16S rRNA
# forward primer (V1 region)
my $V1 = 'AGAGTTTGATCCTGGCTCAG';
# reverse primer (V3 region)
my $V3 = 'GTATTACCGCGGCTGCTGGCA';

# locate the import-file with data
my $input_file = "C:/../input.txt";

# name the filehandler: FASTA_IN (and stop if the file cannot be opened)
open(FASTA_IN, '<', $input_file) or die "Cannot open $input_file: $!";

# import data (fasta formatted style) as array to read all sequences
my @raw_DNA = <FASTA_IN>;

#test imported data
#print "@raw_DNA\n";

# close the import-file
close FASTA_IN;

# Part 3: start processing sequences

# 3.1 Create arrays to hold processed results
my @Processed_Sequences = ();
my @Rejected_Sequences = ();

# 3.2 concatenate each barcode with appropriate primer
for my $current_barcode (0..$#forward) {
    # no trailing newline here, so that $F can later be used in a pattern
    my $F = "$forward[$current_barcode]$V1";

    #test concatenation
    # print "$F\n";

    #test current concatenated barcode.primer against sequences and if match,
    #remove the barcode and primer
    # =~ m/$F/;
    #if match print match $F
}
Re: remove part of string (DNA)
by moritz (Cardinal) on Apr 12, 2011 at 12:19 UTC
by tospo (Hermit) on Apr 12, 2011 at 16:17 UTC

Re: remove part of string (DNA)
by tospo (Hermit) on Apr 12, 2011 at 16:39 UTC
by tospo (Hermit) on Apr 12, 2011 at 16:43 UTC
by Furor (Novice) on Apr 13, 2011 at 08:00 UTC
by mrguy123 (Hermit) on Apr 13, 2011 at 11:10 UTC

Re: remove part of string (DNA)
by Generoso (Prior) on Apr 12, 2011 at 15:58 UTC
by Generoso (Prior) on Apr 12, 2011 at 16:17 UTC

Re: remove part of string (DNA)
by educated_foo (Vicar) on Apr 15, 2011 at 16:31 UTC