in reply to remove part of string (DNA)

Since reading and writing files is probably the slowest part of your program, it makes sense to read the file only once. And because the file might be big, you shouldn't read it all into memory at once, but rather process it line by line.

Here's something to get you started. Since you didn't include any example input and expected output data, I can't really help you with the processing step, but maybe it helps anyway:

#!C:/Perl/bin/perl
use strict;
use warnings;
use File::Path;

# This script processes a fasta file containing DNA sequences

# Part 1: declare variables, constants, ...

# forward (F) barcodes
my @forward = ("AGCCTAAGCT", "TCAAGTTAGC", "AGCCTGGCAT", "ACGGTCCATG",
               "ACTTGCCGAT", "ACGGTGGATC", "ATCCGCCTAG", "ATGGCGGTAC");

# reverse (R) barcodes
my @reverse = ("AGCTTAGGCT", "TAGCCTAAGC", "AGCTTGCCAT", "ACGTTCAATG",
               "ACTGGCGGAT", "ACGTTGAATC", "ATCGGCAAGT", "ATGCCGTTAC");

# primers used for Variable Region 1 (V1) and Variable Region 3 (V3) of 16S rRNA
# forward primer (V1 region)
my $V1 = 'AGAGTTTGATCCTGGCTCAG';
# reverse primer (V3 region)
my $V3 = 'GTATTACCGCGGCTGCTGGCA';

# locate the import-file with data
my $input_file = "C:/../input.txt";

# concatenate primer to bar codes:
my @processed = map { $_ . $V1 } @forward;

# construct a regex to search for
my $search_for = join '|', @processed;
# compile it:
$search_for = qr{$search_for};

open my $FASTA_IN,        '<', $input_file      or die $!;
open my $MATCHED_OUT,     '>', 'matched.txt'    or die $!;
open my $NOT_MATCHED_OUT, '>', 'notmatched.txt' or die $!;

# don't read all the data at once,
# rather process it line by line
while (my $line = <$FASTA_IN>) {
    if ($line =~ $search_for) {
        print $MATCHED_OUT $line;
    } else {
        print $NOT_MATCHED_OUT $line;
    }
}

close $FASTA_IN;
close $MATCHED_OUT;
close $NOT_MATCHED_OUT;
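
To see what the matching step does without touching any files, here is a minimal sketch that runs the same barcode-plus-primer regex over a few in-memory strings. The reads are made up for illustration (I don't know what your real data looks like), and only two of the forward barcodes are used to keep it short:

```perl
use strict;
use warnings;

# Two of the forward barcodes and the V1 primer from the script above.
my @forward = ("AGCCTAAGCT", "TCAAGTTAGC");
my $V1      = 'AGAGTTTGATCCTGGCTCAG';

# Barcode followed directly by the primer, joined into one alternation.
my $alternation = join '|', map { $_ . $V1 } @forward;
my $search_for  = qr{$alternation};

# Hypothetical reads, invented for this example only.
my @reads = (
    "AGCCTAAGCT" . $V1 . "TTGACC",   # barcode 1 + primer at the start
    "GGGG" . "TCAAGTTAGC" . $V1,     # barcode 2 + primer, not at the start
    "AGCCTAAGCT" . "TTGACC",         # barcode 1 without the primer
);

# Sort each read into one of two buckets, as the file version does.
my (@matched, @not_matched);
for my $read (@reads) {
    if ($read =~ $search_for) { push @matched,     $read }
    else                      { push @not_matched, $read }
}

printf "%d matched, %d not matched\n", scalar @matched, scalar @not_matched;
```

Note that the pattern is not anchored, so it matches a barcode and primer anywhere in the line. If they must sit at the start of the sequence, you would probably want a `^` in front.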

Re^2: remove part of string (DNA)
by tospo (Hermit) on Apr 12, 2011 at 16:17 UTC
    I think it should be
    my $search_for = '('.join( '|', @processed).')';
    i.e. add round brackets around the pattern group, right?
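
    A small sketch of where the brackets make a difference, using short made-up patterns instead of the real barcodes. As far as I can tell, a pattern compiled with qr is wrapped in its own non-capturing group, so the alternation stays together even when it is interpolated into a larger regex; a plain string does not get that protection:

```perl
use strict;
use warnings;

# Short stand-ins for the barcode+primer strings (assumption, for illustration).
my $alternation = join '|', 'AAA', 'CCC';
my $compiled    = qr{$alternation};

# On its own, the compiled alternation matches either branch.
my $alone = 'xxCCCxx' =~ $compiled ? 1 : 0;

# Inside a larger pattern the qr object behaves like /^(?:AAA|CCC)$/ ...
my $with_qr = 'AAAxx' =~ /^$compiled$/ ? 1 : 0;

# ... while the raw string expands to /^AAA|CCC$/, which matches 'AAAxx'.
my $with_string = 'AAAxx' =~ /^$alternation$/ ? 1 : 0;

print "alone=$alone with_qr=$with_qr with_string=$with_string\n";
```

    So for a plain match the brackets should not be required. They would still be needed if you want to capture which barcode matched (via $1), for example in order to remove it from the line.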