in reply to how to isolate text in a text file.
I have a text file that has a DNA string in it. I want a way to isolate the DNA string and ignore the rest of the text file.
What's a DNA string? What's in the rest of the text that makes it differ from a DNA string so that we can ignore it? Clarity is the soul of programming.
The following code is intentionally pitched above what I think is the level of your current understanding of Perl in the hope that it may spur your curiosity.
Script extract_dna_seq_1.pl (runs under Perl version 5.8.9):
    use warnings;
    use strict;

    my $base     = qr{ [ATCGatcg] }xms;
    my $sequence = qr{ \b $base+ \b }xms;

    print "sequence '$1' \n\n" if 'GATTACA' =~ m{ ($sequence) }xms;

    my $text = q{
    A DNA sequence consists in ATCG base pairs. Here's a AATCCGCTGATT
    sequence. Such sequences act to define protein structure after
    mediation by RNA transcription.
    };

    my $n_captured = my @captures = $text =~ m{ $sequence }xmsg;
    printf "captured %d sequences: %s \n\n",
        $n_captured, join ' ', map qq{'$_'}, @captures;

    my $codon  = qr{ $base{3} }xms;
    my $codons = qr{ \b $codon+ \b }xms;

    $n_captured = @captures = $text =~ m{ $codons }xmsg;
    printf "captured %d codon sequences: %s \n\n",
        $n_captured, join ' ', map qq{'$_'}, @captures;
Output:
    c:\@Work\Perl\monks\undergradguy>perl extract_dna_seq_1.pl
    sequence 'GATTACA'

    captured 5 sequences: 'A' 'ATCG' 'a' 'AATCCGCTGATT' 'act'

    captured 2 codon sequences: 'AATCCGCTGATT' 'act'
The first attempt to define a DNA sequence regex
my $base = qr{ [ATCGatcg] }xms;
my $sequence = qr{ \b $base+ \b }xms;
extracts
5 sequences: 'A' 'ATCG' 'a' 'AATCCGCTGATT' 'act'
from the text. Some of these, "A", "a" and "act", are clearly part of the text and not really DNA sequences. In an effort to refine sequence extraction, one can recognize that codon base-pair triplets are often the unit of interest and define
my $codon = qr{ $base{3} }xms;
my $codons = qr{ \b $codon+ \b }xms;
to extract codon sequences instead. This reduces spurious output, but "act" is still recognized as a valid sequence while "ATCG" is not (because it's not an exact number of codons long!). Well, that's life. Better knowledge of the data will allow greater refinement of the matching regexes and better extraction performance.
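As one illustration of the kind of refinement meant here, the sketch below assumes two things that the original question does not actually tell us: that genuine sequences are written in upper case only, and that they are at least some minimum number of bases long. The sub name extract_sequences and the threshold of 6 are made up for the example; adjust or discard them once you know what your data really looks like.

```perl
use warnings;
use strict;

# Assumptions (NOT from the original question): real sequences are
# upper-case only and at least $min_len bases long.
# extract_sequences() is a hypothetical helper name.
sub extract_sequences {
    my ($text, $min_len) = @_;

    my $base = qr{ [ATCG] }xms;
    # Keep the {n,} quantifier free of whitespace, even under /x.
    my $sequence = qr{ \b (?: $base ){$min_len,} \b }xms;

    # In list context, m//g without capture groups returns every match.
    return $text =~ m{ $sequence }xmsg;
}

my $text = q{
A DNA sequence consists in ATCG base pairs. Here's a AATCCGCTGATT
sequence. Such sequences act to define protein structure after
mediation by RNA transcription.
};

my @found = extract_sequences($text, 6);    # 6 is an arbitrary threshold
printf "captured %d sequences: %s \n", scalar @found,
    join ' ', map qq{'$_'}, @found;
```

With the sample text, only 'AATCCGCTGATT' survives: "A", "a" and "act" are rejected, but so is the short "ATCG". Whether that trade-off is acceptable depends entirely on the data.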
Some random notes:
Update: Slight wording, paragraph reorganization; should not be significant.
Give a man a fish: <%-{-{-{-<