Re: how to isolate text in a text file.

I have a text file that has a DNA string in it. I want a way to isolate the DNA string and ignore the rest of the text file.

What's a DNA string? What's in the rest of the text that makes it differ from a DNA string so that we can ignore it? Clarity is the soul of programming.

The following code is intentionally pitched above what I think is the level of your current understanding of Perl in the hope that it may spur your curiosity.

Script extract_dna_seq_1.pl (runs under Perl version 5.8.9):

use warnings;
use strict;

my $base = qr{ [ATCGatcg] }xms;

my $sequence = qr{ \b $base+ \b }xms;

print "sequence '$1' \n\n" if 'GATTACA' =~ m{ ($sequence) }xms;

my $text = q{
A DNA sequence consists in ATCG base pairs.
Here's a AATCCGCTGATT sequence.
Such sequences act to define protein structure
after mediation by RNA transcription.
};

my $n_captured =
my @captures =
$text =~ m{ $sequence }xmsg;

printf "captured %d sequences: %s \n\n",
       $n_captured, join ' ', map qq{'$_'}, @captures;

my $codon  = qr{ $base{3} }xms;
my $codons = qr{ \b $codon+ \b }xms;

$n_captured =
@captures =
$text =~ m{ $codons }xmsg;

printf "captured %d codon sequences: %s \n\n",
       $n_captured, join ' ', map qq{'$_'}, @captures;

Output:

c:\@Work\Perl\monks\undergradguy>perl extract_dna_seq_1.pl
sequence 'GATTACA'

captured 5 sequences: 'A' 'ATCG' 'a' 'AATCCGCTGATT' 'act'

captured 2 codon sequences: 'AATCCGCTGATT' 'act'

The first attempt to define a DNA sequence regex
my $base = qr{ [ATCGatcg] }xms;
my $sequence = qr{ \b $base+ \b }xms;
extracts
5 sequences: 'A' 'ATCG' 'a' 'AATCCGCTGATT' 'act'
from the text. Some of these, "A", "a" and "act", are clearly part of the text and not really DNA sequences. In an effort to refine sequence extraction, one can recognize that codon base-pair triplets are often the unit of interest and define
my $codon = qr{ $base{3} }xms;
my $codons = qr{ \b $codon+ \b }xms;
to extract codon sequences instead. This reduces spurious output, but "act" is still recognized as a valid sequence while "ATCG" is not (because it's not an exact number of codons long!). Well, that's life. Better knowledge of the data will allow greater refinement of the matching regexes and better extraction performance.
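As one illustration of what such refinement might look like, here is a minimal sketch. It assumes, purely for the sake of example, that genuine sequences in the data are written in upper case only and are at least four bases long. Neither rule comes from the original question, and the names $min_len, $uc_base and $long_seq are invented here.

```perl
use warnings;
use strict;

# ASSUMPTION: real sequences are upper case only and at least
# $min_len bases long. Both rules are guesses about the data,
# and the threshold of 4 is arbitrary.
my $min_len = 4;

my $uc_base  = qr{ [ATCG] }xms;
my $long_seq = qr{ \b (?: $uc_base ){$min_len,} \b }xms;

my $text = q{
A DNA sequence consists in ATCG base pairs.
Here's a AATCCGCTGATT sequence.
Such sequences act to define protein structure
after mediation by RNA transcription.
};

# m//g in list context returns every match of the pattern.
my @captures = $text =~ m{ $long_seq }xmsg;

printf "captured %d sequences: %s \n",
       scalar @captures, join ' ', map qq{'$_'}, @captures;
```

Under these assumptions "A", "a" and "act" are rejected, but "ATCG" still gets through because it looks exactly like a short sequence. Only knowledge of the real data can settle cases like that.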

Some random notes:

The statements
my $base = qr{ [ATCGatcg] }xms;
my $sequence = qr{ \b $base+ \b }xms;
illustrate the concept of "factoring" for regexes; factoring is also useful in designing functions and classes and is also a manifestation of the DRY principle.
The compound assignment statement
my $n_captured =
my @captures =
$text =~ m{ $sequence }xmsg;
captures a series of substrings from a match and then captures the number of substrings captured. What's going on here? For this, see:
- The behavior of a m//g match in list context (as imposed by assignment to the array; see Context tutorial). The /g modifier is described in perlop, where m// and qr// are discussed in the Regexp Quote-Like Operators section; and
- The behavior of an array (or list) in scalar context (imposed by assignment of the array to the scalar).
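The two behaviors in the list above can be seen separately in a small sketch. The sample string and the variable names are invented for illustration. The last statement uses the common "empty list" counting idiom, which relies on a list assignment in scalar context returning the number of elements on its right-hand side.

```perl
use warnings;
use strict;

# Hypothetical sample data for illustration only.
my $text = 'GATTACA and AATCCGCTGATT';

# List context: m//g with no capture groups returns every match.
my @matches = $text =~ m{ \b [ATCG]+ \b }xmsg;

# An array assigned to a scalar yields its element count.
my $n_matches = @matches;

# The two steps chained into one statement. Assignment is
# right-associative, so the array is filled first.
my $n_chained = my @chained = $text =~ m{ \b [ATCG]+ \b }xmsg;

# If only the count is wanted, an empty list can stand in for the array.
my $n_only = () = $text =~ m{ \b [ATCG]+ \b }xmsg;

print "matches: @matches\n";
print "counts: $n_matches $n_chained $n_only\n";
```

All three counts should agree; the chained form is simply the first two statements folded together.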

Others have posted links to much useful info; please take advantage of it. That's all for now from me; perhaps more later. Please don't hesitate to post any questions you may have.

Update: Slight wording, paragraph reorganization; should not be significant.

Give a man a fish: <%-{-{-{-<

Re^2: how to isolate text in a text file. by undergradguy (Novice) on Dec 16, 2018 at 17:03 UTC
Wow! This is way more than I had hoped for monks. I was only looking for directions, I didn’t think the community would be this willing to help some dumb undergrad. Thank you all and this has more than soured my curiosity.
Re^3: how to isolate text in a text file. by undergradguy (Novice) on Dec 16, 2018 at 20:11 UTC
Apologies that was typed on my phone while out.
Re^3: how to isolate text in a text file. by Anonymous Monk on Dec 16, 2018 at 17:25 UTC
soured my curiosity ?