
I have a text file that has a DNA string in it. I want a way to isolate the DNA string and ignore the rest of the text file.

What's a DNA string? What's in the rest of the text that makes it differ from a DNA string so that we can ignore it? Clarity is the soul of programming.

The following code is intentionally pitched above what I think is the level of your current understanding of Perl in the hope that it may spur your curiosity.

Script extract_dna_seq_1.pl (runs under Perl version 5.8.9):

use warnings;
use strict;

# A single nucleotide base, in either case.
my $base = qr{ [ATCGatcg] }xms;

# One or more bases standing alone as a "word".
my $sequence = qr{ \b $base+ \b }xms;

# Sanity check: a lone sequence should match in its entirety.
print "sequence '$1' \n\n" if 'GATTACA' =~ m{ ($sequence) }xms;

my $text = q{
A DNA sequence consists in ATCG base pairs.
Here's a AATCCGCTGATT sequence.
Such sequences act to define protein structure
after mediation by RNA transcription.
};

# m//g in list context returns every match; the scalar on the
# left then receives the number of matches.
my $n_captured =
my @captures =
$text =~ m{ $sequence }xmsg;

printf "captured %d sequences: %s \n\n",
       $n_captured, join ' ', map qq{'$_'}, @captures;

# Tighter definition: a whole number of three-base codons.
my $codon  = qr{ $base{3} }xms;
my $codons = qr{ \b $codon+ \b }xms;

$n_captured =
@captures =
$text =~ m{ $codons }xmsg;

printf "captured %d codon sequences: %s \n\n",
       $n_captured, join ' ', map qq{'$_'}, @captures;

Output:

c:\@Work\Perl\monks\undergradguy>perl extract_dna_seq_1.pl
sequence 'GATTACA'

captured 5 sequences: 'A' 'ATCG' 'a' 'AATCCGCTGATT' 'act'

captured 2 codon sequences: 'AATCCGCTGATT' 'act'

The first attempt to define a DNA sequence regex
my $base = qr{ [ATCGatcg] }xms;
my $sequence = qr{ \b $base+ \b }xms;
extracts
5 sequences: 'A' 'ATCG' 'a' 'AATCCGCTGATT' 'act'
from the text. Some of these, "A", "a" and "act", are clearly part of the text and not really DNA sequences. In an effort to refine sequence extraction, one can recognize that codon base-pair triplets are often the unit of interest and define
my $codon = qr{ $base{3} }xms;
my $codons = qr{ \b $codon+ \b }xms;
to extract codon sequences instead. This reduces spurious output, but "act" is still recognized as a valid sequence while "ATCG" is not (because it's not an exact number of codons long!). Well, that's life. Better knowledge of the data will allow greater refinement of the matching regexes and better extraction performance.
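
As one illustration of such refinement, here is a minimal sketch. It assumes (my assumption; nothing in the original question says so) that genuine sequences are written in upper case and are at least eight bases long. The name $min_len and its value are arbitrary choices for illustration only.

```perl
use warnings;
use strict;

# ASSUMPTION: real sequences are upper case only and at least
# $min_len bases long. Both rules are guesses about the data.
my $min_len = 8;

my $base     = qr{ [ATCG] }xms;
my $sequence = qr{ \b (?: $base ){$min_len,} \b }xms;

my $text = q{
A DNA sequence consists in ATCG base pairs.
Here's a AATCCGCTGATT sequence.
Such sequences act to define protein structure
after mediation by RNA transcription.
};

# m//g in list context returns every whole match.
my @captures = $text =~ m{ $sequence }xmsg;

printf "captured %d sequences: %s \n",
       scalar @captures, join ' ', map qq{'$_'}, @captures;
```

With these assumed rules, "A", "a", "act" and "ATCG" are all rejected and only 'AATCCGCTGATT' survives. Whether a length floor or a case restriction is appropriate depends entirely on what the real file looks like.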

Some random notes:

The statements
my $base = qr{ [ATCGatcg] }xms;
my $sequence = qr{ \b $base+ \b }xms;
illustrate the concept of "factoring" for regexes; factoring is also useful in designing functions and classes and is also a manifestation of the DRY principle.
The compound assignment statement
my $n_captured =
my @captures =
$text =~ m{ $sequence }xmsg;
captures every matching substring into the array and then stores the number of substrings captured in the scalar. What's going on here? For this, see:
- The behavior of a m//g match (see the /g modifier in perlop (update: m// and qr// discussed in Regexp Quote-Like Operators section)) in list context (as imposed by assignment to the array; see Context tutorial); and
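- The behavior of a list assignment in scalar context (imposed by assigning its result to the scalar): it yields the number of elements on the right-hand side, which here is the same as the number of elements in the array.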
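
The two context rules can be seen separately in a small sketch. The sample string and the variable names other than $n_captured and @captures are made up for illustration.

```perl
use warnings;
use strict;

# Hypothetical sample text, upper-case bases only for simplicity.
my $text = 'Here is a AATCCGCTGATT sequence and ATCG too.';
my $seq  = qr{ \b [ATCG]+ \b }xms;

# Step 1: assignment to an array imposes list context on the match,
# so m//g returns all matches at once (there are no capture groups).
my @captures = $text =~ m{ $seq }xmsg;

# Step 2: an array evaluated in scalar context yields its element count.
my $n_captured = @captures;

# The same count without keeping the matches: a list assignment in
# scalar context yields the number of right-hand-side elements.
my $count_only = () = $text =~ m{ $seq }xmsg;

print "captured $n_captured: @captures\n";
print "count only: $count_only\n";
```

The compound statement in the script simply chains steps 1 and 2 into one expression.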

Others have posted links to much useful info; please take advantage of it. That's all for now from me; perhaps more later. Please don't hesitate to post any questions you may have.

Update: Slight wording, paragraph reorganization; should not be significant.

Give a man a fish: <%-{-{-{-<

In reply to Re: how to isolate text in a text file. by AnomalousMonk
in thread how to isolate text in a text file. by undergradguy
