I have a text file that has a DNA string in it. I want a way to isolate the DNA string and ignore the rest of the text file.

What's a DNA string? What's in the rest of the text that makes it differ from a DNA string so that we can ignore it? Clarity is the soul of programming.

The following code is intentionally pitched above what I think is the level of your current understanding of Perl in the hope that it may spur your curiosity.

Script extract_dna_seq_1.pl (runs under Perl version 5.8.9):

    use warnings;
    use strict;

    my $base     = qr{ [ATCGatcg] }xms;
    my $sequence = qr{ \b $base+ \b }xms;

    print "sequence '$1' \n\n" if 'GATTACA' =~ m{ ($sequence) }xms;

    my $text = q{
    A DNA sequence consists in ATCG base pairs. Here's a AATCCGCTGATT
    sequence. Such sequences act to define protein structure after
    mediation by RNA transcription.
    };

    my $n_captured = my @captures = $text =~ m{ $sequence }xmsg;
    printf "captured %d sequences: %s \n\n",
        $n_captured, join ' ', map qq{'$_'}, @captures;

    my $codon  = qr{ $base{3} }xms;
    my $codons = qr{ \b $codon+ \b }xms;

    $n_captured = @captures = $text =~ m{ $codons }xmsg;
    printf "captured %d codon sequences: %s \n\n",
        $n_captured, join ' ', map qq{'$_'}, @captures;
Output:
    c:\@Work\Perl\monks\undergradguy>perl extract_dna_seq_1.pl
    sequence 'GATTACA'

    captured 5 sequences: 'A' 'ATCG' 'a' 'AATCCGCTGATT' 'act'

    captured 2 codon sequences: 'AATCCGCTGATT' 'act'

The first attempt to define a DNA sequence regex
    my $base = qr{ [ATCGatcg] }xms;
    my $sequence = qr{ \b $base+ \b }xms;
extracts
    5 sequences: 'A' 'ATCG' 'a' 'AATCCGCTGATT' 'act'
from the text. Some of these, "A", "a" and "act", are clearly part of the text and not really DNA sequences. In an effort to refine sequence extraction, one can recognize that codon base-pair triplets are often the unit of interest and define
    my $codon  = qr{ $base{3} }xms;
    my $codons = qr{ \b $codon+ \b }xms;
to extract codon sequences instead. This reduces spurious output, but "act" is still recognized as a valid sequence while "ATCG" is not (because it's not a whole number of codons long!). Well, that's life. Better knowledge of the data will allow greater refinement of the matching regexes and better extraction performance.
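
As one illustration of that kind of refinement, here is a minimal sketch that assumes real sequences in the data are never shorter than some threshold. Both the helper name extract_long_sequences and the cut-off of 8 bases are my own inventions for the example, not something known about the actual file. The quantifier is applied to a non-capturing group, (?: $base ){$min_len,}, so that Perl interpolates $base and $min_len as plain scalars.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the script above): return every run of
# bases in $text that is at least $min_len characters long.
sub extract_long_sequences {
    my ($text, $min_len) = @_;

    my $base     = qr{ [ATCGatcg] }xms;
    my $long_seq = qr{ \b (?: $base ){$min_len,} \b }xms;

    # In list context, m{...}g with no capture groups returns all matches.
    return $text =~ m{ $long_seq }xmsg;
}

my $text = q{
A DNA sequence consists in ATCG base pairs. Here's a AATCCGCTGATT
sequence. Such sequences act to define protein structure after
mediation by RNA transcription.
};

# The threshold of 8 is an arbitrary, assumed cut-off.
my @captures = extract_long_sequences($text, 8);
printf "captured %d sequences: %s \n",
    scalar @captures, join ' ', map qq{'$_'}, @captures;
```

On the sample text this should leave only 'AATCCGCTGATT', since "A", "ATCG", "a" and "act" are all shorter than the threshold. The price is that a genuinely short sequence would be dropped too, so the cut-off has to come from knowledge of the data.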

Some random notes:

Others have posted links to much useful info; please take advantage of it. That's all for now from me; perhaps more later. Please don't hesitate to post any questions you may have.

Update: Slight wording, paragraph reorganization; should not be significant.


Give a man a fish:  <%-{-{-{-<


In reply to Re: how to isolate text in a text file. by AnomalousMonk
in thread how to isolate text in a text file. by undergradguy
