in reply to how to isolate text in a text file.

I have a text file that has a DNA string in it. I want a way to isolate the DNA string and ignore the rest of the text file.

What's a DNA string? What's in the rest of the text that makes it differ from a DNA string so that we can ignore it? Clarity is the soul of programming.

The following code is intentionally pitched above what I think is the level of your current understanding of Perl in the hope that it may spur your curiosity.

Script extract_dna_seq_1.pl (runs under Perl version 5.8.9):

    use warnings;
    use strict;

    my $base     = qr{ [ATCGatcg] }xms;
    my $sequence = qr{ \b $base+ \b }xms;

    print "sequence '$1' \n\n" if 'GATTACA' =~ m{ ($sequence) }xms;

    my $text = q{
    A DNA sequence consists in ATCG base pairs.
    Here's a AATCCGCTGATT sequence. Such sequences
    act to define protein structure after
    mediation by RNA transcription.
    };

    my $n_captured = my @captures = $text =~ m{ $sequence }xmsg;
    printf "captured %d sequences: %s \n\n",
        $n_captured, join ' ', map qq{'$_'}, @captures;

    my $codon  = qr{ $base{3} }xms;
    my $codons = qr{ \b $codon+ \b }xms;

    $n_captured = @captures = $text =~ m{ $codons }xmsg;
    printf "captured %d codon sequences: %s \n\n",
        $n_captured, join ' ', map qq{'$_'}, @captures;

Output:
    c:\@Work\Perl\monks\undergradguy>perl extract_dna_seq_1.pl
    sequence 'GATTACA'

    captured 5 sequences: 'A' 'ATCG' 'a' 'AATCCGCTGATT' 'act'

    captured 2 codon sequences: 'AATCCGCTGATT' 'act'

The first attempt to define a DNA sequence regex
    my $base = qr{ [ATCGatcg] }xms;
    my $sequence = qr{ \b $base+ \b }xms;
extracts
    5 sequences: 'A' 'ATCG' 'a' 'AATCCGCTGATT' 'act'
from the text. Some of these, "A", "a" and "act", are clearly part of the text and not really DNA sequences. In an effort to refine sequence extraction, one can recognize that codon base-pair triplets are often the unit of interest and define
    my $codon  = qr{ $base{3} }xms;
    my $codons = qr{ \b $codon+ \b }xms;
to extract codon sequences instead. This reduces spurious output, but "act" is still recognized as a valid sequence while "ATCG" is not (because it's not an exact number of codons long!). Well, that's life. Better knowledge of the data will allow greater refinement of the matching regexes and better extraction performance.
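
As one illustration of that kind of refinement, here is a minimal sketch that assumes two things about the data that the original question does not confirm: that real sequences are written in upper case only, and that they are at least some minimum number of bases long. The sub name extract_sequences and the threshold of 6 are made up for the example; adjust both to whatever your file actually contains.

```perl
use warnings;
use strict;

# Hypothetical helper: return every run of upper-case bases that is at
# least $min_length characters long and stands alone as a "word".
sub extract_sequences {
    my ($text, $min_length) = @_;

    my $base     = qr{ [ATCG] }xms;   # assumption: upper case only
    my $sequence = qr{ \b (?: $base ){$min_length,} \b }xms;

    # In list context, m//g with no capture groups returns all matches.
    return $text =~ m{ $sequence }xmsg;
}

my $text = q{
A DNA sequence consists in ATCG base pairs.
Here's a AATCCGCTGATT sequence. Such sequences
act to define protein structure after
mediation by RNA transcription.
};

# Assumed threshold of 6 bases; shorter runs are treated as ordinary text.
my @captures = extract_sequences($text, 6);
printf "captured %d sequences: %s \n\n",
    scalar @captures, join ' ', map qq{'$_'}, @captures;
```

With the sample text above, this should report only 'AATCCGCTGATT': "A", "a" and "act" are rejected for being too short or lower case, and so is "ATCG". Whether throwing away a short run like "ATCG" is acceptable depends entirely on the data, which is the point of the questions at the top of this post.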

Some random notes:

Others have posted links to much useful info; please take advantage of it. That's all for now from me; perhaps more later. Please don't hesitate to post any questions you may have.

Update: Slight wording changes and paragraph reorganization; nothing significant.


Give a man a fish:  <%-{-{-{-<

Re^2: how to isolate text in a text file.
by undergradguy (Novice) on Dec 16, 2018 at 17:03 UTC
    Wow! This is way more than I had hoped for, monks. I was only looking for directions; I didn’t think the community would be this willing to help some dumb undergrad. Thank you all, and this has more than soured my curiosity.
      Apologies, that was typed on my phone while out.
      soured my curiosity

      ?