ini2005 has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks, I am working on large DNA files and I want to replace the first and last n/N of every run of n/N. The files look something like this:

acacccacacacaccacacccacacaccacacccacacccacacaccaca
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
cccacaccacacccacacaccacacaccacacccacacccacacacacca
cacccacacaccacacccacacacaccctaaccctaacccctaaccccta
accctaacccnnnnnnnnnnnnnnnnnnnnnnnnnnnccctaaccctaac
ccctaaccctaaccctaaccgtaaccctaaccctttaccctaacccgaac
ccctaacnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnggggg
gaccctgaccgtgaccctgaccctaacccgaacccgaacccgaaccccga
accccgaaccccgaaccccaaccccaaccccaaccccaaccctaacccct
caccctcaccctcgacccccgacccccgacccccgacccccaccccgaac
ggnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnaccctaaccctaaaaccctaaccctagcc
ctagccctagccctagccctaacccctaacccctaaccctaagccgaagc

It should look like this ('^' in place of the first and last n of each run):

acacccacacacaccacacccacacaccacacccacacccacacaccaca
^nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn^
cccacaccacacccacacaccacacaccacacccacacccacacacacca
cacccacacaccacacccacacacaccctaaccctaacccctaaccccta
accctaaccc^nnnnnnnnnnnnnnnnnnnnnnnnn^ccctaaccctaac
ccctaaccctaaccctaaccgtaaccctaaccctttaccctaacccgaac
ccctaac^nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn^ggggg
gaccctgaccgtgaccctgaccctaacccgaacccgaacccgaaccccga
accccgaaccccgaaccccaaccccaaccccaaccccaaccctaacccct
caccctcaccctcgacccccgacccccgacccccgacccccaccccgaac
gg^nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnn^accctaaccctaaaaccctaaccctagcc
ctagccctagccctagccctaacccctaacccctaaccctaagccgaagc
Any ideas how I can do this efficiently?

Replies are listed 'Best First'.
Re: replace fist and last occurrences of N
by jwkrahn (Abbot) on Jul 12, 2008 at 13:18 UTC

    Like this:   s/(?<=[^n])n|n(?=[^n])/^/ig

      Hey, thanks for the quick reply! Where do I put this line? I mean, how do I use it? Thanks
        shell$ perl -pi.bak -le 's/(?<=[^n])n|n(?=[^n])/^/ig' yourfile.txt
        will do the substitution and leave a backup file named yourfile.txt.bak

        UPDATE: it will not work, because your file has newlines (somehow I missed it). Try this instead:

        perl -pi.bak -0777 -e 's/n([n\n]*)n/^$1^/ig' yourfile.txt
        []s, HTH, Massa
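
        A small sketch of why the per-line one-liner misses some run ends while the whole-file substitution does not. The three-line sample and the sub names per_line and slurped are made up for illustration; per_line drops the line endings before substituting, roughly as -l does under -p.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample: one n-run that starts at the beginning of line 2
# and ends in the middle of line 3.
my $sample = "acca\nnnnn\nnngg\n";

# Per-line processing: every line is substituted on its own, without its
# newline, so the lookarounds cannot see the neighbouring lines.
sub per_line {
    my ($text) = @_;
    my @lines = split /\n/, $text;
    s/(?<=[^n])n|n(?=[^n])/^/ig for @lines;
    return join( "\n", @lines ) . "\n";
}

# Whole-text processing: the middle of a run may contain newlines.
sub slurped {
    my ($text) = @_;
    $text =~ s/n([n\n]*)n/^$1^/ig;
    return $text;
}

print per_line($sample);    # the n that opens line 2 is left alone
print "--\n";
print slurped($sample);     # both ends of the run become ^
```

        With this sample the per-line version only marks the end of the run, because the first n has nothing in front of it on its own line.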
Re: replace fist and last occurrences of N
by linuxer (Curate) on Jul 12, 2008 at 13:23 UTC

    What does "large file" mean in your context? What file sizes are we talking about?

    I think this could work for files smaller than 100 MB. I haven't tested it with large files.

    #!/usr/bin/perl
    use strict;
    use warnings;

    {
        local $/;
        my $data = <DATA>;
        $data =~ s/((?:n+\n?)+n+)/replace($1)/gme;
        print $data, "\n";

        sub replace {
            my $s = shift;
            substr( $s, 0, 1, '^' );
            substr( $s, -1, 1, '^' );
            return $s;
        }
    }

    __DATA__
    acacccacacacaccacacccacacaccacacccacacccacacaccaca
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    cccacaccacacccacacaccacacaccacacccacacccacacacacca
    cacccacacaccacacccacacacaccctaaccctaacccctaaccccta
    accctaacccnnnnnnnnnnnnnnnnnnnnnnnnnnnccctaaccctaac
    ccctaaccctaaccctaaccgtaaccctaaccctttaccctaacccgaac
    ccctaacnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnggggg
    gaccctgaccgtgaccctgaccctaacccgaacccgaacccgaaccccga
    accccgaaccccgaaccccaaccccaaccccaaccccaaccctaacccct
    caccctcaccctcgacccccgacccccgacccccgacccccaccccgaac
    ggnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnaccctaaccctaaaaccctaaccctagcc
    ctagccctagccctagccctaacccctaacccctaaccctaagccgaagc
      Thanks. The files are up to 1 GB...

        OK, what about newlines? Are there newlines in your data files? Or are they just one long string of characters from the class [a-zA-Z]?

        update:
        I know that for DNA information the character class can be much smaller ;o))

      massa's post made me think about my script. I wonder why I got stuck on the idea of letting an extra subroutine do the replacement... and why I forgot everything about character classes...

      Here's an updated version of my script without an extra subroutine. All work is done by the regex. And the DNA data is now read from a file (so if your system has enough memory this might hopefully work for you).

      #!/usr/bin/perl
      use strict;
      use warnings;

      my $file = shift @ARGV;
      die "no dna file specified!\n" if !defined $file;

      open my $fh, '<', $file or die "$file: $!\n";
      my $data = do { local $/; <$fh> };
      close $fh;

      $data =~ s/n([nN\n]+)n/^$1^/gm;
      print $data, "\n";

      __END__
Re: replace fist and last occurrences of N
by jethro (Monsignor) on Jul 12, 2008 at 13:38 UTC
    Another important question: What's the minimum length of n's? If it is less than 3 what should be done with 'nn' and especially 'n'?

    Never forget the edge cases.

      No, all n sequences are longer than 30.
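
      For readers whose data does contain short runs, here is a hedged sketch of how two of the substitutions posted in this thread treat them. The sub names lookaround and bracketed and the three test strings are invented for the comparison.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# jwkrahn's pattern: an n next to a non-n character becomes ^.
sub lookaround {
    my ($s) = @_;
    $s =~ s/(?<=[^n])n|n(?=[^n])/^/ig;
    return $s;
}

# Massa's pattern: needs two n's, one for each end of the run.
sub bracketed {
    my ($s) = @_;
    $s =~ s/n([n\n]*)n/^$1^/ig;
    return $s;
}

# Runs of length 1, 2 and 3 surrounded by other bases.
for my $in ( 'ana', 'anna', 'annna' ) {
    printf "%-6s lookaround: %-6s bracketed: %s\n",
        $in, lookaround($in), bracketed($in);
}
```

      The two agree on runs of two or more. They differ on a lone n: the lookaround pattern turns it into ^, while the bracketed pattern leaves it untouched because it cannot find a second n.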
Re: replace fist and last occurrences of N
by jethro (Monsignor) on Jul 12, 2008 at 14:43 UTC
    Try this if your files are too big for your memory:
    #!/usr/bin/perl -w
    use strict;
    use warnings;

    my $filename = shift;
    open my $fh, '<', $filename or die "No input file\n";
    open my $ofh, '>', "$filename.new" or die "Can't open output file\n";

    my $previousline = undef;
    while (<$fh>) {
        chomp;
        s/(?<=[^n])n|n(?=[^n])/^/ig;    #borrowed from jwkrahn
        if (defined $previousline) {
            ($previousline =~ /[^n]$/i) and (s/^n/^/i);
            (/^[^n]/i) and ($previousline =~ s/n$/^/i);
            print $ofh $previousline, "\n";
        }
        $previousline = $_;
    }
    print $ofh $previousline, "\n" if (defined $previousline);
    close $fh;
    close $ofh;
    rename "$filename.new", $filename;
    This creates a temporary copy of the data file, so you have to make sure there is enough free space on disk, or preferably check the return values of the print statements for errors.
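
    A minimal sketch of what checking those return values could look like. write_lines is a hypothetical helper, not part of the script above, and the temporary file exists only for the demonstration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: writes each line and dies as soon as a print or
# the final close reports failure (for example when the disk is full).
sub write_lines {
    my ( $path, @lines ) = @_;
    open my $ofh, '>', $path or die "Can't open $path: $!\n";
    for my $line (@lines) {
        print {$ofh} $line, "\n" or die "Write to $path failed: $!\n";
    }
    # Output is buffered, so an error may only show up when closing.
    close $ofh or die "Close of $path failed: $!\n";
    return scalar @lines;
}

# Demonstration on a throwaway file that is removed at exit.
my ( undef, $tmp ) = tempfile( UNLINK => 1 );
my $count = write_lines( $tmp, 'acc^', 'n^gg' );
print "wrote $count lines\n";
```

    In the script above the same idea would mean doing the rename only after every print and the close of the output handle have succeeded, so a failed write does not replace the original file.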

      This replaces the first and last n in each n run with ^ (rather than quoting the n run). It also "quotes" all n runs, not just the first and last as required by the OP.


      Perl is environmentally friendly - it saves trees
        Read again. The OP wanted a replacement and not a quote.

        I made sure now by making a diff between the output of my program and what he gave as wanted. Absolutely identical.

Re: replace fist and last occurrences of N
by johngg (Canon) on Jul 12, 2008 at 23:40 UTC
    This method reads the DNA file a line at a time in an infinite while loop, accumulating the lines in a buffer until the buffer doesn't end with an 'n'. Then a global replace of any first and last 'n's is done and the modified buffer is then printed to the output file and cleared. When the script detects that end-of-file has been found, it modifies and prints the buffer before exiting the infinite loop and closing the filehandles.

    use strict;
    use warnings;

    open my $dnaFH, q{<}, \ <<'EOD' or die qq{open: HEREDOC: $!\n};
    acacccacacacaccacacccacacaccacacccacacccacacaccaca
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    cccacaccacacccacacaccacacaccacacccacacccacacacacca
    cacccacacaccacacccacacacaccctaaccctaacccctaaccccta
    accctaacccnnnnnnnnnnnnnnnnnnnnnnnnnnnccctaaccctaac
    ccctaaccctaaccctaaccgtaaccctaaccctttaccctaacccgaac
    ccctaacnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnggggg
    gaccctgaccgtgaccctgaccctaacccgaacccgaacccgaaccccga
    accccgaaccccgaaccccaaccccaaccccaaccccaaccctaacccct
    caccctcaccctcgacccccgacccccgacccccgacccccaccccgaac
    ggnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnaccctaaccctaaaaccctaaccctagcc
    ctagccctagccctagccctaacccctaacccctaaccctaagccgaagc
    gaccctgannnnnnncccccgaccccnnnnnnngccctaaccctaaccct
    gaccctgannnnnnnnnnnnnnctaacccgaacccgaacccgaaccnnnn
    EOD

    my $rxReplaceN = qr
        {(?xi)
            (?<![nN\n])n
            |
            n(?![nN\n])
            |
            n(?=\n[^nN])
            |
            n(?=\n\z)
        };

    my $outFile = q{spw697207.out};
    open my $outFH, q{>}, $outFile
        or die qq{open: > $outFile: $!\n};

    my $acc = q{};
    while ( 1 ) {
        emitAcc(), last if eof $dnaFH;
        $acc .= <$dnaFH>;
        emitAcc() unless $acc =~ m{[nN]$};
    }

    close $dnaFH or die qq{close: HEREDOC: $!\n};
    close $outFH or die qq{close: > $outFile: $!\n};

    sub emitAcc {
        $acc =~ s{$rxReplaceN}{^}g;
        print $outFH $acc;
        $acc = q{};
    }

    The output.

    acacccacacacaccacacccacacaccacacccacacccacacaccaca
    ^nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn^
    cccacaccacacccacacaccacacaccacacccacacccacacacacca
    cacccacacaccacacccacacacaccctaaccctaacccctaaccccta
    accctaaccc^nnnnnnnnnnnnnnnnnnnnnnnnn^ccctaaccctaac
    ccctaaccctaaccctaaccgtaaccctaaccctttaccctaacccgaac
    ccctaac^nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn^ggggg
    gaccctgaccgtgaccctgaccctaacccgaacccgaacccgaaccccga
    accccgaaccccgaaccccaaccccaaccccaaccccaaccctaacccct
    caccctcaccctcgacccccgacccccgacccccgacccccaccccgaac
    gg^nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnn^accctaaccctaaaaccctaaccctagcc
    ctagccctagccctagccctaacccctaacccctaaccctaagccgaagc
    gaccctga^nnnnn^cccccgacccc^nnnnn^gccctaaccctaaccct
    gaccctga^nnnnnnnnnnnn^ctaacccgaacccgaacccgaacc^nn^

    I hope this is of interest.

    Cheers,

    JohnGG

Re: replace fist and last occurrences of N
by swampyankee (Parson) on Jul 12, 2008 at 14:03 UTC

    Well, if you've got enough memory you could read the entire file into memory and use a regex. Alternatively, you could read the file a character at a time (read the documentation for $/), and change the current character if it's /n/i and the previous one wasn't, or the previous character if it's /n/i and the current one isn't. I've got to run some errands, so I'll try to post some code later.
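
    A rough sketch of the character-at-a-time idea described above, not swampyankee's own code. The name mark_runs is hypothetical, and two behaviours are assumptions of this sketch: line breaks are passed through without interrupting a run, and a run that is still open at end of file gets its last n replaced. Setting $/ to a reference to 1 makes each read return a single byte.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Copies $in to $out one character at a time, turning the first and last
# n/N of every run into '^'. The previous character is held back until
# the next one has been seen, so it can still be changed.
sub mark_runs {
    my ( $in, $out ) = @_;
    local $/ = \1;    # fixed-size records: one byte per read
    my ( $prev, $prev_is_n, $gap ) = ( '', 0, '' );
    while ( defined( my $c = <$in> ) ) {
        if ( $c eq "\n" || $c eq "\r" ) {    # keep line breaks with $prev
            $gap .= $c;
            next;
        }
        my $is_n = $c =~ /n/i ? 1 : 0;
        my $emit = $c;
        $emit = '^' if $is_n  && !$prev_is_n;    # current starts a run
        $prev = '^' if !$is_n && $prev_is_n;     # previous ended a run
        print {$out} $prev, $gap;
        ( $prev, $prev_is_n, $gap ) = ( $emit, $is_n, '' );
    }
    $prev = '^' if $prev_is_n;    # assumption: EOF closes an open run
    print {$out} $prev, $gap;
    return;
}

# Demonstration with in-memory filehandles and a made-up sample.
my $input  = "acca\nnnnn\nnngg\n";
my $result = '';
open my $in,  '<', \$input  or die "open input: $!\n";
open my $out, '>', \$result or die "open output: $!\n";
mark_runs( $in, $out );
close $out;
print $result;
```

    Memory use stays constant this way, but reading a file of around 1 GB byte by byte in Perl is likely to be much slower than a line-based or slurping approach.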



Re: replace fist and last occurrences of N
by GrandFather (Saint) on Jul 13, 2008 at 00:04 UTC

    Update: this is an answer to the wrong question! It solves the problem of quoting only the first and last n/N runs. The OP wants to replace the first and last N/n in each run with ^.

    The following (rather fussy because of special cases) code remembers where the last N/n run started then rewrites the last part of the output file with the markers inserted.

    use strict;
    use warnings;

    my $data = <<DATA;
    acacccacacacaccacacccacacaccacacccacacccacacaccaca
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    cccacaccacacccacacaccacacaccacacccacacccacacacacca
    cacccacacaccacacccacacacaccctaaccctaacccctaaccccta
    accctaacccnnnnnnnnnnnnnnnnnnnnnnnnnnnccctaaccctaac
    ccctaaccctaaccctaaccgtaaccctaaccctttaccctaacccgaac
    ccctaacnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnggggg
    gaccctgaccgtgaccctgaccctaacccgaacccgaacccgaaccccga
    accccgaaccccgaaccccaaccccaaccccaaccccaaccctaacccct
    caccctcaccctcgacccccgacccccgacccccgacccccaccccgaac
    ggnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnaccctaaccctaaaaccctaaccctagcc
    ctagccctagccctagccctaacccctaacccctaaccctaagccgaagc
    DATA

    my $result;

    open my $inFile, '<', \$data or die "Unable to open input file: $!\n";
    open my $outFile, '>', \$result or die "Unable to create output file: $!\n";

    my $currStartN;
    my $lastStartN;
    my $lastEndN;
    my $prevLineEnd = 0;

    while (<$inFile>) {
        my $runEnd;

        if (defined $currStartN) {
            # Looking for an end N/n
            next unless /([^nN\n])/;    # Continue if full line of N/n

            $runEnd = $-[0];

            # Insert ^ for first run
            if ($runEnd == 0) {
                # Need to insert ^ at end of previous line
                seek $outFile, $prevLineEnd, 0;
                print $outFile "^\n";
            } else {
                substr $_, $runEnd, 0, '^' unless defined $lastStartN;
            }
        } else {
            # Looking for a start N/n
            next unless /([nN]+)/;

            $runEnd = $+[0];
            $currStartN = tell ($inFile) - length ($_) + $-[0];
            s/([nN])/^$1/ unless defined $lastStartN;    # Insert ^ for first run
            next unless $runEnd < length ($_) - 2;       # Continue if full line of N/n
        }

        $lastStartN = $currStartN;
        $lastEndN = tell ($inFile) - length ($_) + $runEnd;
        $currStartN = undef;
    } continue {
        $prevLineEnd = tell $outFile;
        print $outFile $_;
        chomp;
        $prevLineEnd += length;
    }

    # Insert ^ around last run of N/n
    seek $outFile, $lastStartN + 2, 0;
    print $outFile '^';
    seek $inFile, $lastStartN, 0;
    read $inFile, my $nRun, ($lastEndN - $lastStartN - 1);
    print $outFile $nRun, '^';
    print $outFile $_ while <$inFile>;

    close $inFile;
    close $outFile;

    print $result;

    Prints:

    acacccacacacaccacacccacacaccacacccacacccacacaccaca
    ^nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn^
    cccacaccacacccacacaccacacaccacacccacacccacacacacca
    cacccacacaccacacccacacacaccctaaccctaacccctaaccccta
    accctaacccnnnnnnnnnnnnnnnnnnnnnnnnnnnccctaaccctaac
    ccctaaccctaaccctaaccgtaaccctaaccctttaccctaacccgaac
    ccctaacnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnggggg
    gaccctgaccgtgaccctgaccctaacccgaacccgaacccgaaccccga
    accccgaaccccgaaccccaaccccaaccccaaccccaaccctaacccct
    caccctcaccctcgacccccgacccccgacccccgacccccaccccgaac
    gg^nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    nnnnnnnnnnnnnnnnnnnn^naccctaaccctaaaaccctaaccctagcc
    ctagccctagccctagccctaacccctaacccctaaccctaagccgaagc
