in reply to replace fist and last occurrences of N

What does "large file" mean in your context? Of what file sizes are you speaking?

I think this could be a solution for files smaller than 100 MB. I haven't tested it with large files, though...

#!/usr/bin/perl
use strict;
use warnings;

{
    local $/;
    my $data = <DATA>;

    $data =~ s/((?:n+\n?)+n+)/replace($1)/gme;

    print $data, "\n";

    sub replace {
        my $s = shift;
        substr( $s, 0, 1, '^' );
        substr( $s, -1, 1, '^' );
        return $s;
    }
}
__DATA__
acacccacacacaccacacccacacaccacacccacacccacacaccaca
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
cccacaccacacccacacaccacacaccacacccacacccacacacacca
cacccacacaccacacccacacacaccctaaccctaacccctaaccccta
accctaacccnnnnnnnnnnnnnnnnnnnnnnnnnnnccctaaccctaac
ccctaaccctaaccctaaccgtaaccctaaccctttaccctaacccgaac
ccctaacnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnggggg
gaccctgaccgtgaccctgaccctaacccgaacccgaacccgaaccccga
accccgaaccccgaaccccaaccccaaccccaaccccaaccctaacccct
caccctcaccctcgacccccgacccccgacccccgacccccaccccgaac
ggnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnaccctaaccctaaaaccctaaccctagcc
ctagccctagccctagccctaacccctaacccctaaccctaagccgaagc

Re^2: replace fist and last occurrences of N
by ini2005 (Novice) on Jul 12, 2008 at 13:30 UTC
    Thanks. The files are up to 1 GB...

      Ok, what about newlines? Are there newlines in your datafiles? Or are they just one long string consisting of character class [a-zA-Z]?

      update:
      I know that for DNA information the character class can be much smaller ;o))

        Yes, there are newlines.
        The files look exactly like your example (DATA).
        These are DNA files of whole genomes, and they are quite large...

        update:
        Also, it might be a capital N, not just n.
Re^2: replace fist and last occurrences of N
by linuxer (Curate) on Jul 13, 2008 at 20:17 UTC

    massa's post made me think about my script. I wonder why I got stuck on the idea of letting an extra subroutine do the replacement... and why I forgot everything about character classes...

    Here's an updated version of my script without an extra subroutine. All work is done by the regex. And the DNA data is now read from a file (so if your system has enough memory this might hopefully work for you).

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $file = shift @ARGV;

    die "no dna file specified!\n" if !defined $file;

    open my $fh, '<', $file or die "$file: $!\n";
    my $data = do { local $/; <$fh> };
    close $fh;

    $data =~ s/n([nN\n]+)n/^$1^/gm;

    print $data, "\n";

    __END__
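
Since the files are up to 1 GB and a run may start or end with a capital N, here is a rough, only lightly tested sketch of a variant that reads line by line instead of slurping the whole file. The names mark_runs and mark_runs_stream are made up for this example. It assumes Unix line endings, treats a run as two or more n/N characters (possibly broken by newlines), and leaves a single isolated n untouched. Only the current line plus any unfinished run is held in memory, so a very long run of N is still buffered until it ends.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Replace the first and last n/N of every run with '^'.
# A run may span newlines and needs at least two n/N characters.
sub mark_runs {
    my ($text) = @_;
    $text =~ s/[nN]([nN\n]*)[nN]/^$1^/g;
    return $text;
}

# Hypothetical streaming wrapper: reads $in line by line and writes to $out.
sub mark_runs_stream {
    my ( $in, $out ) = @_;
    my $buf = '';
    while ( my $line = <$in> ) {
        $buf .= $line;

        # Everything up to the last character that is not n, N or newline
        # holds only finished runs. The tail may continue on the next line,
        # so it stays in the buffer.
        if ( $buf =~ /\A(.*[^nN\n])([nN\n]*)\z/s ) {
            my ( $done, $tail ) = ( $1, $2 );
            print {$out} mark_runs($done);
            $buf = $tail;
        }
    }

    # Whatever is left at end of file is a run that ends with the file.
    print {$out} mark_runs($buf);
    return;
}

# Small demo on in-memory data (made-up sample, not real genome data).
my $sample = "acgtnnnn\nnnNNacgt\nNNacgt\n";
open my $in, '<', \$sample or die "cannot open in-memory input: $!\n";
mark_runs_stream( $in, \*STDOUT );
close $in;
```

For a real file you would open it with open my $in, '<', $file and pass that handle instead of the in-memory one. Whether this is actually faster than slurping on your machine is something you would have to measure.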