Re: replace fist and last occurrences of N
by jwkrahn (Abbot) on Jul 12, 2008 at 13:18 UTC
|
| [reply] [d/l] |
|
|
Hey,
thanks for the quick reply!
Where do I put this line? I mean, how do I use it?
Thanks
| [reply] |
|
|
shell$ perl -pi.bak -le 's/(?<=[^n])n|n(?=[^n])/^/ig' yourfile.txt
will do the substitution and leave a backup file named yourfile.txt.bak
UPDATE: it will not work, because your file has newlines (somehow I missed it). Try this instead:
perl -pi.bak -0777 -e 's/n([n\n]*)n/^$1^/ig' yourfile.txt
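To see what the slurp-mode substitution does before running it on a real file, here is a minimal sketch that wraps the same regex in a sub and tries it on two small strings. The name mark_runs and the sample strings are made up for illustration, and the sketch assumes the whole file fits in memory as one string.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the one-liner: apply the same
# substitution to a string that holds the whole file.
sub mark_runs {
    my ($dna) = @_;
    # First n, then any mix of n's and newlines, then another n.
    # Greedy matching with backtracking makes that final n the last one of the run.
    $dna =~ s/n([n\n]*)n/^$1^/ig;
    return $dna;
}

print mark_runs("acnnnnca\n");      # run inside one line
print mark_runs("ccnn\nnncc\n");    # run wrapped over two lines
```

Note that the pattern needs at least two n's, so a lone n is left untouched.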
| [reply] [d/l] [select] |
Re: replace fist and last occurrences of N
by linuxer (Curate) on Jul 12, 2008 at 13:23 UTC
|
What does "large file" mean in your context? What file sizes are we talking about?
I think this could be a solution for files smaller than 100 MB. I haven't tested it with large files, though.
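#!/usr/bin/perl
use strict;
use warnings;
# Slurp the whole input, so this only suits files that fit in memory
my $data = do { local $/; <DATA> };
# A run is two or more n's and may wrap across line breaks
$data =~ s/((?:n+\n?)+n+)/replace($1)/ge;
print $data, "\n";
# Replace the first and the last character of a run with '^'
sub replace {
    my $s = shift;
    substr( $s, 0, 1, '^' );
    substr( $s, -1, 1, '^' );
    return $s;
}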
__DATA__
acacccacacacaccacacccacacaccacacccacacccacacaccaca
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
cccacaccacacccacacaccacacaccacacccacacccacacacacca
cacccacacaccacacccacacacaccctaaccctaacccctaaccccta
accctaacccnnnnnnnnnnnnnnnnnnnnnnnnnnnccctaaccctaac
ccctaaccctaaccctaaccgtaaccctaaccctttaccctaacccgaac
ccctaacnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnggggg
gaccctgaccgtgaccctgaccctaacccgaacccgaacccgaaccccga
accccgaaccccgaaccccaaccccaaccccaaccccaaccctaacccct
caccctcaccctcgacccccgacccccgacccccgacccccaccccgaac
ggnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnaccctaaccctaaaaccctaaccctagcc
ctagccctagccctagccctaacccctaacccctaaccctaagccgaagc
| [reply] [d/l] |
|
|
Thanks,
The files are up to 1 GB...
| [reply] |
|
|
Ok, what about newlines? Are there newlines in your data files, or is each file just one long string of characters from the class [a-zA-Z]?
Update:
I know that for DNA information the character class can be much smaller ;o))
| [reply] |
|
|
|
|
massa's post made me think about my script.
I wonder why I got stuck on the idea of letting an extra subroutine do the replacement... and why I forgot everything about character classes...
Here's an updated version of my script without the extra subroutine. All the work is done by the regex, and the DNA data is now read from a file (so if your system has enough memory, this may work for you).
#!/usr/bin/perl
use strict;
use warnings;
my $file = shift @ARGV;
die "no dna file specified!\n" if !defined $file;
open my $fh, '<', $file or die "$file: $!\n";
my $data = do { local $/; <$fh> };
close $fh;
# Match n or N at both ends of a run; /m is not needed without ^ or $ anchors
$data =~ s/[nN]([nN\n]+)[nN]/^$1^/g;
print $data, "\n";
__END__
| [reply] [d/l] |
Re: replace fist and last occurrences of N
by jethro (Monsignor) on Jul 12, 2008 at 13:38 UTC
|
| [reply] |
|
|
No! All n sequences are longer than 30.
| [reply] |
Re: replace fist and last occurrences of N
by jethro (Monsignor) on Jul 12, 2008 at 14:43 UTC
|
Try this if your files are too big for your memory:
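#!/usr/bin/perl
use strict;
use warnings;
my $filename = shift;
die "Usage: $0 FILE\n" unless defined $filename;
open my $fh, '<', $filename or die "Can't open $filename: $!\n";
open my $ofh, '>', "$filename.new" or die "Can't open $filename.new: $!\n";
my $previousline;
my $prevEndsN = 0; # did the unmodified previous line end in n? (false at start of file)
while (<$fh>) {
    chomp;
    # Test the raw line first: an inserted ^ must not be mistaken for a non-n neighbour
    my $startsN = /^n/i;
    my $endsN = /n$/i;
    s/(?<=[^n])n|n(?=[^n])/^/ig; #borrowed from jwkrahn
    s/^n/^/i if $startsN and not $prevEndsN; # run starts at the start of this line
    if (defined $previousline) {
        # run ended at the end of the previous line
        $previousline =~ s/n$/^/i if $prevEndsN and not $startsN;
        print $ofh $previousline, "\n" or die "Write failed: $!\n";
    }
    ($previousline, $prevEndsN) = ($_, $endsN);
}
if (defined $previousline) {
    $previousline =~ s/n$/^/i if $prevEndsN; # run reaches the end of the file
    print $ofh $previousline, "\n" or die "Write failed: $!\n";
}
close $fh;
close $ofh or die "Can't close $filename.new: $!\n";
rename "$filename.new", $filename or die "Can't rename $filename.new: $!\n";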
This writes a temporary copy of the data file, so make sure there is enough free disk space, or better, check the return values of the print statements for errors.
| [reply] [d/l] |
Re: replace fist and last occurrences of N
by johngg (Canon) on Jul 12, 2008 at 23:40 UTC
|
This method reads the DNA file one line at a time in an infinite while loop, accumulating lines in a buffer until the buffer no longer ends with an 'n'. It then does a global replace of the first and last 'n' of each run, prints the modified buffer to the output file and clears it. When the script detects end-of-file, it modifies and prints whatever is left in the buffer before leaving the loop and closing the filehandles.
use strict;
use warnings;
open my $dnaFH, q{<}, \ <<'EOD' or die qq{open: HEREDOC: $!\n};
acacccacacacaccacacccacacaccacacccacacccacacaccaca
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
cccacaccacacccacacaccacacaccacacccacacccacacacacca
cacccacacaccacacccacacacaccctaaccctaacccctaaccccta
accctaacccnnnnnnnnnnnnnnnnnnnnnnnnnnnccctaaccctaac
ccctaaccctaaccctaaccgtaaccctaaccctttaccctaacccgaac
ccctaacnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnggggg
gaccctgaccgtgaccctgaccctaacccgaacccgaacccgaaccccga
accccgaaccccgaaccccaaccccaaccccaaccccaaccctaacccct
caccctcaccctcgacccccgacccccgacccccgacccccaccccgaac
ggnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnaccctaaccctaaaaccctaaccctagcc
ctagccctagccctagccctaacccctaacccctaaccctaagccgaagc
gaccctgannnnnnncccccgaccccnnnnnnngccctaaccctaaccct
gaccctgannnnnnnnnnnnnnctaacccgaacccgaacccgaaccnnnn
EOD
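my $rxReplaceN = qr
{(?xi)
(?<![nN\n])n        # n that starts a run: no n or newline before it
|
n(?![nN\n])         # n that ends a run in the middle of a line
|
n(?=\n[^nN])        # n that ends a run at end of line, next line starts with a non-n
|
n(?=\n\z)           # n that ends a run on the last line of the buffer
};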
my $outFile = q{spw697207.out};
open my $outFH, q{>}, $outFile
or die qq{open: > $outFile: $!\n};
my $acc = q{};
while ( 1 )
{
    emitAcc(), last if eof $dnaFH;
    $acc .= <$dnaFH>;
    emitAcc() unless $acc =~ m{[nN]$};
}
close $dnaFH or die qq{close: HEREDOC: $!\n};
close $outFH or die qq{close: > $outFile: $!\n};
sub emitAcc
{
    $acc =~ s{$rxReplaceN}{^}g;
    print $outFH $acc;
    $acc = q{};
}
The output.
acacccacacacaccacacccacacaccacacccacacccacacaccaca
^nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn^
cccacaccacacccacacaccacacaccacacccacacccacacacacca
cacccacacaccacacccacacacaccctaaccctaacccctaaccccta
accctaaccc^nnnnnnnnnnnnnnnnnnnnnnnnn^ccctaaccctaac
ccctaaccctaaccctaaccgtaaccctaaccctttaccctaacccgaac
ccctaac^nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn^ggggg
gaccctgaccgtgaccctgaccctaacccgaacccgaacccgaaccccga
accccgaaccccgaaccccaaccccaaccccaaccccaaccctaacccct
caccctcaccctcgacccccgacccccgacccccgacccccaccccgaac
gg^nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnn^accctaaccctaaaaccctaaccctagcc
ctagccctagccctagccctaacccctaacccctaaccctaagccgaagc
gaccctga^nnnnn^cccccgacccc^nnnnn^gccctaaccctaaccct
gaccctga^nnnnnnnnnnnn^ctaacccgaacccgaacccgaacc^nn^
I hope this is of interest.
Cheers,
JohnGG
| [reply] [d/l] [select] |
Re: replace fist and last occurrences of N
by swampyankee (Parson) on Jul 12, 2008 at 14:03 UTC
|
Well, if you've got enough memory you could read the entire file into memory and use a regex. Alternatively, you could read the file a character at a time (read the documentation for $/), and change the current character if it's /n/i and the previous one wasn't, or the previous character if it's /n/i and the current one isn't. I've got to run some errands, so I'll try to post some code later.
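As a rough illustration of the character-at-a-time idea, here is a minimal sketch that sets $/ to a reference to 1 so that each read returns a single byte. The sub name mark_runs_bytewise and the variables $held and $gap are made up for this example, and it assumes that a newline inside a run does not end the run, since the sample data wraps runs across lines.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): copy $in to $out one byte at a
# time, turning the first and last n/N of every run into '^'.
# Assumption: newlines inside a run do not end the run.
sub mark_runs_bytewise {
    my ( $in, $out ) = @_;
    local $/ = \1;    # a reference to an integer gives fixed-size records
    my $held;         # most recent n of the current run, not yet written
    my $gap = '';     # newlines seen after $held
    while ( defined( my $c = <$in> ) ) {
        if ( $c =~ /n/i ) {
            # The first n of a run becomes '^'; later ones push out the held n as-is.
            if ( defined $held ) {
                print {$out} $held, $gap;
                $held = $c;
            }
            else {
                $held = '^';
            }
            $gap = '';
        }
        elsif ( $c eq "\n" && defined $held ) {
            $gap .= $c;    # keep line breaks attached to the held n
        }
        else {
            # Any other byte ends the run, so the held n was the last one.
            if ( defined $held ) {
                print {$out} '^', $gap;
                undef $held;
                $gap = '';
            }
            print {$out} $c;
        }
    }
    print {$out} '^', $gap if defined $held;    # run reaching end of input
    return;
}
```

Memory use stays constant, but reading a 1 GB file one byte at a time through the readline operator is likely to be slow, so treat this as a sketch of the logic rather than a tuned solution.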
| [reply] |
Re: replace fist and last occurrences of N
by GrandFather (Saint) on Jul 13, 2008 at 00:04 UTC
|
Update: this is an answer to the wrong question! It solves the problem of quoting only the first and last n/N runs. The OP wants to replace the first and last N/n in each run with ^.
The following (rather fussy because of special cases) code remembers where the last N/n run started then rewrites the last part of the output file with the markers inserted.
use strict;
use warnings;
my $data = <<DATA;
acacccacacacaccacacccacacaccacacccacacccacacaccaca
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
cccacaccacacccacacaccacacaccacacccacacccacacacacca
cacccacacaccacacccacacacaccctaaccctaacccctaaccccta
accctaacccnnnnnnnnnnnnnnnnnnnnnnnnnnnccctaaccctaac
ccctaaccctaaccctaaccgtaaccctaaccctttaccctaacccgaac
ccctaacnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnggggg
gaccctgaccgtgaccctgaccctaacccgaacccgaacccgaaccccga
accccgaaccccgaaccccaaccccaaccccaaccccaaccctaacccct
caccctcaccctcgacccccgacccccgacccccgacccccaccccgaac
ggnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnaccctaaccctaaaaccctaaccctagcc
ctagccctagccctagccctaacccctaacccctaaccctaagccgaagc
DATA
my $result;
open my $inFile, '<', \$data or die "Unable to open input file: $!\n";
open my $outFile, '>', \$result or die "Unable to create output file: $!\n";
my $currStartN;
my $lastStartN;
my $lastEndN;
my $prevLineEnd = 0;
while (<$inFile>) {
    my $runEnd;
    if (defined $currStartN) {
        # Looking for an end N/n
        next unless /([^nN\n])/; # Continue if full line of N/n
        $runEnd = $-[0];
        # Insert ^ for first run
        if ($runEnd == 0) {
            # Need to insert ^ at end of previous line
            seek $outFile, $prevLineEnd, 0;
            print $outFile "^\n";
        } else {
            substr $_, $runEnd, 0, '^' unless defined $lastStartN;
        }
    } else {
        # Looking for a start N/n
        next unless /([nN]+)/;
        $runEnd = $+[0];
        $currStartN = tell ($inFile) - length ($_) + $-[0];
        s/([nN])/^$1/ unless defined $lastStartN; # Insert ^ for first run
        next unless $runEnd < length ($_) - 2; # Continue if full line of N/n
    }
    $lastStartN = $currStartN;
    $lastEndN = tell ($inFile) - length ($_) + $runEnd;
    $currStartN = undef;
} continue {
    $prevLineEnd = tell $outFile;
    print $outFile $_;
    chomp;
    $prevLineEnd += length;
}
# Insert ^ around last run of N/n
seek $outFile, $lastStartN + 2, 0;
print $outFile '^';
seek $inFile, $lastStartN, 0;
read $inFile, my $nRun, ($lastEndN - $lastStartN - 1);
print $outFile $nRun, '^';
print $outFile $_ while <$inFile>;
close $inFile;
close $outFile;
print $result;
Prints:
acacccacacacaccacacccacacaccacacccacacccacacaccaca
^nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn^
cccacaccacacccacacaccacacaccacacccacacccacacacacca
cacccacacaccacacccacacacaccctaaccctaacccctaaccccta
accctaacccnnnnnnnnnnnnnnnnnnnnnnnnnnnccctaaccctaac
ccctaaccctaaccctaaccgtaaccctaaccctttaccctaacccgaac
ccctaacnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnggggg
gaccctgaccgtgaccctgaccctaacccgaacccgaacccgaaccccga
accccgaaccccgaaccccaaccccaaccccaaccccaaccctaacccct
caccctcaccctcgacccccgacccccgacccccgacccccaccccgaac
gg^nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnn^naccctaaccctaaaaccctaaccctagcc
ctagccctagccctagccctaacccctaacccctaaccctaagccgaagc
Perl is environmentally friendly - it saves trees
| [reply] [d/l] [select] |