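In the vein of Perl one-liners and command-line tools that each do a "simple" task and combine with other one-liners to do more interesting ones, I put the following together to do just this sort of thing. I have not run it on really big files, and I am sure it can be improved, but it works great for fasta input files of ~300K or so. The first perl prints every overlapping 5-base window of each sequence line (header lines and any line with characters other than A, T, C, G are skipped); the second counts the windows and prints them sorted by count, highest first: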
perl -nle 'chomp;tr/a-z/A-Z/;next if /[^ATCG]/;$l=length();$c=0;
  for $m (5..$l){print substr($_,$c++,5)}' \
  uniBNL_SSR_fasta.txt.input_library.txt |
perl -nle '$c{$_}++ for split/\W/;}print map {"$_:$c{$_}\n"}
  sort{$c{$b}<=>$c{$a}}keys %c;{' -
output example:
...
AGAGA:5334
GAGAG:5124
TCTCT:1938
AAAAA:1908
ACACA:1884
TTTTT:1851
CTCTC:1798
...
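For anyone who finds the pipeline hard to read, the same sliding-window count can be sketched as a short script. This is only an illustration of the idea, not a drop-in replacement: the sub name `add_kmers`, the tie-break on the k-mer string, and the tiny inline sample are my own assumptions, and the sample lines are made up.

```perl
use strict;
use warnings;

# Add every overlapping $k-mer of one input line to %$count.
# (add_kmers is a hypothetical helper name, not part of the one-liner.)
sub add_kmers {
    my ($count, $line, $k) = @_;
    chomp $line;
    $line = uc $line;
    return if $line =~ /[^ACGT]/;    # skips headers and lines with N etc.
    # the range is empty when the line is shorter than $k
    $count->{ substr($line, $_, $k) }++ for 0 .. length($line) - $k;
    return;
}

my %count;
# Made-up sample lines; for real files loop with: while (my $line = <>)
add_kmers(\%count, $_, 5) for ">demo", "agagaga", "ACGTNACGT";

# Most frequent first, ties broken alphabetically, same KMER:count format.
for my $kmer (sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count) {
    print "$kmer:$count{$kmer}\n";
}
```

On the sample above this prints AGAGA:2 and GAGAG:1; the header line and the line containing N contribute nothing, just as in the one-liner.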
More details: for genome-type data sets I split the bases up into N files and run individual jobs on N nodes of a cluster. If anyone is interested, here is what I use to split a large file into N files (thanks [id://thor]; this is a custom version of what we discussed):
#!/usr/bin/perl
#
# simple fasta input file cleaver
# usage: cleaver.pl --format OUTPUT_FILE_PREFIX --number X input_file
# where X is the number of files to split input_file into
#
# note: you can optionally pass more than one input file
# to (re)combine into X files
use strict;
use warnings;
use Getopt::Long;
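my ($number,$format)=(2,"output");
GetOptions(
    "number=i" => \$number,
    "format=s" => \$format,
) or die "usage: $0 --format OUTPUT_FILE_PREFIX --number X input_file\n";
die "--number must be at least 1\n" if $number < 1;
# open one output handle per target file: PREFIX.1 .. PREFIX.X
my @output_file;
foreach my $num (1..$number) {
    my $file = "$format.$num";
    open($output_file[$num-1], ">", $file)
        or die "cannot open $file for writing: $!\n";
}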
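my $file_num = 0;
# read one whole fasta record (header + sequence) at a time;
# assumes '>' appears only at the start of header lines
$/ = '>';
while (<>) {
    chomp;               # drop the trailing '>' that starts the next record
    next unless /\S/;    # skip the empty chunk before a file's first '>'
    $_ .= "\n" unless /\n\z/;    # in case a file lacks a final newline
    # deal records out round-robin, restoring the leading '>'
    print {$output_file[$file_num]} ">$_";
    $file_num = ($file_num + 1) % $number;
}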
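Once each node has produced its own KMER:count list, the partial lists still have to be summed. Since the cleaver keeps whole records together and the windows never span lines, adding the per-file counts should give the same totals as one big run. A minimal sketch of that merge step, assuming each node writes lines in the KMER:count format shown earlier; `merge_counts` is a hypothetical name and the sample numbers are made up:

```perl
use strict;
use warnings;

# Sum "KMER:count" lines from any number of partial result lists.
# (merge_counts is a hypothetical helper, not part of the scripts above.)
sub merge_counts {
    my @lines = @_;
    my %total;
    for my $line (@lines) {
        # ignore anything that is not a KMER:count line (e.g. blank lines)
        my ($kmer, $n) = $line =~ /^([ACGT]+):(\d+)\s*$/ or next;
        $total{$kmer} += $n;
    }
    return \%total;
}

# Made-up partial results standing in for two nodes' output files;
# for real files use: my $total = merge_counts(<>);
my @node1 = ("AGAGA:3", "GAGAG:1");
my @node2 = ("AGAGA:2", "TTTTT:4");
my $total = merge_counts(@node1, @node2);

# Print the merged totals, highest count first.
for my $kmer (sort { $total->{$b} <=> $total->{$a} || $a cmp $b } keys %$total) {
    print "$kmer:$total->{$kmer}\n";
}
```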