In the vein of Perl one-liners and command-line tools that each do a "simple" task and combine with other one-liners to do more interesting ones, I put the following together to do just this sort of thing. I have not run it on really big files, and I am sure it can be improved, but it works great for fasta input files of ~300K or so:

perl -nle 'chomp;tr/a-z/A-Z/;next if /[^ATCG]/;$l=length();$c=0;for $m (5..$l){print substr($_,$c++,5)}' \
    uniBNL_SSR_fasta.txt.input_library.txt |
  perl -nle '$c{$_}++ for split/\W/;}print map {"$_:$c{$_}\n"} sort{$c{$b}<=>$c{$a}}keys %c;{' -

output example:

...
AGAGA:5334
GAGAG:5124
TCTCT:1938
AAAAA:1908
ACACA:1884
TTTTT:1851
CTCTC:1798
...
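For anyone who finds the one-liner hard to read, here is a minimal sketch of the same idea as a small script. The sub name `count_kmers` and the inline sample lines are my own illustration, not part of the pipeline above; it assumes, like the one-liner, that any line containing a character other than A, T, C or G (which includes fasta header lines) is skipped whole.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the one-liner above): count every overlapping
# k-mer in the given lines, skipping any line that contains a character
# other than A, T, C or G after upper-casing.
sub count_kmers {
    my ($k, @lines) = @_;
    my %count;
    for my $line (@lines) {
        chomp(my $seq = uc $line);
        next if $seq =~ /[^ATCG]/;             # also skips ">header" lines
        for my $i (0 .. length($seq) - $k) {   # empty range if shorter than k
            $count{ substr($seq, $i, $k) }++;
        }
    }
    return \%count;
}

# Tiny inline sample; pass <> instead of @sample to read real input files.
my @sample = (">demo\n", "agagag\n", "AGAGN\n", "GAGAGA\n", "AGAGA\n");
my $count  = count_kmers(5, @sample);

# Highest count first, ties broken alphabetically.
for my $kmer (sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count) {
    print "$kmer:$count->{$kmer}\n";
}
```

On the sample lines this prints AGAGA:3 followed by GAGAG:2. Counting in a single process avoids writing every 5-mer through the pipe, which may matter on bigger inputs, though I have not measured it.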

More details: for genome-type data sets I split the bases up into N files and run individual jobs on N nodes of a cluster. If anyone is interested, here is what I use to split the large files into N files (thanks [id://thor]; this is a custom version of what we discussed):

#!/usr/bin/perl
#
# simple fasta input file cleaver
# usage: cleaver.pl --format OUTPUT_FILE_PREFIX --number X input_file
# where X is the number of files to split input_file into
#
# note: you can optionally pass more than one input file 
#       to (re)combine into X files

use strict;
use warnings;

use Getopt::Long;

my ($number,$format)=(2,"output");
GetOptions(
    "number=i" => \$number,
    "format=s" => \$format,
) or die "usage: $0 --format OUTPUT_FILE_PREFIX --number X input_file\n";

my @output_file;
foreach my $num(1..$number) {
    my $file ="$format.$num";
    open($output_file[$num-1],">",$file) or die "cannot open $file for writing: $!\n";
}

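# read one fasta record per iteration: records are separated by '>'
$/ = '>';

my $file_num = 0;
while(<>) {
    chomp;              # strip the trailing '>' separator
    next unless /\S/;   # skip the empty chunk before the first '>' of each input file
    print {$output_file[$file_num]} ">$_";
    # hand the next record to the next output file, round-robin
    $file_num = ($file_num + 1) % $number;
}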
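Once the N jobs have finished, the per-node results have to be added back together. A rough sketch of that step, assuming each job wrote one KMER:count pair per line as in the output example above; the sub name `merge_counts` and the two sample lists are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: sum "KMER:count" lines collected from several
# per-node result files into a single table.
sub merge_counts {
    my @lines = @_;
    my %total;
    for my $line (@lines) {
        # ignore blank or malformed lines rather than dying on them
        next unless $line =~ /^([ATCG]+):(\d+)\s*$/;
        $total{$1} += $2;
    }
    return \%total;
}

# Made-up sample results; pass <> instead to read the real result files.
my @node1 = ("AGAGA:5334\n", "GAGAG:5124\n");
my @node2 = ("AGAGA:10\n", "TCTCT:1938\n", "\n");
my $total = merge_counts(@node1, @node2);

# Highest total first, ties broken alphabetically.
print "$_:$total->{$_}\n"
    for sort { $total->{$b} <=> $total->{$a} || $a cmp $b } keys %$total;
```

On the sample lists this prints AGAGA:5344, GAGAG:5124 and TCTCT:1938, one per line. Since the cleaver keeps each fasta record whole, no 5-mer gets cut in half at a file boundary, so plain addition of the per-node counts should be enough.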

In reply to Re: A better (problem-specific) perl hash? by l3v3l
in thread A better (problem-specific) perl hash? by srdst13
