In the vein of Perl one-liners and command-line tools that each do a "simple" task and combine with other one-liners to do more interesting ones, I put the following together for just this sort of thing. I have not run it on really big files, and I am sure it can be improved, but it works great for FASTA input files of ~300K or so ...

perl -nle 'chomp;tr/a-z/A-Z/;next if /[^ATCG]/;$l=length();$c=0;for $m(5..$l){print substr($_,$c++,5)}' uniBNL_SSR_fasta.txt.input_library.txt |
  perl -nle '$c{$_}++ for split/\W/;}print map {"$_:$c{$_}\n"} sort{$c{$b}<=>$c{$a}}keys %c;{' -

output example:

...
AGAGA:5334
GAGAG:5124
TCTCT:1938
AAAAA:1908
ACACA:1884
TTTTT:1851
CTCTC:1798
...
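For anyone who finds the one-liner dense, here is a minimal sketch of the same counting step written out as a plain script. It is an unrolled reading of the pipeline, not the code that was actually run: the sub name count_kmers and the sample lines are made up for illustration, and it keeps the one-liner's behavior of skipping any line that contains something other than A, T, C or G (which also drops the FASTA header lines).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count every overlapping window of length $k in the given lines.
# Lines containing anything other than A, T, C or G (after uppercasing)
# are skipped entirely, which also drops FASTA header lines.
sub count_kmers {
    my ($k, @lines) = @_;
    my %count;
    for my $line (@lines) {
        chomp(my $seq = uc $line);
        next if $seq =~ /[^ATCG]/;
        # The last valid start offset is length - k, so no short tail pieces.
        for my $start (0 .. length($seq) - $k) {
            $count{ substr($seq, $start, $k) }++;
        }
    }
    return \%count;
}

# Hypothetical sample input, for illustration only.
my @sample = (">demo_seq\n", "agagagag\n", "ACGTN\n", "TCTCTC\n");
my $count = count_kmers(5, @sample);

# Most frequent first; ties broken alphabetically so the order is stable.
for my $kmer (sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count) {
    print "$kmer:$count->{$kmer}\n";
}
```

Note that, like the one-liner, this counts windows within each line only, so 5-mers that straddle a line break in a wrapped FASTA file are not seen.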

More details: for genome-sized data sets I split the sequences into N files and run individual jobs on N nodes of a cluster. If anyone is interested, here is what I use to split large files into N files (thanks [id://thor]; this is a custom version of what we discussed):

#!/usr/bin/perl
#
# simple fasta input file cleaver
# usage: cleaver.pl --format OUTPUT_FILE_PREFIX --number X input_file
# where X is the number of files to split input_file into
#
# note: you can optionally pass more than one input file 
#       to (re)combine into X files

use strict;
use warnings;

use Getopt::Long;

my ($number,$format)=(2,"output");
GetOptions(
    "number=i" => \$number,
    "format=s" => \$format,
);

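die "--number must be at least 1\n" if $number < 1;

# Open one output handle per chunk: PREFIX.1 .. PREFIX.N
my @output_file;
foreach my $num (1..$number) {
    my $file = "$format.$num";
    open($output_file[$num-1], ">", $file)
        or die "Cannot open $file for writing: $!\n";
}

# Read one FASTA record at a time by splitting the input on '>'.
# The separator is set before the first read so every record is handled alike.
$/ = '>';

my $file_num = 0;
while (my $record = <>) {
    chomp $record;                   # strip the trailing '>'
    next unless $record =~ /\S/;     # skip the empty chunk before each file's first '>'
    print {$output_file[$file_num]} ">$record";
    $file_num = ($file_num + 1) % $number;    # round-robin across the output files
}

close($_) or die "Error closing output file: $!\n" for @output_file;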
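Once the N jobs have finished, the per-node counts still need to be added back together. The post does not show that step, so the following is only a sketch under an assumption: that each node writes its results in the same KMER:count format as the output example above. The sub name merge_counts and the sample lines are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sum "KMER:count" lines coming from several per-node result files.
# Lines that do not match that format (blank lines, "...") are ignored.
sub merge_counts {
    my (@lines) = @_;
    my %total;
    for my $line (@lines) {
        chomp $line;
        my ($kmer, $n) = $line =~ /^([ATCG]+):(\d+)$/ or next;
        $total{$kmer} += $n;
    }
    return \%total;
}

# Hypothetical output from two nodes, for illustration only.
my @node1 = ("AGAGA:5\n", "GAGAG:3\n");
my @node2 = ("AGAGA:2\n", "TTTTT:4\n", "...\n");

my $total = merge_counts(@node1, @node2);

# Most frequent first; ties broken alphabetically so the order is stable.
for my $kmer (sort { $total->{$b} <=> $total->{$a} || $a cmp $b } keys %$total) {
    print "$kmer:$total->{$kmer}\n";
}
```

In real use the sample arrays would be replaced by reading the result files, for example by passing them on the command line and calling merge_counts(<>).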

In reply to Re: A better (problem-specific) perl hash? by l3v3l
in thread A better (problem-specific) perl hash? by srdst13
