comment on

Seems a regex is not the methode of choice here...

Provided my code has no bugs, see for yourself:

#!/usr/bin/perl
use strict;
use warnings;

use Benchmark;

use Data::Dumper;

my $DNAlength= 600000;
my $sequencelength= 3;
my $testfilename= 'DNA.tst';
my $bufsize= 1024;

# create testdata
open(my $testfile, ">$testfilename") or die "$testfilename: $!";
my $buffer = '';
for (my $i=$DNAlength; $i; --$i) {
    $buffer.= qw(A C G T)[rand 4];
    if (length($buffer) > $bufsize) {
        print $testfile $buffer;
        $buffer = '';
    }
}
print $testfile $buffer;
close $testfile;

a1: {
    # Start collecting data
    my %DNA_sequence_count;
    
    my $a1_0=new Benchmark;
    open($testfile, $testfilename) or die "$testfilename: $!";
    my $kept=0;
    while (1) {
        my $read_bytes = read($testfile, $buffer, $bufsize-$kept);
        last unless $read_bytes;
        my $end_buffer = $read_bytes-$sequencelength;
        for (my $i=0; $i <= $end_buffer; ++$i) {
            ++$DNA_sequence_count{substr($buffer, $i, $sequencelength)
+};
        }
        $kept= $sequencelength-1;
        $buffer= substr($buffer, $end_buffer+1, $kept);
    }
    close $testfile;
    my $a1_n=new Benchmark;
    
    print "the file code took:",timestr(timediff($a1_n, $a1_0)),"\n";
}

a2: {
    # Start collecting data
    my %DNA_sequence_count;
    
    my $a2_0=new Benchmark;
    open($testfile, $testfilename) or die "$testfilename: $!";
    my $allbuffered= <$testfile>;
    close $testfile;
    my $end_buffer= length($allbuffered)-$sequencelength+1;
    for (my $i=0; $i <= $end_buffer; ++$i) {
            ++$DNA_sequence_count{substr($allbuffered, $i, $sequencele
+ngth)};
    }
    my $a2_n=new Benchmark;

    print "the scalar code took:",timestr(timediff($a2_n, $a2_0)),"\n"
+;
}

a3: {
    # Start collecting data
    my %DNA_sequence_count;
    
    my $a3_0=new Benchmark;
    open($testfile, $testfilename) or die "$testfilename: $!";
    my $allbuffered= <$testfile>;
    close $testfile;
    ++$DNA_sequence_count{$1.$2} while $allbuffered=~ /(.)(?=(..))/g;
    my $a3_n=new Benchmark;

    print "the regex code took:",timestr(timediff($a3_n, $a3_0)),"\n";
}
[download]

Output: the file code took: 2 wallclock secs ( 0.96 usr + 0.04 sys = 1.00 CPU) the scalar code took: 3 wallclock secs ( 0.91 usr + 0.03 sys = 0.94 CPU) the regex code took: 6 wallclock secs ( 2.54 usr + 0.06 sys = 2.60 CPU)

Update: ikegami told me, my code is incomplete because: "...since it only lists groups of $sequencelength, not of $sequencelength or less."

that's true, but did it appear to you - and correct me if i'm wrong, that you can get the lower counts without counting each of the sequences!?
Simply loop once for each smaller length over all results and just count the last few sequences at the end which you didn't hit.

my (@keys)= keys %DNA_sequence_count;
foreach my $key (@keys) {
  for ($i=$sequencelength-1; $i; --$i) {
    $DNA_sequence_count{substr($key, 0, $i}+= $DNA_sequence_count{$key
+};
  }
}
# correction again thanks to ikegami!
for ($i=$sequencelength-1; $i; --$i) {
  # second correction thanks to sauoq
  for ($j=$i; $j<$sequencelength; ++$j) {
    ++$DNA_sequence_count{substr($allbuffered, -$j, $i)};
  }
}
[download]

Small example:

CATCAT
gives
CAT 2
ATC 1
TCA 1

Applying that algorithm you get
CA 2 (directly from CAT)
AT 2 (from ATC + AT at the end)
TC 1 (directly from TCA)
C 2
A 2 (from AT at the end)
T 2 (from TCA + last T)
[download]

Update: sauoq noticed there must be a bug in that correction algorithm and in fact, he's right. fixed now

s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
+.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e

In reply to Re: Question about speeding a regexp count by Skeeve
in thread Question about speeding a regexp count by Commander Salamander

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.