    while (<DATA>) {                        # one pass over the input per line...
        for my $seq (keys %seqs) {          # ...times one pass per pattern
            $seqs{$seq}->{'count'}++ for m/$seqs{$seq}->{'re'}/g;
        }
    }
What if the file is one 100MB line? (I'm not a biologist or anything, but I am under the impression that strands of DNA are very long.) If the data you are searching through is 100 megabytes, you've just gone through all 100 megabytes N times, where N is the number of patterns you are searching for. Additionally, optimizing your regexes buys you nothing if each pattern is only matched once against one long string of input.
OK, maybe I'm wrong about the number of lines. He said it was a "very large file" but didn't specify much more than that. So, is your method better if the file consists of many shorter lines? Not really. Compiling the regex patterns ahead of time buys you a little, but consider 100 lines, each 1MB long: you still search each of those lines N times. In other words, you again search through all 100MB N times.
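(For what it's worth, compiling ahead of time just means building each pattern once with qr// rather than re-interpolating the string on every match. A minimal sketch, reusing the %seqs layout from the code quoted above:)

    # Compile each pattern string once up front; matches against
    # $seqs{$seq}->{'re'} then reuse the compiled regex.
    $seqs{$_}->{'re'} = qr/$seqs{$_}->{'re'}/ for keys %seqs;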
I don't claim that the solution I posted above at Re: Quickest method for matching is the best possible, but I suspect it would be noticeably more efficient than your method with real input.
Of course, there are many optimizations that could improve my code too. I could read a larger buffer at once, reducing the number of reads. I could stop looking for a pattern once it has already been found twice. I could refrain from searching across newline boundaries if I expected my DNA data to be composed of many lines. (Notice that newlines don't break the algorithm; they just create a bit of useless work.) Or I could code it up in C because, in this case, Perl is really only buying me a little convenience.
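To make those concrete, here's a rough sketch of the buffered version with the early exits. This is not the code I actually posted, and the particulars are made up for illustration (the file name dna.txt, the literal example patterns, the 1MB buffer):

    use strict;
    use warnings;

    # Hypothetical setup: a few literal patterns, counts starting at zero.
    my %seqs = map { $_ => { re => qr/\Q$_\E/, count => 0 } }
               qw( GATTACA TTAGGG );

    my $maxlen  = 7;                # length of the longest pattern
    my $bufsize = 1024 * 1024;      # read 1MB at a time instead of line by line
    my $overlap = $maxlen - 1;      # tail kept so a match can span two reads

    open my $fh, '<', 'dna.txt' or die "open: $!";

    my $chunk = '';
    while (1) {
        my $n = read $fh, my $buf, $bufsize;
        die "read: $!" unless defined $n;
        my $eof = $n == 0;
        $chunk .= $buf unless $eof;

        # On EOF scan everything that's left; otherwise hold back the
        # last $overlap characters so a match straddling two reads is
        # neither missed nor counted twice.
        my $boundary = $eof ? length $chunk : length($chunk) - $overlap;

        for my $rec (values %seqs) {
            next if $rec->{count} >= 2;         # already found twice; skip it
            while ($chunk =~ /$rec->{re}/g) {
                last if $-[0] >= $boundary;     # starts in the tail; next read gets it
                last if ++$rec->{count} >= 2;   # found twice; stop looking
            }
        }

        last if $eof or !grep { $_->{count} < 2 } values %seqs;
        $chunk = substr $chunk, $boundary;      # keep just the overlapping tail
    }
    close $fh;

The held-back tail is what lets a pattern straddle a read boundary without being missed; the rest is just fewer reads and earlier exits.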
-sauoq "My two cents aren't worth a dime.";
In reply to Re: Re: Quickest method for matching by sauoq
in thread Quickest method for matching by dr_jgbn