Educated_foo,

As you rightly pointed out, I am indeed working on biological data. It's a FASTQ-format file. The second line of every 4-line record (lines 1-4, 5-8, etc.) is a sequence; I am looking for certain patterns in it, and I have to remove them if found.

I did exactly what you describe here. However, there is one other optimization possible if you know that there are not going to be many matches. I have about 20 million reads (sequences), and I know for a fact that no more than 1 million of them can contain a match. So I decided to split my substring into 2 parts:

my $sub1 = "first half";

my $sub2 = "second half";

Now, with this condition,

if ( $seq !~ m/\Q$sub1\E/ && $seq !~ m/\Q$sub2\E/ ) {
    # Neither half occurs exactly, so each half has at least one
    # mismatch: at least 2 mismatches in total.
    # The substring you are looking for (with at most one mismatch)
    # is not here, so don't check for any patterns.
    next;
}

I guess this doesn't mean much if your data is small or if the substrings occur too often. But it does make the code about 8-10 times faster.
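For anyone who wants to try it, here is a minimal sketch of the pre-filter in front of a simple one-mismatch check. The pattern, the sample reads, and the sub names (might_match, has_one_mismatch_hit) are made up for illustration and are not my real code or data. It assumes "mismatch" means a single substitution (no insertions or deletions): one substitution can only fall in one half, so the other half must occur exactly.

```perl
use strict;
use warnings;

# Hypothetical pattern, split into two halves for the pre-filter.
my $pattern = 'ACGTACGTAC';
my $half    = int( length($pattern) / 2 );
my $sub1    = substr( $pattern, 0, $half );
my $sub2    = substr( $pattern, $half );

# Cheap test: a hit with at most one substitution must contain
# at least one of the two halves exactly.
sub might_match {
    my ($seq) = @_;
    return ( index( $seq, $sub1 ) >= 0 || index( $seq, $sub2 ) >= 0 ) ? 1 : 0;
}

# Slow test: slide the pattern along the read and count mismatches.
sub has_one_mismatch_hit {
    my ($seq) = @_;
    my $len = length $pattern;
    for my $pos ( 0 .. length($seq) - $len ) {
        # String XOR gives "\0" wherever the two characters agree.
        my $diff       = substr( $seq, $pos, $len ) ^ $pattern;
        my $mismatches = ( $diff =~ tr/\0//c );    # count non-NUL bytes
        return 1 if $mismatches <= 1;
    }
    return 0;
}

# Made-up reads: exact hit, one substitution, no hit.
my @reads = ( 'TTACGTACGTACTT', 'TTACGTACCTACTT', 'TTTTTTTTTTTTTT' );

for my $seq (@reads) {
    next unless might_match($seq);    # most reads should stop here
    print "hit: $seq\n" if has_one_mismatch_hit($seq);
}
```

I used index() instead of a regex for the halves since they are fixed strings; with plain DNA letters either should behave the same.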

Thanks for the tip about trying for more mismatches. I have always wanted to code suffix arrays. Now that I have a huge data set on my hands, it may be the right time to experiment.

Thanks once again for all your valuable opinions!

In reply to Re^2: generating hash patterns for searching with one mismatch by cedance
in thread generating hash patterns for searching with one mismatch by cedance
