comment on

Please don't reply via updates!

> sorry for deleting my ps comment , i thought it is completely irrelevant since i found "readmore" tags and since it wasn't related to the initially problem. ok these are some data from some simulation experiments. there is only one file.1 and only one file.2. sizes: there ate approximately 3 mil intervals specified in file.1 and about 50 distinct id's and each id in file.2 has approximately 5-7 * 10^9 lines. cheers

thats a lot

> ps: also the clustering of intervals cannot be assumed.

so you are not restricted to 1..21 !?!

atomic sub-ranges

Well one fast way to go is to prepare and count in "atomic intervals", that are intervals whose elements always appear together, i.e. are never divided by other intervals.

for instance:

a1    12    15
a2    12    17
a3    13    14
a4    14    19
[download]

so atomic intervals are [12],[13],[14],[15],[16..17],[18..19]

Since they are disjunct you don't need nested loops to count there appearance. After reading all data for an id, you can calculate the real intervals by summing the atomics up, e.g. $sum_a3 = sum @count[14,15,16,18] (intervals identified by starting point)

This looks silly for the sample data you provided because its already very dense, you're the one who can judge...

hulls / implications / lattices

Another - similar - approach is to precalculate "minimal hulls" and "implications".

 
12 => a1 => a2
13 => a3 => a1 
14 => [14] => a3,a4
...
[download]

the hull of a point is found by cutting all ranges including that point, and often a hull is already identical to a range. (e.g. 18 => a4 )

Following the implications results in a minimum number off add-operations needed!

Both algorithms might look overly complicated but they scale very well, no matter if ranges are dense or sparse.

Count sparse points and sum hash-slice

A simplified (but less efficient) version of above algorithms was already shown with counting points and summing over array-slices.

If intervals are sparse, i.e. they don't cover many potential points, then rather count in a hash and sum over a hash-slice:

$sum{a1} = sum @count{12..15}

Implementation

... is left for exercise! =)

Cheers Rolf

( addicted to the Perl Programming Language)

In reply to Re: Counter - number of tags per interval by LanX
in thread Counter - number of tags per interval by baxy77bax

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.