in reply to Re^2: Optimizing a double loop
in thread Optimizing a double loop
Let me describe the complete scenario and try to make things clearer.
I start by generating 1000 sets of ranges. Generating the sets requires parsing some other data, but it is done only once; from then on the sets are fixed and I do not change them.
I chose to store each set of ranges as a hash whose keys are range start points and whose values are arrays of the lengths of all ranges starting at that point. For example, the set of ranges ((100,200),(50,70),(50,100)) is stored as: {100=>[100], 50=>[20,50]}
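For concreteness, a minimal sketch of building that layout (variable names are mine; the range list is the example above):

```perl
use strict;
use warnings;

# Ranges as [start, end] pairs; length = end - start.
my @ranges = ( [ 100, 200 ], [ 50, 70 ], [ 50, 100 ] );

my %set;
for my $r (@ranges) {
    my ( $start, $end ) = @$r;
    push @{ $set{$start} }, $end - $start;
}
# %set is now ( 100 => [100], 50 => [20, 50] )
```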
Each of these sets is stored to disk. I can then generate a "counts array" by retrieving the required set, converting it into an inversion list (as moritz so kindly suggested), and iterating over the whole range once. This works great. I can also do this for each of the 1000 sets, aggregate the results per position, and then derive the mean and standard deviation of the counts at every position in the whole range.
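One way to realize that single pass, sketched here as a +1/-1 event sweep rather than necessarily the exact inversion-list code in use; ranges are treated as half-open ([start, start+len)), and `$max` is an invented extent for the whole range:

```perl
my %set = ( 100 => [100], 50 => [ 20, 50 ] );

# Expand the hash into open/close events.
my %delta;
while ( my ( $start, $lengths ) = each %set ) {
    for my $len (@$lengths) {
        $delta{$start}          += 1;    # a range opens here
        $delta{ $start + $len } -= 1;    # ... and closes here
    }
}

# Sweep once, accumulating coverage into the counts array.
my @counts;
my $cover = 0;
my $max   = 250;                         # assumed extent of the whole range
for my $pos ( 0 .. $max - 1 ) {
    $cover += $delta{$pos} || 0;
    push @counts, $cover;                # $counts[$pos] = coverage at $pos
}
```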
Now, there is another kind of query I need to answer. Given some condition on ranges, e.g. "start point < x && end point > y", I would like to generate the "counts array" after removing all ranges that satisfy the condition. So I load the original sets, remove all ranges that match, and continue through the pipeline. Obviously, I do not store the filtered sets; the filtering is done only for the purpose of the query.
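A filter step of this kind might look like the following (the sub name and `$x`, `$y` are mine; the condition is the one from the example, and a fresh copy is returned so the original set stays intact):

```perl
# Drop every range with start < $x and end > $y.
sub filter_set {
    my ( $set, $x, $y ) = @_;
    my %kept;
    while ( my ( $start, $lengths ) = each %$set ) {
        my @keep = grep { !( $start < $x && $start + $_ > $y ) } @$lengths;
        $kept{$start} = \@keep if @keep;
    }
    return \%kept;
}

my $filtered = filter_set( { 100 => [100], 50 => [ 20, 50 ] }, 60, 90 );
# then build the counts array from $filtered as usual
```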
One last type of query is a bit different: instead of generating a "counts array", I would like to report how many ranges satisfy some condition (e.g. the one above). This is also why I think I have to keep the original sets and not just the inversion lists, which, by the way, I do not store but generate on the fly when "counts array" queries are run.
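The counting query then reduces to a tally over the same structure, with no counts array needed (again a sketch with the same hypothetical `$x`, `$y`):

```perl
# Count ranges with start < $x and end > $y.
# grep in scalar context returns the number of matches.
sub count_matching {
    my ( $set, $x, $y ) = @_;
    my $n = 0;
    while ( my ( $start, $lengths ) = each %$set ) {
        next unless $start < $x;                    # check the start first
        $n += grep { $start + $_ > $y } @$lengths;
    }
    return $n;
}

my $n = count_matching( { 100 => [100], 50 => [ 20, 50 ] }, 60, 90 );  # 1
```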
To conclude, I think I do need to save the original sets on disk, since generating them from scratch takes some time and requires loading other data, as I mentioned at the beginning. I think it makes sense to generate the sets once, then load them and filter/answer queries as needed.
But perhaps I can store the sets more efficiently? Maybe storing them all in one file is better?
I should note that some sets contain up to 80000 keys (start points) and take about 1.5MB when stored (using nstore), so putting 1000 sets like this together might not be such a good idea.
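One assumed alternative, untested against your data: if starts and lengths fit in 32 bits, each set can be packed into a flat string of integers instead of a nested Storable structure, which should be considerably smaller (roughly 8 bytes per range, so on the order of 640KB for 80,000 ranges before any compression):

```perl
use strict;
use warnings;

# Flatten a set to (start, length) pairs of 32-bit big-endian
# unsigned integers.
sub freeze_set {
    my ($set) = @_;
    my @flat;
    while ( my ( $start, $lengths ) = each %$set ) {
        push @flat, $start, $_ for @$lengths;
    }
    return pack 'N*', @flat;
}

# Rebuild the start-point => [lengths] hash from the packed form.
sub thaw_set {
    my ($frozen) = @_;
    my @flat = unpack 'N*', $frozen;
    my %set;
    for ( my $i = 0 ; $i < @flat ; $i += 2 ) {
        push @{ $set{ $flat[$i] } }, $flat[ $i + 1 ];
    }
    return \%set;
}
```

Writing each set as a length-prefixed record (`pack('N', length $blob) . $blob`) would also let all 1000 sets live in a single file that can be scanned or seeked set by set without thawing the rest.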
As always, your ideas are welcome. I will be more than happy to supply further clarifications.
Thank you all.