in reply to Re^2: Optimizing a double loop
in thread Optimizing a double loop
Let me describe the complete scenario and try to make things clearer.
I start by generating 1000 sets of ranges. Generating the sets requires parsing some other data, but it is done only once; from then on the sets are fixed and I do not change them.
I chose to store each set of ranges as a hash whose keys are range start points and whose values are arrays of the lengths of all ranges starting at that point. For example, the set of ranges ((100,200),(50,70),(50,100)) is stored as: {100=>[100], 50=>[20,50]}
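For concreteness, a minimal sketch of building that layout (variable names are mine; the range list is the example above):

```perl
use strict;
use warnings;

# Ranges as [start, end] pairs; length = end - start.
my @ranges = ( [ 100, 200 ], [ 50, 70 ], [ 50, 100 ] );

my %set;
for my $r (@ranges) {
    my ( $start, $end ) = @$r;
    push @{ $set{$start} }, $end - $start;
}
# %set is now ( 100 => [100], 50 => [20, 50] )
```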
Each of these sets is stored to disk. I can then generate a "counts array" by retrieving the required set, converting it into an inversion list (as moritz so kindly suggested), and iterating over the whole range once. This works great. I can also do this for each of the 1000 sets, aggregate the results per position, and then derive the mean and standard deviation of the counts at every position in the whole range.
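One way to realize that single pass, sketched here as a +1/-1 event sweep rather than necessarily the exact inversion-list code in use; ranges are treated as half-open ([start, start+len)), and `$max` is an invented extent for the whole range:

```perl
my %set = ( 100 => [100], 50 => [ 20, 50 ] );

# Expand the hash into open/close events.
my %delta;
while ( my ( $start, $lengths ) = each %set ) {
    for my $len (@$lengths) {
        $delta{$start}          += 1;    # a range opens here
        $delta{ $start + $len } -= 1;    # ... and closes here
    }
}

# Sweep once, accumulating coverage into the counts array.
my @counts;
my $cover = 0;
my $max   = 250;                         # assumed extent of the whole range
for my $pos ( 0 .. $max - 1 ) {
    $cover += $delta{$pos} || 0;
    push @counts, $cover;                # $counts[$pos] = coverage at $pos
}
```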
Now, there is another kind of query I need to answer. Given some condition on ranges, e.g. "start point < x && end point > y", I would like to generate the "counts array" after removing all ranges that satisfy the condition. So I load the original sets, remove all ranges that match, and continue through the pipeline. Obviously, I do not store the filtered sets; the filtering is done only for the purpose of the query.
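A filter step of this kind might look like the following (the sub name and `$x`, `$y` are mine; the condition is the one from the example, and a fresh copy is returned so the original set stays intact):

```perl
# Drop every range with start < $x and end > $y.
sub filter_set {
    my ( $set, $x, $y ) = @_;
    my %kept;
    while ( my ( $start, $lengths ) = each %$set ) {
        my @keep = grep { !( $start < $x && $start + $_ > $y ) } @$lengths;
        $kept{$start} = \@keep if @keep;
    }
    return \%kept;
}

my $filtered = filter_set( { 100 => [100], 50 => [ 20, 50 ] }, 60, 90 );
# then build the counts array from $filtered as usual
```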
One last type of query is a bit different: instead of generating a "counts array", I would like to report how many ranges satisfy some condition (e.g. the one above). This is also why I think I have to keep the original sets and not just the inversion lists, which, by the way, I do not store but generate on the fly when "counts array" queries are run.
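The counting query then reduces to a tally over the same structure, with no counts array needed (again a sketch with the same hypothetical `$x`, `$y`):

```perl
# Count ranges with start < $x and end > $y.
# grep in scalar context returns the number of matches.
sub count_matching {
    my ( $set, $x, $y ) = @_;
    my $n = 0;
    while ( my ( $start, $lengths ) = each %$set ) {
        next unless $start < $x;                    # check the start first
        $n += grep { $start + $_ > $y } @$lengths;
    }
    return $n;
}

my $n = count_matching( { 100 => [100], 50 => [ 20, 50 ] }, 60, 90 );  # 1
```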
To conclude, I think I do need to save the original sets on disk, since generating them from scratch takes some time and requires loading other data, as I mentioned at the beginning. I think it makes sense to generate the sets once, then load them and filter/answer queries as needed.
But perhaps I can store the sets more efficiently? Maybe storing them all in one file is better?
I should note that some sets contain up to 80000 keys (start points) and take about 1.5MB when stored (using nstore), so putting 1000 sets like this together might not be such a good idea.
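One assumed alternative, untested against your data: if starts and lengths fit in 32 bits, each set can be packed into a flat string of integers instead of a nested Storable structure, which should be considerably smaller (roughly 8 bytes per range, so on the order of 640KB for 80,000 ranges before any compression):

```perl
use strict;
use warnings;

# Flatten a set to (start, length) pairs of 32-bit big-endian
# unsigned integers.
sub freeze_set {
    my ($set) = @_;
    my @flat;
    while ( my ( $start, $lengths ) = each %$set ) {
        push @flat, $start, $_ for @$lengths;
    }
    return pack 'N*', @flat;
}

# Rebuild the start-point => [lengths] hash from the packed form.
sub thaw_set {
    my ($frozen) = @_;
    my @flat = unpack 'N*', $frozen;
    my %set;
    for ( my $i = 0 ; $i < @flat ; $i += 2 ) {
        push @{ $set{ $flat[$i] } }, $flat[ $i + 1 ];
    }
    return \%set;
}
```

Writing each set as a length-prefixed record (`pack('N', length $blob) . $blob`) would also let all 1000 sets live in a single file that can be scanned or seeked set by set without thawing the rest.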
As always, your ideas are welcome. I will be more than happy to supply further clarifications.
Thank you all.