Beefy Boxes and Bandwidth Generously Provided by pair Networks
Do you know where your variables are?
 
PerlMonks  

Re^2: Range overlap (with a lot of ranges)

by Marshall (Canon)
on May 24, 2010 at 13:16 UTC ( [id://841386]=note: print w/replies, xml ) Need Help??


in reply to Re: Range overlap (with a lot of ranges)
in thread Range overlap (with a lot of ranges)

So, it appears to me that to represent the ranges to be tested against, you could make an array based structure (AoA) where first element is the start number of a range and 2nd element is size or end of that range.

To query this structure, do binary search based upon start of range (wind up at same point as if you were "inserting" something into this array before the range), then look forward in the array until it is no longer possible for the sought for range to be in the current range of possible ranges (your start is > start of current 1st element in this "row". So binary search and some further number of comparisons to decide that it is no longer possible to match. This assumes that the ranges can be sorted by "start of range", which it appears that they can be.

so you will have 1 million pointers to array (AoA), each "row" has 2 elements, start and end of that range.

  • Comment on Re^2: Range overlap (with a lot of ranges)

Replies are listed 'Best First'.
Re^3: Range overlap (with a lot of ranges)
by salva (Canon) on May 24, 2010 at 14:42 UTC
    you could make an array based structure (AoA) where first element is the start number of a range and 2nd element is size or end of that range

    That's not very inefficient as it would use a lot of memory unnecessarily. It's better to just use a pair of arrays (@start, @end):

    DB<2> use Devel::Size 'total_size' DB<3> total_size([map [rand, rand], 1..1000]) DB<4> use Devel::Size 'total_size' DB<5> x total_size([map [rand, rand], 1..1000]) 0 224176 DB<6> x total_size([[map rand, 1..1000], [map rand, 1..1000]]) 0 80464
      Well of course two @vars will take less memory space than a AoA because there is at least an additional array, the array of pointers to array in the AoA!

      So why does this make sense?

      One basic reason why data is collected into combined structures like an array of pointers to array, is so that it is clear what data "goes with" with other data. This is a huge thing in software engineering!

      Another reason is that when we say sort a complex data structure that invovles pointers, we just adjust the pointer values to a single array instead of doing copies of the actual data or having to modify multiple data structures. We can do manipulations on a single "thing", again a huge plus.

      I suggested an AoA, but actually the most analogous thing to a 'C' 2-D array uses even more memory, a Perl AoH, not a Perl AoA!

      I suggested the AoA instead of AoH because the Op is "trying to get the job done" and there isn't going to be a huge amount of code. I think this could be just one page of Perl code. I don't see a need to "super optimize" this.

      Normally the rough priority of coding, in my mind is: 1) clarity, 2) clarity, 3) speed, 4) memory usage. There of course can be reasons and very good reasons to violate this priority scheme. The speed and memory usage actually get better (become less important) according to Moore's Law. The clarity of the code is fixed when you write it. It won't get any more clear even if CPU MIP's and memory prices drop dramatically. If the code will run easily within memory budget (in this case yes) and the main objective is clear code that works (in this case yes), then optimizing the memory usage is not the right thing to do.

        My top priority when coding is that: the program should work and produce the desired result in the expected time frame within the available resources. Style is just second.

        The OP is talking about big datasets and concerned about performance, so in that case memory usage matters.

        Besides that, talking about clarity without showing actual code is pointless

Re^3: Range overlap (with a lot of ranges)
by BioLion (Curate) on May 24, 2010 at 13:30 UTC

    So far a simple lookup based on a sorted array (each element storing the id of the range that covers it) hasn't thrashed my computer, but I am still testing. If this fails I will try a binary search (what I meant by 'guess and home in'... no CS background...!). Thanks again.

    Just a something something...

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://841386]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others chilling in the Monastery: (2)
As of 2024-04-20 11:14 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found