glwtta has asked for the wisdom of the Perl Monks concerning the following question:

Ok, a bit of a CS question here (I've been scratching my head for a while):

I have a sizeable array of segments (> 100K), each element basically has a length of about 100-1000 and a start/end value anywhere between 1 and a few hundred million.

I would like to be able to quickly retrieve a list of segments that fall within a specified range. so $some_tree_like_structre->get_range(50000, 1000000); would produce a list of elements which have end position >= 50000 and start position <= 1000000.

I've generally confused myself trying to google for this last night - (B|R|KD|BSP|H|P)[+-*] trees all sound fun, but I am not sure what would help me here.

A shove in the right direction would be greatly appreciated here (I am not looking for code, just need a place to start).

Thanks

Replies are listed 'Best First'.
Re: indexing segments
by Abigail-II (Bishop) on Oct 10, 2003 at 12:48 UTC
    Look up "segment trees" or "interval trees", two very common datastructures in the Computation Geometry and Algorithms/Datastructures fields.

    I doubt there's much code on CPAN for this, but neither of them are hard to implement. I've programmed them out, but that was long before I knew Perl.

    Abigail

      Thanks! Exactly what I was looking for, just a case of not knowing what it's called - "interval trees" did the trick.
Re: indexing segments
by EvdB (Deacon) on Oct 10, 2003 at 12:45 UTC
    Not really an answer to your question but a solution:

    You could create a database table such as:

    create table elements ( position int8, start_value int8, end_value int8 );

    Then, after filling the table, this query would give you the desired result:

    select position from element where end_value >= 50000 and start_value +<= 1000000;

    This effectively gives the tree hassle to the database.

    UPDATE: Changed the SQL query to reflect the original query in the question (and a typo)

    --tidiness is the memory loss of environmental mnemonics

      This effectively give the tree hassle to the database.

      Well, it's more saying to the database "please perform a linear search for me". Kind of an overkill to use a database for that, you might as well use a grep statement in Perl. And no, creating indices on start_value and end_value isn't going to help you in this case - it may still lead to a table scan (and hence a linear search), and just report a single match.

      Abigail

        I don't think it is a straight linear scan - there is possible confusion over what the data is:
        I have a sizeable array of segments (> 100K), each element basically has a length of about 100-1000 and a start/end value anywhere between 1 and a few hundred million.

        If it is just a list of single values then 'a length of about 100-1000' and 'a start/end value anywhere between 1 and a few hundred million' are mutually exclusive.

        I may be completely wrong but it would appear that each array element is more involved than just a single value but can be summarised with a start and end value. This make the database more suitable.

        There are also the below mentioned storage considerations.

        That said if it is just a list of values you are correct - grep or similar would be the way forward.

        --tidiness is the memory loss of environmental mnemonics

      Hm, I'm pretty sure this would do a seq scan on the table.

      In any case, that doesn't help me; my problem is that a linear search in memory would be too slow, throwing a database in there is definitely not going to help.

Re: indexing segments
by rinceWind (Monsignor) on Oct 10, 2003 at 12:50 UTC
    I think that your main issue here is not retrieval but storage. You say that you have a sizeable array of segments (> 100K). Are these all held in memory? If they are, it's a simple matter of using a slice, as in @foo[50000..100000].

    Assuming that your data is not in memory, it must be on disk somewhere, in a flat file, or possibly a database. n the latter case, your retrieval then becomes a matter of SQL, as in:

    select * from foo where ID between 50000 and 1000000
    Hope this helps

    --
    I'm Not Just Another Perl Hacker
      Are these all held in memory? If they are, it's a simple matter of using a slice, as in @foo50000..100000.

      I think the way I described the problem was confusing. Yes, the array is held in memory, but the array index for each element is not related to the segment/interval values it holds.

      Each element can be represented as something along the lines of: {id => 50, start => 75000, end => 75500}

Re: indexing segments
by fglock (Vicar) on Oct 10, 2003 at 13:17 UTC

    There is some sample code you could reuse in DateTime::TimeZone (see the _spans_binary_search method)

    The Set::Infinite module covers this problem, but it does not try to be "efficient".

    Anyway, 100k is a lot of data, and you'd better not keep it in RAM. If you follow Abigail directions you'll find a lot of papers in google about how to do this using a database.

      Anyway, 100k is a lot of data, and you'd better not keep it in RAM.

      The bare minimum for these is an int for the id, and two ints for the coordinates. Of course with perl there's quite a bit of overhead, but it should still be manageable on 6 Gigs of RAM.

Re: indexing segments
by hardburn (Abbot) on Oct 10, 2003 at 13:50 UTC

    You'll probably be better off with some of the datastructures suggested here, but this is the solution I've thought of:

    Assume that your elements are in the array @segements. Sort the data by length (sort { length $a <=> length $b } @segements;) (I'm not sure exactly what algorithm perl uses for sort, but I'm sure it's fairly efficent). You can now use a regular slice to get your data (my @range = @segements[50000 .. 100000];) and then scan through @range to lop off any elements that are too big or small. A simple grep will work, but if you want to be really efficent, you can process the data in two foreach loops and stop when you've hit your desired range, like this:

    my ($start, $end); foreach my $i (0 .. $#range) { if($range[$i] >= 50000) { $start = $i; last; } } foreach my $i ($#range .. 0) { if($range[$i] <= 100000) { $end = $i; last; } } my @wanted_data = @range[$start .. $end];

    You could use these two loops without the initial sort and splice, but doing the sort and splice could significantly reduce the ammount of linear scanning done.

    ----
    I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
    -- Schemer

    Note: All code is untested, unless otherwise stated