indexing segments

glwtta has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: indexing segments by Abigail-II (Bishop) on Oct 10, 2003 at 12:48 UTC
Look up "segment trees" or "interval trees", two very common datastructures in the Computation Geometry and Algorithms/Datastructures fields. I doubt there's much code on CPAN for this, but neither of them are hard to implement. I've programmed them out, but that was long before I knew Perl. Abigail	[reply]
Re: Re: indexing segments by glwtta (Hermit) on Oct 10, 2003 at 13:40 UTC
Thanks! Exactly what I was looking for, just a case of not knowing what it's called - "interval trees" did the trick.	[reply]
Re: indexing segments by EvdB (Deacon) on Oct 10, 2003 at 12:45 UTC
Not really an answer to your question but a solution: You could create a database table such as: `create table elements ( position int8, start_value int8, end_value int8 );` [download] Then, after filling the table, this query would give you the desired result: `select position from element where end_value >= 50000 and start_value +<= 1000000;` [download] This effectively gives the tree hassle to the database. UPDATE: Changed the SQL query to reflect the original query in the question (and a typo) --`tidiness is the memory loss of environmental mnemonics`	[reply] [d/l] [select]
Re: indexing segments by Abigail-II (Bishop) on Oct 10, 2003 at 12:53 UTC
This effectively give the tree hassle to the database. Well, it's more saying to the database "please perform a linear search for me". Kind of an overkill to use a database for that, you might as well use a grep statement in Perl. And no, creating indices on start_value and end_value isn't going to help you in this case - it may still lead to a table scan (and hence a linear search), and just report a single match. Abigail	[reply]
Re: Re: indexing segments by EvdB (Deacon) on Oct 10, 2003 at 13:03 UTC
I don't think it is a straight linear scan - there is possible confusion over what the data is: I have a sizeable array of segments (> 100K), each element basically has a length of about 100-1000 and a start/end value anywhere between 1 and a few hundred million. If it is just a list of single values then 'a length of about 100-1000' and 'a start/end value anywhere between 1 and a few hundred million' are mutually exclusive. I may be completely wrong but it would appear that each array element is more involved than just a single value but can be summarised with a start and end value. This make the database more suitable. There are also the below mentioned storage considerations. That said if it is just a list of values you are correct - grep or similar would be the way forward. --`tidiness is the memory loss of environmental mnemonics`	[reply] [d/l]
Re: Re: Re: indexing segments by glwtta (Hermit) on Oct 10, 2003 at 13:35 UTC
Re: Re: Re: Re: indexing segments by EvdB (Deacon) on Oct 10, 2003 at 13:50 UTC
Re: Re: indexing segments by glwtta (Hermit) on Oct 10, 2003 at 13:21 UTC
Hm, I'm pretty sure this would do a seq scan on the table. In any case, that doesn't help me; my problem is that a linear search in memory would be too slow, throwing a database in there is definitely not going to help.	[reply]
Re: indexing segments by rinceWind (Monsignor) on Oct 10, 2003 at 12:50 UTC
I think that your main issue here is not retrieval but storage. You say that you have a sizeable array of segments (> 100K). Are these all held in memory? If they are, it's a simple matter of using a slice, as in @foo[50000..100000]. Assuming that your data is not in memory, it must be on disk somewhere, in a flat file, or possibly a database. n the latter case, your retrieval then becomes a matter of SQL, as in: `select * from foo where ID between 50000 and 1000000` [download] Hope this helps -- I'm Not Just Another Perl Hacker	[reply] [d/l]
Re: Re: indexing segments by glwtta (Hermit) on Oct 10, 2003 at 13:31 UTC
Are these all held in memory? If they are, it's a simple matter of using a slice, as in @foo50000..100000. I think the way I described the problem was confusing. Yes, the array is held in memory, but the array index for each element is not related to the segment/interval values it holds. Each element can be represented as something along the lines of: `{id => 50, start => 75000, end => 75500}`	[reply] [d/l]
Re: indexing segments by fglock (Vicar) on Oct 10, 2003 at 13:17 UTC
There is some sample code you could reuse in DateTime::TimeZone (see the `_spans_binary_search` method) The Set::Infinite module covers this problem, but it does not try to be "efficient". Anyway, 100k is a lot of data, and you'd better not keep it in RAM. If you follow Abigail directions you'll find a lot of papers in google about how to do this using a database.	[reply] [d/l]
Re: Re: indexing segments by glwtta (Hermit) on Oct 10, 2003 at 13:59 UTC
Anyway, 100k is a lot of data, and you'd better not keep it in RAM. The bare minimum for these is an int for the id, and two ints for the coordinates. Of course with perl there's quite a bit of overhead, but it should still be manageable on 6 Gigs of RAM.	[reply]
Re: indexing segments by hardburn (Abbot) on Oct 10, 2003 at 13:50 UTC
You'll probably be better off with some of the datastructures suggested here, but this is the solution I've thought of: Assume that your elements are in the array `@segements`. Sort the data by length (`sort { length $a <=> length $b } @segements;`) (I'm not sure exactly what algorithm perl uses for `sort`, but I'm sure it's fairly efficent). You can now use a regular slice to get your data (`my @range = @segements[50000 .. 100000];`) and then scan through `@range` to lop off any elements that are too big or small. A simple `grep` will work, but if you want to be really efficent, you can process the data in two `foreach` loops and stop when you've hit your desired range, like this: `my ($start, $end); foreach my $i (0 .. $#range) { if($range[$i] >= 50000) { $start = $i; last; } } foreach my $i ($#range .. 0) { if($range[$i] <= 100000) { $end = $i; last; } } my @wanted_data = @range[$start .. $end];` [download] You could use these two loops without the initial sort and splice, but doing the sort and splice could significantly reduce the ammount of linear scanning done. ---- I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident. -- Schemer Note: All code is untested, unless otherwise stated	[reply] [d/l] [select]