Re: indexing segments
by Abigail-II (Bishop) on Oct 10, 2003 at 12:48 UTC
|
Look up "segment trees" or "interval trees", two very common
data structures in the Computational Geometry and Algorithms/Data Structures fields.
I doubt there's much code on CPAN for this, but neither of
them is hard to implement. I've implemented them myself, but that
was long before I knew Perl.
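For anyone who wants to see what one of these looks like: a minimal centered-interval-tree sketch in Perl. The interval triples and the median-of-endpoints split are illustrative choices, not the only way to build one, and this is untested against the original poster's data:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a centered interval tree from [start, end, id] triples.
sub build_tree {
    my @ivals = @_;
    return undef unless @ivals;
    # Pick the median of all endpoints as this node's center.
    my @pts = sort { $a <=> $b } map { ($_->[0], $_->[1]) } @ivals;
    my $center = $pts[ int(@pts / 2) ];
    my (@left, @right, @here);
    for my $iv (@ivals) {
        if    ($iv->[1] < $center) { push @left,  $iv }  # entirely left of center
        elsif ($iv->[0] > $center) { push @right, $iv }  # entirely right of center
        else                       { push @here,  $iv }  # straddles the center
    }
    return {
        center => $center,
        here   => \@here,
        left   => build_tree(@left),
        right  => build_tree(@right),
    };
}

# Return all intervals overlapping the query range [$lo, $hi].
sub query_tree {
    my ($node, $lo, $hi) = @_;
    return () unless $node;
    my @hits = grep { $_->[0] <= $hi && $_->[1] >= $lo } @{ $node->{here} };
    push @hits, query_tree($node->{left},  $lo, $hi) if $lo < $node->{center};
    push @hits, query_tree($node->{right}, $lo, $hi) if $hi > $node->{center};
    return @hits;
}

my $tree  = build_tree([100, 500, 'a'], [450, 900, 'b'], [2000, 2100, 'c']);
my @found = query_tree($tree, 400, 600);
print join(',', sort map { $_->[2] } @found), "\n";  # a,b
```

Queries only descend into subtrees the range can actually reach, so lookups stay logarithmic when the intervals are reasonably spread out.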
Abigail | [reply] |
|
Thanks! Exactly what I was looking for, just a case of not knowing what it's called - "interval trees" did the trick.
| [reply] |
Re: indexing segments
by EvdB (Deacon) on Oct 10, 2003 at 12:45 UTC
|
create table elements (
position int8,
start_value int8,
end_value int8
);
Then, after filling the table, this query would give you the desired result:
select position from elements where end_value >= 50000 and start_value <= 1000000;
This effectively gives the tree hassle to the database.
UPDATE: Changed the SQL query to reflect the original query in the question (and fixed a typo)
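Filled and queried from Perl, that might look like the following. This is a sketch assuming DBD::SQLite and some made-up sample rows; any DBD driver works the same way through DBI:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# In-memory SQLite database for illustration.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

$dbh->do('create table elements (position int8, start_value int8, end_value int8)');

# Fill with a few sample rows: (array position, segment start, segment end).
my $ins = $dbh->prepare('insert into elements values (?, ?, ?)');
$ins->execute(@$_)
    for [1, 10_000, 10_500], [2, 60_000, 60_900], [3, 2_000_000, 2_000_400];

# Overlap test: segment ends at/after the range start
# and starts at/before the range end.
my $rows = $dbh->selectcol_arrayref(
    'select position from elements where end_value >= ? and start_value <= ?',
    undef, 50_000, 1_000_000);

print "@$rows\n";  # 2
```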
--tidiness is the memory loss of environmental mnemonics
| [reply] [d/l] [select] |
|
This effectively gives the tree hassle to the database.
Well, it's more like saying to the database "please perform a
linear search for me". Kind of overkill to use a database
for that; you might as well use a grep statement in Perl.
And no, creating indices on start_value and end_value isn't
going to help you in this case: the query may still lead to a table
scan (and hence a linear search), even if it ends up reporting just a single
match.
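The grep statement Abigail mentions could look like this. A minimal sketch; the segment hashrefs are made up for illustration, and the bounds are the ones from the thread:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative segments, each with an id and a start/end value.
my @segments = (
    { id => 1, start => 10_000,    end => 10_500    },
    { id => 2, start => 60_000,    end => 60_900    },
    { id => 3, start => 2_000_000, end => 2_000_400 },
);

# Linear scan: keep every segment overlapping [$lo, $hi].
my ($lo, $hi) = (50_000, 1_000_000);
my @hits = grep { $_->{end} >= $lo && $_->{start} <= $hi } @segments;
print join(',', map { $_->{id} } @hits), "\n";  # 2
```

Same O(n) work as the table scan, without the round trip to a database.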
Abigail
| [reply] |
|
I don't think it is a straight linear scan - there is possible confusion over what the data is:
I have a sizeable array of segments (> 100K), each element basically has a length of about 100-1000 and a start/end value anywhere between 1 and a few hundred million.
If it is just a list of single values then 'a length of about 100-1000' and 'a start/end value anywhere between 1 and a few hundred million' are mutually exclusive.
I may be completely wrong, but it would appear that each array element is more involved than just a single value, yet can be summarised with a start and end value. This makes the database more suitable.
There are also the storage considerations mentioned below.
That said, if it is just a list of values you are correct: grep or similar would be the way forward.
--tidiness is the memory loss of environmental mnemonics
| [reply] [d/l] |
Re: indexing segments
by rinceWind (Monsignor) on Oct 10, 2003 at 12:50 UTC
|
I think that your main issue here is not retrieval but storage. You say that you have a sizeable array of segments (> 100K). Are these all held in memory? If they are, it's a simple matter of using a slice, as in @foo[50000..100000].
Assuming that your data is not in memory, it must be on disk somewhere, in a flat file or possibly a database. In the latter case, your retrieval then becomes a matter of SQL, as in:
select * from foo where ID between 50000 and 1000000
Hope this helps
-- I'm Not Just Another Perl Hacker
| [reply] [d/l] |
|
Are these all held in memory? If they are, it's a simple matter of using a slice, as in @foo[50000..100000].
I think the way I described the problem was confusing. Yes, the array is held in memory, but the array index of each element is not related to the segment/interval values it holds. Each element can be represented as something along the lines of: {id => 50, start => 75000, end => 75500}
| [reply] [d/l] |
Re: indexing segments
by fglock (Vicar) on Oct 10, 2003 at 13:17 UTC
|
There is some sample code you could reuse in
DateTime::TimeZone
(see the _spans_binary_search method).
The Set::Infinite module covers this
problem, but it does not try to be "efficient".
Anyway, 100k is a lot of data, and you'd
better not keep it all in RAM. If you follow Abigail's
directions you'll find a lot of papers on Google
about how to do this using a database.
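The binary-search idea behind _spans_binary_search can be sketched like this. Not the DateTime::TimeZone code itself; this sketch assumes spans sorted by start and non-overlapping, and looks up the span containing a single point:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Spans sorted by start, assumed non-overlapping: [start, end, payload].
my @spans = ( [0, 99, 'A'], [100, 499, 'B'], [500, 999, 'C'] );

# Classic binary search for the span containing $point: O(log n).
sub find_span {
    my ($spans, $point) = @_;
    my ($lo, $hi) = (0, $#$spans);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        my $s   = $spans->[$mid];
        if    ($point < $s->[0]) { $hi = $mid - 1 }  # look left
        elsif ($point > $s->[1]) { $lo = $mid + 1 }  # look right
        else                     { return $s->[2] }  # inside this span
    }
    return undef;  # point falls in a gap between spans
}

print find_span(\@spans, 250), "\n";  # B
```

For range queries rather than point lookups you'd binary-search for each endpoint and take everything in between, which is essentially what the interval-tree suggestion above generalises.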
| [reply] [d/l] |
Re: indexing segments
by hardburn (Abbot) on Oct 10, 2003 at 13:50 UTC
|
You'll probably be better off with some of the data structures suggested here, but this is the solution I've thought of:
Assume that your elements are in the array @segments. Sort the data numerically (sort { $a <=> $b } @segments;) (I'm not sure exactly what algorithm perl uses for sort, but I'm sure it's fairly efficient). You can now use a regular slice to get your data (my @range = @segments[50000 .. 100000];) and then scan through @range to lop off any elements that are too big or small. A simple grep will work, but if you want to be really efficient, you can process the data in two foreach loops and stop when you've hit your desired range, like this:
my ($start, $end);
# scan forward for the first element in range
foreach my $i (0 .. $#range) {
    if ($range[$i] >= 50000) {
        $start = $i;
        last;
    }
}
# scan backward for the last element in range
# (note: "$#range .. 0" is an empty list in Perl, so reverse is needed)
foreach my $i (reverse 0 .. $#range) {
    if ($range[$i] <= 100000) {
        $end = $i;
        last;
    }
}
my @wanted_data = @range[$start .. $end];
You could use these two loops without the initial sort and slice, but doing the sort and slice could significantly reduce the amount of linear scanning done.
---- I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
-- Schemer
Note: All code is untested, unless otherwise stated
| [reply] [d/l] [select] |