in reply to Code efficiency / algorithm

This is a known problem in the literature. It's called a stabbing query into line segments. The entries in datafile1 can be considered as line segments, while the entries in datafile2 are the points you query with. So, if you are looking for an efficient algorithm, you'd need to build a segment tree (or an interval tree) from the data in datafile1, and perform queries with the data in datafile2.

IIRC, you can build a segment tree in O(n log₂ n) time, while queries take O(log₂ n + k) time. So, if you have N entries in datafile1 and M entries in datafile2, your total running time would be O((N + M) log₂ N + K), where K is the number of matches. It might be that it's actually a factor of O(log N) less. This surely beats the O(N * M) approach of trying to match everything with everything.
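If you don't want to implement a full segment tree, and you can afford to batch the queries, a sorted-events sweep gets the same O((N + M) log N + K) bound. A sketch in Python for illustration (the function and variable names are my own, not from any module):

```python
from collections import defaultdict

def stab_all(segments, points):
    """For each query point, report every segment [lo, hi] containing it.

    segments: list of (lo, hi) pairs, endpoints inclusive
    points:   list of query values
    Runs in O((N + M) log(N + M) + K) time, K = matches reported.
    """
    events = []
    for i, (lo, hi) in enumerate(segments):
        events.append((lo, 0, i))   # segment opens
        events.append((hi, 2, i))   # segment closes
    for j, p in enumerate(points):
        events.append((p, 1, j))    # query point sorts between open and close
    events.sort()

    active = set()                  # segments covering the current sweep value
    hits = defaultdict(list)        # point index -> matching segment indices
    for value, kind, idx in events:
        if kind == 0:
            active.add(idx)
        elif kind == 2:
            active.discard(idx)
        else:
            hits[idx].extend(sorted(active))
    return hits
```

Unlike a segment tree, this is an offline algorithm: all the query points have to be known up front before the sweep runs.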

Unfortunately, I don't think there's a module available that implements segment trees.

Abigail

Re: Re: Code efficiency / algorithm
by BrowserUk (Patriarch) on Jan 14, 2003 at 11:15 UTC

    Hope you don't mind me asking this, but could you give an assessment of how the algorithm in Re: Code efficiency / algorithm rates big-O wise? I never really mastered that notation.


    Examine what is said, not who speaks.

    The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

      Well, for each entry in datafile2, you consider all the elements in datafile1. So your algorithm takes at least Omega(N * M) time.

      Abigail

Re: Re: Code efficiency / algorithm
by dws (Chancellor) on Jan 14, 2003 at 21:53 UTC
    IIRC, you can build a segment tree in O(n log₂ n) time, while queries take O(log₂ n + k) time.

    I wonder if there isn't a way to get O(1) query time by exploiting the nature of the keys and building a different data structure.

    The keys here are dates. There aren't that many distinct dates in a year. By converting all dates to an offset from some given starting date, you can build an array indexable by date offset. Each element of this array, if present, holds an array of references to records that are "effective" on that date.

      Great idea, dws. One modification, though. Since the date information is sparse, a hash of the distinct dates (or date offsets) would be better than an array.
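      The hash-of-date-offsets idea can be sketched like this (Python for illustration; EPOCH and the record layout are assumptions, not anything from the original post):

```python
from datetime import date
from collections import defaultdict

EPOCH = date(2003, 1, 1)  # arbitrary reference date (an assumption)

def day_offset(d):
    return (d - EPOCH).days

def build_buckets(records):
    """records: list of (start_date, end_date, payload), dates inclusive.

    Returns a dict mapping day offset -> payloads effective on that day.
    Since the dict is sparse, only dates that actually occur take space.
    """
    buckets = defaultdict(list)
    for start, end, payload in records:
        for off in range(day_offset(start), day_offset(end) + 1):
            buckets[off].append(payload)
    return buckets

def effective_on(buckets, d):
    """O(1) average-case lookup of everything effective on date d."""
    return buckets.get(day_offset(d), [])
```

Note the tradeoff: build time is proportional to the total number of covered days, which is fine when ranges span days or months, but not if they span centuries.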
Re: Re: Code efficiency / algorithm
by dave8775 (Novice) on Jan 14, 2003 at 16:06 UTC
    :( Does this mean my problem cannot be fixed with code? I need to actually implement a different algorithm? I will go check up on that error and see what I can do to mitigate it. If I can't can you offer any tips on how to convert to the tree algorithm? Is it an easy conversion/coding or am I looking at something pretty sophisticated here? Thanks for the help!!! David =8-)
      Of course your problem can be fixed with code. It's not that you are asking for something that's not solvable on a Turing machine. As for whether you need to implement a different algorithm, that's up to you. I described an efficient algorithm. If you find an implementation of a naive algorithm (matching everything from one file with everything from the other) fast enough, then stick with it. But I got the impression you had some very large datasets. And very large times very large is big.

      As for the algorithm being sophisticated, that depends on what you find sophisticated. I don't find the algorithm itself sophisticated, but then, I did research in this field for a few years. And I wouldn't do the implementation in an evening either. But I'm a slow coder.

      Abigail

      It looks like Set::Infinite and Array::IntSpan could be helpful starting points.

      If all the companies had mutually-exclusive ranges, then you could simply place them in an Array::IntSpan and look them up. But since your ranges intersect, you will need to intersect them with each other (using the operations of Set::Infinite like intersects and contains). That's basically what a tree algorithm would do.
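      For the mutually-exclusive case, the lookup can also be done with a plain binary search over the sorted range starts, which is conceptually what Array::IntSpan's lookup amounts to. A sketch (in Python, with made-up helper names):

```python
import bisect

def make_span(ranges):
    """ranges: non-overlapping (lo, hi, value) triples, hi inclusive.

    Returns a structure prepared for O(log n) lookups.
    """
    ranges = sorted(ranges)
    starts = [lo for lo, _, _ in ranges]
    return starts, ranges

def lookup(span, x):
    """Return the value of the range containing x, or None."""
    starts, ranges = span
    i = bisect.bisect_right(starts, x) - 1  # last range starting at or before x
    if i >= 0:
        lo, hi, value = ranges[i]
        if lo <= x <= hi:
            return value
    return None
```

With overlapping ranges this single-answer lookup breaks down, which is where the intersection step comes in.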