in reply to Re: Search a hash for closest match
in thread Search a hash for closest match

Thanks Kenneth, I had something similar in mind. I think I'm also going to pre-sort the HoA then try to do a binary search to speed up the lookup. Sound like a reasonable approach?

Replies are listed 'Best First'.
Re^3: Search a hash for closest match
by kennethk (Abbot) on Nov 01, 2011 at 18:00 UTC

    Profile - like the great Knuth said, "Premature optimization is the root of all evil."

    Based upon the specs you give in Re^2: Search a hash for closest match, I would expect that linear search is not even remotely your optimal solution, and that using Perl's sort (Merge sort) followed by binary search is probably pretty derned good. However, better than fast code is functional code. I would highly recommend developing your script with a very simple bit of logic like this and small data sets, and improving the search algorithm later in the development cycle once it becomes necessary. YMMV.

      I think my response may have been a bit unclear/ambiguous:

      there are only 928K positions in total, so we can roughly estimate 40K elements per array (1M sites/25 chromosomes) and only 1 array will have to be searched based on the chromosome given.

        I followed what you said. Allow me to illustrate my point with an exchange I had with a colleague yesterday. He came into my office, asking me how to optimize sorting on a large set of objects. He thought perl was running pretty slowly, and this seemed to him to be the issue. We had a lovely discussion where I suggested he use the Orcish Maneuver as a solution for dropping the expense of his sort. I also suggested strongly that he should think about running his code through Devel::NYTProf to find out if the bottleneck was where he thought. And of course it wasn't, and his sort speed was completely irrelevant.

        Your proposed sorting algorithm is probably a good optimization. But make your script work first, then make it work quickly. The more complex the search algorithm, the easier it is to make a stupid mistake. Linear search may be slow, but it's damn hard to introduce an inobvious bug to it.