in reply to Similarity searching

Well, how much time do you have?

If you have 2e9 subject histograms and 5e5 query histograms and you compare every query against every subject, that is 1e15 comparisons (not counting the work of picking the best fit). If, for a moment, we assume that a day has 1e5 seconds (a bit less in reality), then finishing within one day takes 1e10 comparisons per second (10 GHz).
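For anyone who wants to replay that estimate with their own numbers, here is a small sketch in Python. The histogram counts come from the post; the function names and the 1e8 comparisons per second rate are made up for illustration and are not a measurement of any real machine.

```python
# Back-of-the-envelope cost of the all-against-all comparison.
subjects = 2_000_000_000          # 2e9 subject histograms (from the post)
queries = 500_000                 # 5e5 query histograms (from the post)
seconds_per_day = 24 * 60 * 60    # 86400, the exact value behind the 1e5 approximation

comparisons = subjects * queries  # one comparison per (subject, query) pair


def required_rate(days):
    """Comparisons per second needed to finish within `days` days."""
    return comparisons / (days * seconds_per_day)


def days_needed(rate):
    """Days needed at a sustained `rate` of comparisons per second."""
    return comparisons / (rate * seconds_per_day)


print(f"{comparisons:.1e} comparisons in total")
print(f"{required_rate(1):.2e} comparisons/s to finish in one day")
# Hypothetical throughput, chosen only to show how the estimate scales.
print(f"{days_needed(1e8):.0f} days at 1e8 comparisons/s")
```

With the exact 86400 seconds per day, the required rate comes out slightly above the rounded 1e10 figure.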

I would conclude that you need a lot of time to do this exercise. Unless you have some additional knowledge about the distribution of your data (and I admit that I do not really understand what you proposed as a potential algorithm).

If you encode your data as vectors of length 8000, where each element stores the value of one histogram bin, then the similarity between a subject histogram and a query histogram (the histogram intersection) is the sum of the element-wise minima. That multiplies the operation count by another factor of 8000, to roughly 8e18 element operations.
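To make the scoring concrete, here is a minimal Python sketch of the sum of element-wise minima and a naive scan for the best subject. The function names and the 4-bin toy data are assumptions for illustration; real histograms would have 8000 bins, and a pure-Python loop like this is far too slow for the volumes discussed here.

```python
def histogram_intersection(subject, query):
    """Sum of element-wise minima of two equal-length histograms."""
    if len(subject) != len(query):
        raise ValueError("histograms must have the same number of bins")
    return sum(min(s, q) for s, q in zip(subject, query))


def best_match(subjects, query):
    """Brute-force scan: (index, score) of the highest-scoring subject."""
    scores = [histogram_intersection(s, query) for s in subjects]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy data: 3 subjects with 4 bins each instead of 8000.
subjects = [[3, 0, 2, 5], [1, 4, 4, 1], [0, 0, 6, 1]]
query = [2, 1, 3, 1]

print(best_match(subjects, query))
```

Each call to `histogram_intersection` touches every bin once, which is where the extra factor of 8000 per comparison comes from.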

Re^2: Similarity searching
by baxy77bax (Deacon) on Jan 25, 2014 at 16:53 UTC
    "I would conclude that you need a lot of time to do this exercise."

    That is the problem.

    "Unless you have some additional knowledge about the distribution of your data"

    No, that is all I know about my data (more or less). I would need to do some basic statistics on the data, but that will take a while. This is a smaller part of a bigger project that relies on a lot of statistical information, and as the project nears its end, the statistical assumptions I made to increase speed are not holding up: the quality of my results has dropped drastically. So now I am trying to make brute force as fast as I can with what I know, what you guys can help with, and what other people suggest. Any input is therefore more than welcome. Thank you for that.

    "and I admit that I do not really understand what you proposed as a potential algorithm"

    It is supposed to be a modification of nearest neighbor search with artificially designed pivots, but now that I read it again it is a stupid suggestion that will never work. My consensus histogram is constructed from the maxima of the subject set, so the similarity score of each subject against it is just the sum of that subject's columns, which is what I had in the first place. As you can see, I am going in circles here.
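The circularity described in the reply above can be checked on toy data. The Python sketch below assumes the similarity is the sum of element-wise minima, as suggested earlier in the thread; the function names and the 4-bin histograms are hypothetical. Because the consensus holds the per-bin maximum, every minimum against it is just the subject's own bin value.

```python
def intersection(a, b):
    """Sum of element-wise minima of two equal-length histograms."""
    return sum(min(x, y) for x, y in zip(a, b))


def consensus_of_maxima(subjects):
    """Bin-wise maximum over all subject histograms."""
    return [max(column) for column in zip(*subjects)]


# Toy data: 3 subjects with 4 bins each.
subjects = [[3, 0, 2, 5], [1, 4, 4, 1], [0, 0, 6, 1]]
pivot = consensus_of_maxima(subjects)

for s in subjects:
    # min(s[i], pivot[i]) == s[i] for every bin, so the score is sum(s).
    print(s, intersection(s, pivot), sum(s))
```

So a pivot built this way carries no more information than each subject's own total, which matches the conclusion in the post.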

      "a modification of nearest neighbor search"

      You might want to have a look at Geo::DNA for some useful techniques to aid in proximity searching. It encodes data into a DNA-like string, which then allows you to use stem-based searching to find nearest neighbours.

      "so now i am trying to fast brute-force as much i know/you guys help/other people suggest."

      What sort of hardware do you have available to process this stuff?

