Similarity searching

baxy77bax has asked for the wisdom of the Perl Monks concerning the following question:

This is more of a discussion question then a code seeker. First i'll explain the problem on a small case scenario and then provide additional real size input parameters.Problem:

I have a database of histograms like this:

hist:1

1  ###
3  ####
5  #######
17 #
21 #

hist:2

1  ###
5  ##
17 #####
20 #
22 ##

hist:3

3  #
10 ###
12 #

..
[download]

each histogram has up to 5 columns. each column corresponds to a value form 1 to 25. Moreover I'll refer to those histograms as subject hist. On the other hand i have a set of query histograms which is approx. 1000 times smaller then the subject set. What i need to do is: for each query histogram find the closest subject histogram. how do i define "closest"? well closest refers to the number of shared data points (# -> data point) so let say i have histogram x that looks like this:

hist:x

1  ##
4  ####
5  ####
12 #
17 ##########

(updated 12 comes before 17)
[download]

then
(x,1) => share 7 data points
(x,2) => share 9 data points
(x,3) => share 1 data point
from the above the closest to x is histogram 2. Since the number of subject histograms is to large that i cannot put it into memory (approx 2x10^9 histograms - each in real case scenario containing approx 300 columns and there is a maximum of 8000 possible column labels(values)) i thought i should create a consensus histogram from all subject ones. such that this histogram has all 25 columns (now i am again talking about my example) and each column having the maximum number of data points (this is computed from the subject set- if the max number of data points for column 1 is 100 then this how large column 1 in my consensus hist will be.) now i compare each subject hist to my consensus hist and get some similarity score. i keep those scores in memory (i believe i have enough memory for that) if two or more subject histograms have the same score then i recursively repeat the procedure. once completed, i take each histogram from my query set and compare it to consensus histogram and then calculate the similarity score. from triangular inequality it follows that the closest sequence in the subject (the one that has the score closest to my d(query,cons)) set will not have a score when compared directly to my query hist larger then: d(query,cons)+d(subj,cons). which will allow me to pick a part of the subject set more carefully and therefore find the most similar subject histogram to each of my query histograms. the real number of query histograms is between 100000 and 500000. so this is the best i could think of (read about). does someone know a better way. i believe i could improve my procedure if i could somehow cheat on similarity score computation, because now the problem is that the distance between query and subject is less or equal to dist.(query, consensus) + dist.(subject, consensus). and this is because scoring does not depend on the position.

query

1 ##          consensus
2 #     ->    1 ##
3 ##          2 ##
              3 ##
subject       4 ##
              5 ##
4 ##    ->    6 ##
5 ##
6 #
[download]

d(q,c) = 5
d(s,c) = 5

d(q,s) = 0

(d = similarity score)

also speed is of the highest importance!!!!
thank you

I cannot use any known database engine

Comment on Similarity searching Select or Download Code

Replies are listed 'Best First'.
Re: Similarity searching by hdb (Monsignor) on Jan 25, 2014 at 16:27 UTC
Well, how much time do you have? If you have 2e9 subject histograms and 5e5 query histograms and you would compare each against each other, then this is 1e15 comparisons (not mentioning finding the best fit). If, for a moment, we assume that a day has 1e5 seconds (a bit less in reality), then you need to do 1e10 comparisons per second (10 GHz). I would conclude that you need a lot of time to do this exercise. Unless you have some additional knowledge about the distribution of your data (and I admit that I do not really understand what you proposed as a potential algorithm). If you encode your data as vectors of length 8000 where each element stores the value of the histogram, then the distance between a subject histogram and a query histogram is the sum of the element-wise minima. This would add another multiple to the number of operations required...	[reply]
Re^2: Similarity searching by baxy77bax (Deacon) on Jan 25, 2014 at 16:53 UTC
"I would conclude that you need a lot of time to do this exercise." that is the problem. "Unless you have some additional knowledge about the distribution of your data" no that is all i know about my data (more-or-less) i vould need to do some basin statistics on the data but that will take a while. this is a smaller part of a bigger project that is based on a lot of statistical info and as the project is reaching its end statistical assumptions that i have made to increase speed are not encouraging. the quality of my results has drastically decreased so now i am trying to fast broutforce as much i know/you guys help/other people suggest. and therefore any input is more then welcomed. thank you for that "and I admit that I do not really understand what you proposed as a potential algorithm" it is suppose to be a modification of nearest neighbor search with artificially designed pivots but now as a read it again it is a stupid suggestion that will never work, since my consensus histogram is constructed of maximums from the subject set therefore similarity score for each subject is just the sum of columns which is what i had in the first place. as you can see i am going in circles here.	[reply]
Re^3: Similarity searching by tangent (Parson) on Jan 25, 2014 at 18:13 UTC
a modification of nearest neighbor search You might want to have a look at Geo::DNA for some useful techniques to aid in proximity searching. It encodes data into a DNA like string which then allows you to use stem-based searching to find nearest neighbours.	[reply]
Re^3: Similarity searching by BrowserUk (Patriarch) on Jan 25, 2014 at 17:24 UTC
so now i am trying to fast broutforce as much i know/you guys help/other people suggest. What sort of hardware do you have available to process this stuff? With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply]
Re: Similarity searching by BrowserUk (Patriarch) on Jan 25, 2014 at 15:30 UTC
17 comes before 12 or just a typo? `hist:x 1 ## 4 #### 5 #### 17 ########## 12 #` [download] I cannot use any known database engine Really? Why not? If true, the first thing I would do is write a short program to do a single pass over your 2e9 silly format files and output a single file formatted like so: `1: 1(3) 3(4) 5(7) 17(1) 21(1) 2: 1(2) 3(2) 17(5) 20(1) 22(2) 3: 3(1) 10(3) 12(1) ...` [download] Then I could dump all those silly format files. Then I'd look to reformat that single file into some kind of consistent record format, but then I read this bit of your description: each in real case scenario containing approx 300 columns and there is a maximum of 8000 possible column labels(values)) i thought i should create a consensus histogram from all subject ones. such that this histogram has all 25 columns (now i am again talking about my example) and each column having the maximum number of data points (this is computed from the subject set- if the max number of data points for column 1 is 100 then this how large column 1 in my consensus hist will be.) And got completely lost in the number of columns and ranges of values for each column... With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l] [select]
Re^2: Similarity searching by baxy77bax (Deacon) on Jan 25, 2014 at 16:05 UTC
"17 comes before 12 or just a typo? " typo! they are always ordered. "If true, the first thing I would do is write a short program to do a single pass over your 2e9 silly format files and output a single file formatted like so:" yes it is already in some "nicely" formated style but for the purposes of visualization i used histograms (i thought it vould be easier to understand the problem). "And got completely lost in the number of columns and ranges of values for each column..." so each histogram can have up to 300 columns. let say each column label is a number, then 300 does not imply that columns are labeled from 1 to 300 but the label range is 1-8000. from those 8000 labels each histogram has at most 300 different labels. (# of different ways you can pick 300 out of 8000 at most) the size of the column does not have a maximum value. i hope i clarified the problem a bit. cheers	[reply]
Re^3: Similarity searching by BrowserUk (Patriarch) on Jan 25, 2014 at 16:19 UTC
i hope i clarified the problem a bit. Lots :) But ... "the size of the column does not have a maximum value.". Even if there is no set upper limit, there is obviously a maximum value as found within your dataset. How big is that number? (The exact value is of little importance here, but the scale of the value will be very informative.) With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply]
Re^4: Similarity searching by baxy77bax (Deacon) on Jan 25, 2014 at 16:37 UTC
Re: Similarity searching by oiskuu (Hermit) on Jan 25, 2014 at 17:24 UTC
You quote specific parameters, leading one to infer that an actual real-world problem is involved. Can you tell us what it is? Shopping cart analytics? There was a thread not long ago, that might also lend some insight to your problem. Is it possible to analyze (a representative portion of) the subject data for clustering—perhaps the 8000-dimensional data can be flattened to much fewer? And where multiple clusters exist, one may divide the set.	[reply]