in reply to Comparing images to find similar images in a database

As I understand it, the author (merlin) defines a match as follows:

    my $FUZZ = 5; # permitted average deviation in the vector elements
    ...
    BUCKET: for my $bucket (@buckets) {
      my $error = 0;
      INDEX: for my $index (0..$#vector) {
        $error += abs($bucket->[0][$index] - $vector[$index]);
        next BUCKET if $error > $FUZZ * @vector;
      }
      ...

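IMHO the above set of matches is a subset of all matches found by comparing only the sums of the vector elements, because the absolute difference of two sums can never exceed the sum of the absolute element differences: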

    use List::Util qw(sum);

    # sum of the pattern's elements, and the window allowed around it
    my $pattern_sum = sum(@pattern_vector);
    my $upper_bound = $pattern_sum + $FUZZ * @pattern_vector;
    my $lower_bound = $pattern_sum - $FUZZ * @pattern_vector;
    BUCKET: for my $bucket (@buckets) {
      my $bucket_sum = sum(@{ $bucket->[0] });
      next BUCKET if ($bucket_sum > $upper_bound || $bucket_sum < $lower_bound);
      # found, do something
    }
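To make the subset claim concrete, here is a minimal sketch with made-up data. The helper names strict_match and sum_match and the sample vectors are my own assumptions for illustration; they are not taken from the original program.

```perl
use strict;
use warnings;
use List::Util qw(sum);

my $FUZZ = 5;    # permitted average deviation per element

# Original test: sum of absolute element differences within $FUZZ * length.
sub strict_match {
    my ($pattern, $candidate) = @_;
    my $error = 0;
    for my $i (0 .. $#$pattern) {
        $error += abs($candidate->[$i] - $pattern->[$i]);
        return 0 if $error > $FUZZ * @$pattern;
    }
    return 1;
}

# Cheaper test: only the two element sums are compared.
sub sum_match {
    my ($pattern, $candidate) = @_;
    return abs(sum(@$candidate) - sum(@$pattern)) <= $FUZZ * @$pattern;
}

# Hypothetical sample data.
my @pattern    = (10, 20, 30, 40);
my @candidates = (
    [12, 18, 33, 41],    # close in every element
    [40, 30, 20, 10],    # same sum, but a very different vector
    [90, 90, 90, 90],    # far away in both tests
);

for my $candidate (@candidates) {
    printf "%-16s strict=%d sum=%d\n",
        join(',', @$candidate),
        strict_match(\@pattern, $candidate) ? 1 : 0,
        sum_match(\@pattern, $candidate)    ? 1 : 0;
}
```

The second candidate passes the sum test but fails the strict one, which is exactly the kind of false positive the sum comparison lets through. Everything that passes the strict test also passes the sum test.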

How much the candidate set grows depends on how the sums are distributed, but it should roughly double. For example, with $FUZZ=5 and a vector of 48 elements, each an 8-bit integer, the number of possible different sums of the vector values is 48*255+1=12_241. The original method allows at most 48*5=240 as the sum of the absolute differences, so the set of all possible sums is reduced by a factor of about 12_241/240=51. With an interval of +/- 5 per element around the pattern sum, the window is 48*5*2=480 wide and the reduction is only about 25. That would mean roughly 1_000 candidate images out of a total of 25_000 images.
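The arithmetic can be checked with a few lines of Perl. This is only a back-of-the-envelope sketch: it assumes the sums are spread evenly over the whole range, which real image data will not do exactly, and the variable names are mine.

```perl
use strict;
use warnings;

my $FUZZ     = 5;      # permitted average deviation per element
my $elements = 48;     # vector length
my $max_byte = 255;    # largest 8-bit value

my $possible_sums = $elements * $max_byte + 1;    # sums 0 .. 12_240
my $strict_budget = $elements * $FUZZ;            # 240
my $sum_window    = 2 * $elements * $FUZZ;        # 480

printf "possible sums:    %d\n",   $possible_sums;
printf "strict reduction: %.1f\n", $possible_sums / $strict_budget;
printf "window reduction: %.1f\n", $possible_sums / $sum_window;

# Expected number of candidates among 25_000 images, under the
# even-distribution assumption stated above.
my $expected = 25_000 * $sum_window / $possible_sums;
printf "candidates:       %.0f\n", $expected;
```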

But if we calculate the sum of the vector, we can store it as an integer field in the database and use SQL comparisons.
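A rough sketch of what such a query could look like follows. The table name images and the columns id and vector_sum are hypothetical, as is the helper sum_range_query; the code only builds the SQL and its bind values, so it runs without a database.

```perl
use strict;
use warnings;

# Build a range query on a precomputed sum column.
# "images", "id" and "vector_sum" are hypothetical names.
sub sum_range_query {
    my ($pattern_sum, $length, $fuzz) = @_;
    my $slack = $fuzz * $length;    # allowed deviation of the sum
    my $sql   = 'SELECT id FROM images WHERE vector_sum BETWEEN ? AND ?';
    return ($sql, $pattern_sum - $slack, $pattern_sum + $slack);
}

# Example: pattern sum 6_000, 48 elements, $FUZZ = 5.
my ($sql, @bind) = sum_range_query(6_000, 48, 5);
print "$sql\n";
print "bind values: @bind\n";
```

With DBI, the statement and bind values could then be passed to something like $dbh->selectcol_arrayref($sql, undef, @bind). An index on the sum column should keep the range scan cheap.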

The query result could still be refined using the original method, or something better such as cosine similarity, which should be fast enough for ~1_000 vectors.
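For the refinement step, a minimal cosine similarity sketch might look like this. The function name cosine_similarity and the sample vectors are assumptions for illustration, and the vectors are assumed to be non-empty and of equal length.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Cosine similarity of two equal-length, non-empty numeric vectors
# (array refs). Returns 0 if either vector is all zeros.
sub cosine_similarity {
    my ($x, $y) = @_;
    my $dot    = sum(map { $x->[$_] * $y->[$_] } 0 .. $#$x);
    my $norm_x = sqrt(sum(map { $_ ** 2 } @$x));
    my $norm_y = sqrt(sum(map { $_ ** 2 } @$y));
    return 0 if $norm_x == 0 || $norm_y == 0;
    return $dot / ($norm_x * $norm_y);
}

# Hypothetical pattern and prefiltered candidates.
my @pattern    = (10, 20, 30, 40);
my @candidates = ([40, 30, 20, 10], [12, 18, 33, 41]);

# Rank the candidates, most similar first.
my @ranked = sort { $b->[0] <=> $a->[0] }
             map  { [ cosine_similarity(\@pattern, $_), $_ ] } @candidates;

printf "%.4f  %s\n", $_->[0], join(',', @{ $_->[1] }) for @ranked;
```

Note that cosine similarity ignores overall brightness (a vector and its scaled copy score 1), so whether it is really "better" than the absolute-difference test depends on what should count as similar.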