in reply to Re^4: Comparing sets of phrases stored in a database?
in thread Comparing sets of phrases stored in a database?
If I do it when I want a similarity I have to compare against every single set in my database every time, which seems a little suboptimal.
The beauty of using bitstrings is that the comparisons are very fast. For example, the following compares each of 15,000 bitstrings against the sample input in less that 1/10th of a second:
#! perl -slw use strict; use Data::Dump qw[ pp ]; $Data::Dump::WIDTH = 1000; use Time::HiRes qw[ time ]; sub comp { my( $a, $b ) = @_; my $total = unpack( '%32b*', $a | $b ); return ( unpack( '%32b*', $a & $b ) / $total ) - ( unpack( '%32b*', $a ^ $b ) / $total ); } our $N //= 20; my @sets = map pack( 'Q', int rand 2**64 ), 1 .. $N; our $S //= '0101010101010101010101010101010101010101010101010101010101 +010101'; $S = pack 'b*', $S; our $T //= 0.8; my $start = time; print 'Input: ', unpack 'b*', $S; for ( 0 .. $#sets ) { my $score = comp( $sets[ $_ ], $S ); if( $score > $T ) { printf "%8.6f: %s\n", $score, unpack 'b*', $sets[ $_ ]; } } printf "Comparing 1 against $N sets, took: %.6f seconds\n", time() - $ +start; __END__ C:\test>996530 -N=15000 -S=1111111111111111111111111111111100000000000 +000000000000000000000 -T=0.25 Input: 111111111111111111111111111111110000000000000000000000000000 +0000 0.280000: 111111111111111111111111111111111111111111111111100000000000 +0001 0.254902: 111111111111111111111111111111111111111111111111100000000010 +0001 0.254902: 111111111111111111111111111111111111111111111111100000000000 +1001 0.254902: 111111111111111111111111111111111111111111111111101000000000 +0001 0.254902: 111111111111111111111111111111111111111111111111100000000000 +1001 0.254902: 111111111111111111111111111111111111111111111111100000000100 +0001 Comparing 1 against 15000 sets, took: 0.069085 seconds
The bitstrings used can present a corpus of up to 64 phrases -- just for convenience -- but even if you multiply the size of the corpus by 10:
#! perl -slw use strict; use Data::Dump qw[ pp ]; $Data::Dump::WIDTH = 1000; use Time::HiRes qw[ time ]; sub comp { my( $a, $b ) = @_; my $total = unpack( '%32b*', $a | $b ); return ( unpack( '%32b*', $a & $b ) / $total ) - ( unpack( '%32b*', $a ^ $b ) / $total ); } our $N //= 20; my @sets = map{ pack( 'Q', int rand 2**64 ) x 10 } 1 .. $N; our $S //= '0101010101010101010101010101010101010101010101010101010101 +010101'; $S = pack( 'b*', $S ) x 10; our $T //= 0.8; my $start = time; print 'input: ', unpack 'b*', $S; for ( 0 .. $#sets ) { my $score = comp( $sets[ $_ ], $S ); if( $score > $T ) { printf "%8.6f: %s\n", $score, unpack 'b*', $sets[ $_ ]; } } printf "Comparing 1 against $N sets, took: %.6f seconds\n", time() - $ +start; __END__ C:\test>996530 -N=15000 -T=0.1 input: 010101010101010101010101010101010101010101010101010101010101 +01010101010101010101 010101010101010101010101010101010101010101010101010101010101 +01010101010101010101 010101010101010101010101010101010101010101010101010101010101 +01010101010101010101 010101010101010101010101010101010101010101010101010101010101 +01010101010101010101 010101010101010101010101010101010101010101010101010101010101 +01010101010101010101 010101010101010101010101010101010101010101010101010101010101 +01010101010101010101 010101010101010101010101010101010101010101010101010101010101 +01010101010101010101 010101010101010101010101010101010101010101010101010101010101 +01010101010101010101 0.103448: 111111111111111111111111111111111111111111111111101010111010 +10111111111111111111 111111111111111111111111111111111010101110101011111111111111 +11111111111111111111 111111111111111110101011101010111111111111111111111111111111 +11111111111111111111 101010111010101111111111111111111111111111111111111111111111 +11111010101110101011 111111111111111111111111111111111111111111111111101010111010 +10111111111111111111 111111111111111111111111111111111010101110101011111111111111 +11111111111111111111 111111111111111110101011101010111111111111111111111111111111 +11111111111111111111 101010111010101111111111111111111111111111111111111111111111 +11111010101110101011 0.103448: 111111111111111111111111111111111111111111111111110111010101 +01011111111111111111 111111111111111111111111111111111101110101010101111111111111 +11111111111111111111 111111111111111111011101010101011111111111111111111111111111 +11111111111111111111 110111010101010111111111111111111111111111111111111111111111 +11111101110101010101 111111111111111111111111111111111111111111111111110111010101 +01011111111111111111 111111111111111111111111111111111101110101010101111111111111 +11111111111111111111 111111111111111111011101010101011111111111111111111111111111 +11111111111111111111 110111010101010111111111111111111111111111111111111111111111 +11111101110101010101 0.122807: 111111111111111111111111111111111111111111111111110101010101 +01011111111111111111 111111111111111111111111111111111101010101010101111111111111 +11111111111111111111 111111111111111111010101010101011111111111111111111111111111 +11111111111111111111 110101010101010111111111111111111111111111111111111111111111 +11111101010101010101 111111111111111111111111111111111111111111111111110101010101 +01011111111111111111 111111111111111111111111111111111101010101010101111111111111 +11111111111111111111 111111111111111111010101010101011111111111111111111111111111 +11111111111111111111 110101010101010111111111111111111111111111111111111111111111 +11111101010101010101 Comparing 1 against 15000 sets, took: 0.069599 seconds
The time required barely changes: 0.065085 .v. 0.069599, despite doing 10 times the amount of work.
Used with imagination, bitstrings are the best kept secret of of data processing.
|
---|