cormanaz has asked for the wisdom of the Perl Monks concerning the following question:

Good day monks. I have text stored in a postgres database. For each record, I also have a JSONB column containing a hash of keywords previously found in the text. I want to count the number of a particular subset of keywords in each record. My question is whether it is faster to do it using option 1:
my @kwlist = qw(foo bar baz qux);
my $kwre = join('|',@kwlist);
my @items = getfromdb("select id,text from table"); # I have a sub that does this
foreach my $i (0..$#items) {
    my ($id,$text) = @{$items[$i]};
    my $count = 0;
    while($text =~ /$kwre/g) { $count++; }
}
or using option 2:
use JSON::XS;

my @kwlist = qw(foo bar baz qux);
my @items = getfromdb("select id,keywordhash from table"); # I have a sub that does this
foreach my $i (0..$#items) {
    my ($id,$temp) = @{$items[$i]};
    my $kws = decode_json($temp);
    my $count = 0;
    foreach my $k (@kwlist) {
        if (defined($kws->{$k})) { $count++; }
    }
}
Or is there some faster way to do it that I haven't thought of? I have no idea what is going on under the hood in these two options that would affect execution time, but I feel confident some other monks do.

BTW I realize there may be some way to do it with an exotic postgres query, but it's beyond me and anyway the list of keywords I'm looking for is rather long so it would be unwieldy.

Re: Faster of two options for finding keywords in a database
by jo37 (Curate) on Feb 17, 2024 at 18:18 UTC

    Hi cormanaz!

    Your two options don't seem to solve the same task. Number one counts the appearances of the keywords as substrings in the JSON string, while number two counts exact matches of the keywords in the keys of the JSON.

    The "best" solution depends on the number of keywords vs. the average number of keys in the columns. Assuming the number of keywords is larger, it would make sense to reverse the check: create a hash from the keywords and look up the keys from the JSON in it. Avoiding the array indexing in the loop gives a further small speedup.

    #!/usr/bin/perl

    use v5.24;
    use warnings;

    use JSON::XS;
    use Benchmark 'cmpthese';

    my @kwlist = qw(foo bar baz qux q0 q1 q2 q3 q4 q5 q6 q7 q8 q9);
    my %kwhash = map {($_, 0)} @kwlist;
    my $kwre = join('|',@kwlist);

    #my @items = getfromdb("select id,text from table"); # I have a sub that does this
    my @items = ([1, '{"foo":0}'], [2, '{"foo":0,"qux":0}'], [3, '{"foo":0}'],
                 [4, '{"baz":0}'], [5, '{"twx":0}']); # Wildly guessed data

    sub var1 {
        foreach my $i (0..$#items) {
            my ($id,$text) = @{$items[$i]};
            my $count = 0;
            while($text =~ /$kwre/g) { $count++; }
        }
    }

    sub var2 {
        foreach my $i (0..$#items) {
            my ($id,$temp) = @{$items[$i]};
            my $kws = decode_json($temp);
            my $count = 0;
            foreach my $k (@kwlist) {
                if (defined($kws->{$k})) { $count++; }
            }
        }
    }

    sub var3 {
        foreach my $item (@items) {
            my (undef,$text) = @$item;
            my $kws = decode_json($text);
            my $count = 0;
            $count += exists $kwhash{$_} for keys %$kws;
        }
    }

    cmpthese(0, {
        var1 => \&var1,
        var2 => \&var2,
        var3 => \&var3,
    });

    __DATA__
              Rate  var2  var1  var3
    var2  176400/s    --  -31%  -38%
    var1  254510/s   44%    --  -11%
    var3  285034/s   62%   12%    --

    Greetings,
    🐻

    $gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$
Re: Faster of two options for finding keywords in a database
by NERDVANA (Priest) on Feb 17, 2024 at 19:48 UTC
    Assuming what you really want to do is find the most relevant record for a search, neither option is going to give you good performance on a large database. The "right tool for the job" is either the Postgres fulltext search feature, or using a second database like ElasticSearch.

    The built-in postgres keyword search isn't as good as ElasticSearch, but it's right at your fingertips when your database is already postgres. It works like this:

    1. Declare a column for your keywords, and write some clever code to fill it with all the keyword combinations that matter, including duplicates for the important ones.
    2. Put a fulltext index on that column
    3. Sort your postgres query by the ts_rank

    Here's an example of the DDL:

    CREATE TABLE sometable (
      ...
      keyword_search text not null
    );
    CREATE INDEX sometable_fulltext ON sometable
      USING GIN (to_tsvector('english', keyword_search));

    Obviously, change 'english' if your users have a different primary language.

    If you have multiple columns to search, you can make a generated column that combines them into the keywords:

    keyword_search text not null GENERATED ALWAYS AS (
      name || ' ' || COALESCE(description, '') || ' ' || COALESCE(tags, '')
    ) STORED,
    You select from this table like this:

    SELECT * FROM sometable
     WHERE to_tsvector('english', keyword_search) @@ plainto_tsquery('english', ?)
     ORDER BY ts_rank(to_tsvector('english', keyword_search),
                      plainto_tsquery('english', ?)) DESC
    Then substitute your search string for both '?' placeholders in that query.
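
    From Perl, a rough DBI sketch of running that query (untested; the connection details are placeholders, and the table and column names follow the examples above):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect("dbi:Pg:dbname=mydb", "user", "pass", { RaiseError => 1 });

    my $search = 'foo bar baz';
    my $sth = $dbh->prepare(q{
        SELECT * FROM sometable
         WHERE to_tsvector('english', keyword_search) @@ plainto_tsquery('english', ?)
         ORDER BY ts_rank(to_tsvector('english', keyword_search),
                          plainto_tsquery('english', ?)) DESC
    });
    $sth->execute($search, $search);   # the same search string is bound to both placeholders
    while (my $row = $sth->fetchrow_hashref) {
        # ... render $row ...
    }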

    If you're interested in integrating this with DBIx::Class, I can show some neat code for that.

    The ElasticSearch solution is more involved. You would set up a separate ElasticSsearch database and then every time your application modifies content you also push a summary of that content into the elasticsearch database, then query that database instead of postgres to get your list of entity IDs to show to the user, then query postgres for the entities using the list of IDs, then render them for the user.

      Or possibly more simply just keep a database lookup table with columns:

      keyword || id_reference_to_text_column

      then query that table with your keywords like "select id_reference_to_text_column from lookup_table where keyword in (key1, key2, ...)" or however you do that in postgres. Then you could simply sum up the occurrences of each id_reference_to_text_column.

      Or faster with group by (mysql syntax)

      create table tmp (kwd varchar(255), idx integer, primary key (kwd, idx))

      populate table ... then query it like

      select idx, count(*)
        from tmp
       where kwd in ("heugh", "stook", "tanti")
       group by idx
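
      On the Perl side that could be fetched along these lines (untested sketch; the connection details and keyword list are placeholders):

      use strict;
      use warnings;
      use DBI;

      my $dbh = DBI->connect("dbi:Pg:dbname=mydb", "user", "pass", { RaiseError => 1 });

      my @kwlist       = qw(heugh stook tanti);
      my $placeholders = join ',', ('?') x @kwlist;
      my $counts = $dbh->selectall_arrayref(
          "select idx, count(*) from tmp where kwd in ($placeholders) group by idx",
          undef, @kwlist,
      );
      printf "record %d matched %d keyword(s)\n", @$_ for @$counts;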
        This is at least cross-platform, but it's also sort of what the postgres fulltext feature does. The postgres fulltext should still run faster though, and I think it uses much more compact storage than creating one row per keyword per original object with a b-tree on the whole thing.

        Also debatable which is "simpler", because with a table of keyword-to-record-id you have to maintain that yourself. With the one I showed, once you get the table and query set up, postgres does all the work to maintain the index.

Re: Faster of two options for finding keywords in a database
by Tux (Canon) on Feb 19, 2024 at 15:48 UTC
    psql=# drop operator if exists =~ (text, text);
    psql=# drop function if exists pcre_match (text, text);
    psql=# create function pcre_match (text, text) returns bool
               language plperl immutable leakproof strict parallel safe
               as $$ return $_[0] =~ $_[1] ? 1 : 0; $$;
    psql=# create operator =~ (function = pcre_match, leftarg = text, rightarg = text);

    my @items = getfromdb ("select id, text from table where text =~ '$kwre'");

    Enjoy, Have FUN! H.Merijn
Re: Faster of two options for finding keywords in a database
by Danny (Chaplain) on Feb 17, 2024 at 17:40 UTC
    It seems like the hash lookup should be faster than doing a bunch of regex matches, but can't you just throw some examples at it and time the two?
Re: Faster of two options for finding keywords in a database
by LanX (Saint) on Feb 17, 2024 at 17:22 UTC
    I don't know the PG features you are using; I can only guess that the fields text and keywordhash are both the same JSON string.

    Using a regex here is not a wise decision, even with proper delimiters (which are dangerously lacking).

    If destructing¹ the hash is fine, I'd try delete @kws{@kwlist} (see delete and hash slice) and calculate the difference of scalar keys %kws before and after.
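
    A minimal sketch of that idea (untested; the keyword list and the record's JSON are made up):

    use strict;
    use warnings;
    use JSON::XS;

    my @kwlist = qw(foo bar baz qux);
    my $kws    = decode_json('{"foo":0,"baz":0,"twx":0}');   # made-up record

    my $before = scalar keys %$kws;
    delete @{$kws}{@kwlist};                    # hash slice delete - removes the matched keys
    my $count  = $before - scalar keys %$kws;   # 2 keywords were present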

    But - as you already mentioned - I suppose PG already has a built-in hash-slice-like feature you could use for faster execution.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    see Wikisyntax for the Monastery

    update

    FWIW your use of defined is also error prone; exists is the way to go.

    Footnotes

    ¹) And if destructing is not OK, consider delete local @kws{@kwlist}. DISCLAIMER: I've never used it and wasn't even actively aware of it.
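
    For reference, a tiny untested demo of what delete local does with a hash slice (Perl 5.12+):

    use strict;
    use warnings;

    my %kws = (foo => 1, bar => 2, twx => 3);
    {
        delete local @kws{qw(foo bar)};   # the keys are gone only inside this block
        print scalar(keys %kws), "\n";    # 1
    }
    print scalar(keys %kws), "\n";        # 3 - the original hash is restored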

      "FWIW your use of defined is also error prone, exists it's the way to go." More about this, please?
        As soon as undef is used as a value, your code will break. Better safe than sorry.

        Debugger demo:

        DB<2> $h{a}=1

        DB<3> $h{b}=undef

        DB<4> x grep {defined $h{$_}} "a".."d"
        0  'a'

        DB<5> x grep {exists $h{$_}} "a".."d"
        0  'a'
        1  'b'

        DB<6>

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        see Wikisyntax for the Monastery

Re: Faster of two options for finding keywords in a database
by erix (Prior) on Feb 19, 2024 at 15:24 UTC

    I'd like to try an exotic postgres SQL query, but your question does not contain database-specific detail (I mean, a JSONB column containing a 'hash'? What does that mean?).

    If the requirements make it possible to use postgresql JSONB indexing, it can be extremely fast.
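
    For instance, if the JSONB keys are the keywords themselves, something along these lines (untested; table and column names are guessed from the question) can be served by a GIN index:

    psql=# CREATE INDEX tbl_keywordhash_gin ON tbl USING GIN (keywordhash);
    psql=# SELECT id FROM tbl WHERE keywordhash ?| array['foo','bar','baz','qux'];

    That only filters for rows containing at least one of the keywords; the per-row counting would still be done afterwards, in SQL or in Perl.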