cormanaz has asked for the wisdom of the Perl Monks concerning the following question:

Good day monks. I have text stored in a postgres database. For each record, I also have a JSONB column containing a hash of keywords previously found in the text. I want to count the number of a particular subset of keywords in each record. My question is whether it is faster to do it using option 1:
my @kwlist = qw(foo bar baz qux);
my $kwre = join('|',@kwlist);
my @items = getfromdb("select id,text from table"); # I have a sub that does this
foreach my $i (0..$#items) {
    my ($id,$text) = @{$items[$i]};
    my $count = 0;
    while($text =~ /$kwre/g) { $count++; }
}
or using option 2:
use JSON::XS;

my @kwlist = qw(foo bar baz qux);
my @items = getfromdb("select id,keywordhash from table"); # I have a sub that does this
foreach my $i (0..$#items) {
    my ($id,$temp) = @{$items[$i]};
    my $kws = decode_json($temp);
    my $count = 0;
    foreach my $k (@kwlist) {
        if (defined($kws->{$k})) { $count++; }
    }
}
Or is there some faster way to do it that I haven't thought of? I have no idea what is going on under the hood in these two options that would affect execution time, but I feel confident some other monks do.

BTW I realize there may be some way to do it with an exotic postgres query, but it's beyond me and anyway the list of keywords I'm looking for is rather long so it would be unwieldy.

Re: Faster of two options for finding keywords in a database
by jo37 (Curate) on Feb 17, 2024 at 18:18 UTC

    Hi cormanaz!

    Your two options don't seem to solve the same task. Number one counts the appearances of the keywords as substrings in the JSON string, while number two counts exact matches of the keywords in the keys of the JSON.

    The "best" solution depends on the number of keywords vs. the average number of keys in the columns. Assuming the number of keywords is larger, it would make sense to reverse the check: create a hash from the keywords and look up the keys from the JSON in it. Avoiding the array indexing in the loop gives a further small speedup.

    #!/usr/bin/perl

    use v5.24;
    use warnings;

    use JSON::XS;
    use Benchmark 'cmpthese';

    my @kwlist = qw(foo bar baz qux q0 q1 q2 q3 q4 q5 q6 q7 q8 q9);
    my %kwhash = map {($_, 0)} @kwlist;
    my $kwre = join('|',@kwlist);

    #my @items = getfromdb("select id,text from table"); # I have a sub that does this
    my @items = ([1, '{"foo":0}'], [2, '{"foo":0,"qux":0}'], [3, '{"foo":0}'],
                 [4, '{"baz":0}'], [5, '{"twx":0}']); # Wildly guessed data

    sub var1 {
        foreach my $i (0..$#items) {
            my ($id,$text) = @{$items[$i]};
            my $count = 0;
            while($text =~ /$kwre/g) { $count++; }
        }
    }

    sub var2 {
        foreach my $i (0..$#items) {
            my ($id,$temp) = @{$items[$i]};
            my $kws = decode_json($temp);
            my $count = 0;
            foreach my $k (@kwlist) {
                if (defined($kws->{$k})) { $count++; }
            }
        }
    }

    sub var3 {
        foreach my $item (@items) {
            my (undef,$text) = @$item;
            my $kws = decode_json($text);
            my $count = 0;
            $count += exists $kwhash{$_} for keys %$kws;
        }
    }

    cmpthese(0, {
        var1 => \&var1,
        var2 => \&var2,
        var3 => \&var3,
    });

    __DATA__
              Rate  var2  var1  var3
    var2  176400/s    --  -31%  -38%
    var1  254510/s   44%    --  -11%
    var3  285034/s   62%   12%    --

    Greetings,
    🐻

    $gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$
Re: Faster of two options for finding keywords in a database
by NERDVANA (Priest) on Feb 17, 2024 at 19:48 UTC
    Assuming what you really want to do is find the most relevant record for a search, neither option is going to give you good performance on a large database. The "right tool for the job" is either the Postgres fulltext search feature, or using a second database like ElasticSearch.

    The built-in postgres keyword search isn't as good as ElasticSearch, but it's right at your fingertips when your database is already postgres. It works like this:

    1. Declare a column for your keywords, and write some clever code to fill it with all the keyword combinations that matter, including duplicates for the important ones.
    2. Put a fulltext index on that column
    3. Sort your postgres query by the ts_rank

    Here's an example of the DDL:

    CREATE TABLE sometable (
      ...
      keyword_search text not null
    );
    CREATE INDEX sometable_fulltext ON sometable
      USING GIN (to_tsvector('english', keyword_search));

    Obviously, change 'english' if your users have a different primary language.

    If you have multiple columns to search, you can make a generated column that combines them into the keywords:

    keyword_search text not null GENERATED ALWAYS AS (
      name || ' ' || COALESCE(description, '') || ' ' || COALESCE(tags, '')
    ) STORED,
    You select from this table like this:

    SELECT * FROM sometable
     WHERE to_tsvector('english', keyword_search) @@ plainto_tsquery('english', ?)
     ORDER BY ts_rank(to_tsvector('english', keyword_search),
                      plainto_tsquery('english', ?)) DESC
    Then substitute your search string for both '?' placeholders in that query.
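
    From Perl, a rough DBI sketch of running that query (untested; the connection details are placeholders, and the table and column names follow the examples above):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect("dbi:Pg:dbname=mydb", "user", "pass", { RaiseError => 1 });

    my $search = 'foo bar baz';
    my $sth = $dbh->prepare(q{
        SELECT * FROM sometable
         WHERE to_tsvector('english', keyword_search) @@ plainto_tsquery('english', ?)
         ORDER BY ts_rank(to_tsvector('english', keyword_search),
                          plainto_tsquery('english', ?)) DESC
    });
    $sth->execute($search, $search);   # the same search string is bound to both placeholders
    while (my $row = $sth->fetchrow_hashref) {
        # ... render $row ...
    }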

    If you're interested in integrating this with DBIx::Class, I can show some neat code for that.

    The ElasticSearch solution is more involved. You would set up a separate ElasticSsearch database and then every time your application modifies content you also push a summary of that content into the elasticsearch database, then query that database instead of postgres to get your list of entity IDs to show to the user, then query postgres for the entities using the list of IDs, then render them for the user.

      Or possibly more simply just keep a database lookup table with columns:

      keyword || id_reference_to_text_column

      then query that table with your keywords like "select id_reference_to_text_column from lookup_table where keyword in (key1, key2, ...)" or however you do that in postgres. Then you could simply sum up the occurrences of each id_reference_to_text_column.

      Or faster with group by (mysql syntax)

      create table tmp (kwd varchar(255), idx integer, primary key (kwd, idx))

      populate table ... then query it like

      select idx, count(*)
        from tmp
       where kwd in ("heugh", "stook", "tanti")
       group by idx
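
      On the Perl side that could be fetched along these lines (untested sketch; the connection details and keyword list are placeholders):

      use strict;
      use warnings;
      use DBI;

      my $dbh = DBI->connect("dbi:Pg:dbname=mydb", "user", "pass", { RaiseError => 1 });

      my @kwlist       = qw(heugh stook tanti);
      my $placeholders = join ',', ('?') x @kwlist;
      my $counts = $dbh->selectall_arrayref(
          "select idx, count(*) from tmp where kwd in ($placeholders) group by idx",
          undef, @kwlist,
      );
      printf "record %d matched %d keyword(s)\n", @$_ for @$counts;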
        This is at least cross-platform, but it's also sort of what the postgres fulltext feature does. The postgres fulltext should still run faster though, and I think it uses much more compact storage than creating one row per keyword per original object with a b-tree on the whole thing.

        Also debatable which is "simpler", because with a table of keyword-to-record-id you have to maintain that yourself. With the one I showed, once you get the table and query set up, postgres does all the work to maintain the index.

Re: Faster of two options for finding keywords in a database
by Tux (Canon) on Feb 19, 2024 at 15:48 UTC
    psql=# drop operator if exists =~ (text, text);
    psql=# drop function if exists pcre_match (text, text);
    psql=# create function pcre_match (text, text) returns bool
               language plperl immutable leakproof strict parallel safe
               as $$ return $_[0] =~ $_[1] ? 1 : 0; $$;
    psql=# create operator =~ (function = pcre_match, leftarg = text, rightarg = text);

    my @items = getfromdb ("select id, text from table where text =~ '$kwre'");

    Enjoy, Have FUN! H.Merijn
Re: Faster of two options for finding keywords in a database
by Danny (Chaplain) on Feb 17, 2024 at 17:40 UTC
    It seems like the hash lookup should be faster than doing a bunch of regex matches, but can't you just throw some examples at it and time the two?
Re: Faster of two options for finding keywords in a database
by LanX (Saint) on Feb 17, 2024 at 17:22 UTC
    I don't know the PG features you are using; I can only guess that the fields text and keywordhash are both the same JSON string.

    Using a regex here is not a wise decision, even with proper delimiters (which are dangerously lacking).

    If destructing¹ the hash is fine, I'd try delete @kws{@kwlist} (see delete and hash slice) and calculate the difference of scalar keys %kws before and after.
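
    A minimal sketch of that idea (untested; the keyword list and the record's JSON are made up):

    use strict;
    use warnings;
    use JSON::XS;

    my @kwlist = qw(foo bar baz qux);
    my $kws    = decode_json('{"foo":0,"baz":0,"twx":0}');   # made-up record

    my $before = scalar keys %$kws;
    delete @{$kws}{@kwlist};                    # hash slice delete - removes the matched keys
    my $count  = $before - scalar keys %$kws;   # 2 keywords were present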

    But - as you already mentioned - I suppose PG already has a built-in hash-slice-like feature you could use for faster execution.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    see Wikisyntax for the Monastery

    update

    FWIW your use of defined is also error prone; exists is the way to go.

    Footnotes

    ¹) And if destructing is not OK, consider delete local @kws{@kwlist}. DISCLAIMER: I've never used it and wasn't even actively aware of it.
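
    For reference, a tiny untested demo of what delete local does with a hash slice (Perl 5.12+):

    use strict;
    use warnings;

    my %kws = (foo => 1, bar => 2, twx => 3);
    {
        delete local @kws{qw(foo bar)};   # the keys are gone only inside this block
        print scalar(keys %kws), "\n";    # 1
    }
    print scalar(keys %kws), "\n";        # 3 - the original hash is restored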

      "FWIW your use of defined is also error prone, exists it's the way to go." More about this, please?
        As soon as undef is used as a value, your code will break. Better safe than sorry.

        Debugger demo:

        DB<2> $h{a}=1

        DB<3> $h{b}=undef

        DB<4> x grep {defined $h{$_}} "a".."d"
        0  'a'

        DB<5> x grep {exists $h{$_}} "a".."d"
        0  'a'
        1  'b'

        DB<6>

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        see Wikisyntax for the Monastery

Re: Faster of two options for finding keywords in a database
by erix (Prior) on Feb 19, 2024 at 15:24 UTC

    I'd like to try an exotic postgres SQL query, but your question does not contain database-specific detail (I mean, a JSONB column containing a 'hash'? What does that mean?).

    If the requirements make it possible to use postgresql JSONB indexing, it can be extremely fast.
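
    For instance, if the JSONB keys are the keywords themselves, something along these lines (untested; table and column names are guessed from the question) can be served by a GIN index:

    psql=# CREATE INDEX tbl_keywordhash_gin ON tbl USING GIN (keywordhash);
    psql=# SELECT id FROM tbl WHERE keywordhash ?| array['foo','bar','baz','qux'];

    That only filters for rows containing at least one of the keywords; the per-row counting would still be done afterwards, in SQL or in Perl.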