Scotsman has asked for the wisdom of the Perl Monks concerning the following question:

Hi there

I'm looking for help with an SQL problem. I have 2 tables which contain 'keywords' (Keyword1-6 in each table). I'm trying to find a way to efficiently find all records in one table that match one or more of the keywords extracted from a record in the other table, and allow me to sort results in order of highest match.

Sorry if its tough to follow. So for example, in one table i have some records like this:

Record 1:

Keyword1 "Architect"

keyword2 "Major projects"

Keyword3 "London"

Record 2:

Keyword1 "Architect"

Keyword2 "Small projects"

Keyword3 "Sydney"

etc.

and in the other table I have a record which I want to search with

Keyword1 "Architect"

Keyword2 "Hong Kong"

Keyword3 "Small"

say.

I can put those into a string, thus "Architect Sydney Small", and search through the first table to find all full/partial matches.

Since both records in the first table match the search at least partially, I want to return both records from the second table, sorted in order of the best match, eg.

Record2 - 3 keyword matches (one partial)

Record1 - 2 matches

Note the order of Keywords may be different in both tables as shown above. I'm just trying to find anything that matches and sort into the best match order. Hope that's clear.

I've been trying a join but without much success. Any ideas on the best approach?

Thanks guys.

Replies are listed 'Best First'.
Re: Finding best matches
by Corion (Patriarch) on Sep 12, 2013 at 09:37 UTC

    While I'm really fond of SQL for various things, I don't think your current table structure lends itself to good searching.

    I see two potential approaches:

    • Restructure your "keywords" table so it contains tuples of (user,keyword). If you have the appropriate indices, you can then search based on either user or keyword and find the "best" matches:
      select user, count(*) as hits from ... where keyword in ('Architect', 'Sydney', 'Small') group by user order by hits desc
    • If you want a quick/better solution specialized for search, which already knows about (English) stemming etc., consider looking at ElasticSearch. I don't know how applicable it is for simple keyword search, but at least for documents a little bit larger, it seems to attract a lot of followers.
Re: Finding best matches
by hdb (Monsignor) on Sep 12, 2013 at 09:40 UTC

    Two questions:

    1. Why do you have "Hong Kong" in the "other" table but "Sydney" in your "string"?
    2. Why do you have 2 matches on Record1?
    This is all very confusing!

Re: Finding best matches
by keszler (Priest) on Sep 12, 2013 at 13:11 UTC
    IMHO, the best approach would be to normalize the database. Unless every record has six keywords, no more, no less, ever, the keywords should be in their own table linked by some record id.

    That said, I'm not seeing the Perl question here...

Re: Finding best matches
by kcott (Archbishop) on Sep 12, 2013 at 23:36 UTC

    G'day Scotsman,

    Welcome to the monastery.

    If you've come here to ask a purely SQL question, you're basically in the wrong place. However, assuming you've extracted that data (e.g. via DBI or similar) into Perl data structures, then this might do what you want:

    $ perl -Mstrict -Mwarnings -E ' my @source_data = ( ["Archeologist", "Ancient projects", "Cairo"], ["Architect", "Major projects", "London"], ["Architect", "Small projects", "Sydney"], ["Architect", "Small", "Hong Kong"], ["Programmer", "Big", "Hong Kong"], ); my @search_data = ("Architect", "Hong Kong", "Small"); my @found_data; my $re = "(?:" . join("|", @search_data) . ")"; for (@source_data) { my $matches = 0; /$re/ && ++$matches for @$_; push @found_data, [$matches, $_] if $matches; } say "@{$_->[1]}" for sort { $b->[0] <=> $a->[0] } @found_data; ' Architect Small Hong Kong Architect Small projects Sydney Architect Major projects London Programmer Big Hong Kong

    If you have follow-up questions, that's fine, but please read the "How do I post a question effectively?" guidelines before posting them.

    -- Ken