There's more than one way to do things PerlMonks

### Re: String Comparison & Equivalence Challenge

by erix (Prior)
 on Mar 14, 2021 at 08:51 UTC Need Help??

in reply to String Comparison & Equivalence Challenge

PostgreSQL's trigram-comparing doesn't fare too badly, I think. The trigram functionality is inside module pg_trgm (a contrib module).

I compared your example sentences and asked pg_trgm what similarity it thought they had. pg_trgm expresses that similarity as a number between 0 and 1 (='very different' to 'virtually the same'):

```           prior            | strict_word_similarity |           initi
+al part
----------------------------+------------------------+----------------
+--------------------------
(Identical)                |                      1 | And it came to
+pass, when king Hezekia...
(Only punctuation differs) |                      1 | Thou shalt not
+take the name of the LO...
(Similar)                  |              0.8333333 | The fool hath s
+aid in his heart, There...
(Should rank as similar)   |                    0.8 | Thou shalt not
+steal.
(Less similar)             |             0.45614034 | In the beginnin
+g God created the heave...
(5 rows)

-- (column 3 truncated)
[download]```

The SQL I used (the db must have pg_trgm installed, which you can do with CREATE EXTENSION pg_trgm;):

```
select
prior
-- , similarity(txt1, txt2)
, strict_word_similarity(txt1, txt2)
, substring(txt1, 1, 40) -- || chr(10) || txt2

from (values
( '2 Kings 19:1'    , 'And it came to pass, when king Hezekiah hea
+rd it, that he rent his clothes, and covered himself with sackcloth,
+and went into the house of the LORD.',
'Isaiah 37:1'     , 'And it came to pass, when king Hezekiah hea
+rd it, that he rent his clothes, and covered himself with sackcloth,
+and went into the house of the LORD.' ,
'(Identical)')
, ( 'Exodus 20:7', 'Thou shalt not take the name of the LORD thy God
+ in vain; for the LORD will not hold him guiltless that taketh his na
+me in vain.',
'Deuteronomy 5:11', 'Thou shalt not take the name of the LORD th
+y God in vain: for the LORD will not hold him guiltless that taketh h
+is name in vain.',
'(Only punctuation differs)' )
,
( 'Psalm 14:1'      , 'The fool hath said in his heart, There is n
+o God. They are corrupt, they have done abominable works, there is no
+ne that doeth good.',
'Psalm 53:1'      , 'The fool hath said in his heart, There is n
+o God. Corrupt are they, and have done abominable iniquity: there is
+none that doeth good.',
'(Similar)' )
,
( 'Exodus 20:15'    , 'Thou shalt not steal.',
'Deuteronomy 5:19', 'Neither shalt thou steal.',
'(Should rank as similar)' )
,
( 'Genesis 1:1'     , 'In the beginning God created the heaven and
+ the earth.',
'John 1:1'        , 'In the beginning was the Word, and the Word
+ was with God, and the Word was God.',
'(Less similar)' )
)
as f(verse1, txt1, verse2, txt2, prior)
[download]```

Replies are listed 'Best First'.
Re^2: String Comparison & Equivalence Challenge
by Polyglot (Friar) on Mar 14, 2021 at 09:36 UTC

That does appear like an interesting function with relevant output. How would one find a similar function for MariaDB?

Blessings,

~Polyglot~

Sorry, no idea. I haven't tried. (Most likely it doesn't exist.)

Loading your file into postgres wasn't hard, I just tried it. YMMV, of course, especially when you don't have postgresql installed yet.

In case it helps, here is a quick-'n-dirty load into a (postgres!) table, of your file.

```#!/bin/bash

file=KJV_fixed.csv
schema=public
table=kjv
t=\$schema.\$table

echo "
drop   table if     exists \$t ;
create table if not exists \$t (
recordnum serial PRIMARY KEY
, book      int
, chapter   int
, verse     int
, text      text
)
" | psql -X

< \$file perl -ne 'chomp;
my @arr = split(/[,]/, \$_, 4);
print join("\t", @arr), "\n";
' | psql -c "
copy \$t(book, chapter, verse, text)
from stdin (format csv, header false, delimiter E'\t');
analyze \$t;
"
[download]```

Output from that is:

```DROP TABLE
CREATE TABLE
Timing is on.
ANALYZE
Time: 326.648 ms
[download]```

Thank you for taking your time to answer and for your suggestions.

I did a little searching and found nGram, but that doesn't seem quite the same thing.

The math on LanX's "tf-idf" option goes well over my head--to the point where it's not an option for me. (I got an "A" in college Calculus only by studying four hours for it every day, and asking peers lots of questions--but never really understood it, and it's all long gone from my memory.)

I'm sorely tempted to arrange for some access to a PostgreSQL DB just to try this. But how would one go about checking the similarity index for each verse? by iterating through the table 31,102-squared times? Does the DB generate the index on the fly? How could the results be efficiently stored? (I've never used indexes before, so I'm entirely unfamiliar with the process.)

Blessings,

~Polyglot~

Log In?
 Username: Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://11129604]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others pondering the Monastery: (2)
As of 2022-01-28 23:39 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?
In 2022, my preferred method to securely store passwords is:

Results (74 votes). Check out past polls.

Notices?