It turns out this text is not so interesting, because so much of it is 'formula', for want of a better word. These long formulaic sentence parts mean little, but they give a comparison a high similarity score (see below). By the way, I think the same will hold for the Algorithm::Diff LCS approach (the Tk program that tybalt89 made for you).
Yesterday I generated the comparisons and kept every pair with a similarity above 0.25. That produced a table of almost 40M comparisons, each with its 'similarity' number.
It 'worked', in a way, but the result is still a bit disappointing, I think because of the type of text this is. A more informational text, with less repetition and less fluff, if you see what I mean, might be more interesting.
select
    substring(cmp.sim::text, 1, 5) sim
  , k1.text || chr(10)
 || k2.text || chr(10)
from kjv_simil_0_25 cmp          -- big table
join kjv k1 on id1 = k1.id       -- table from the KJV file
join kjv k2 on id2 = k2.id       --    ,,
where k1.book = 50               -- look at just this book
  and sim > 0.7                  -- remove too different
  and sim < 0.85                 -- remove too identical
;
  sim  | ?column?
-------+----------------------------------------------------------------------
 0.833 | The grace of our Lord Jesus Christ be with you all. Amen.
       | Brethren the grace of our Lord Jesus Christ be with your spirit. Amen.
       |
 0.833 | The grace of our Lord Jesus Christ be with you all. Amen.
       | The grace of our Lord Jesus Christ be with your spirit. Amen.
       |
 0.769 | Grace be unto you and peace from God our Father and from the Lord Jesus Christ.
       | To Timothy my dearly beloved son: Grace mercy and peace from God the Father and Christ Jesus our Lord.
       |
 0.769 | Grace be unto you and peace from God our Father and from the Lord Jesus Christ.
       | To Titus mine own son after the common faith: Grace mercy and peace from God the Father and the Lord Jesus Christ our Saviour.
       |
 0.736 | Grace be unto you and peace from God our Father and from the Lord Jesus Christ.
       | Grace be with you mercy and peace from God the Father and from the Lord Jesus Christ the Son of the Father in truth and love.
       |
 0.723 | Grace be unto you and peace from God our Father and from the Lord Jesus Christ.
       | Unto Timothy my own son in the faith: Grace mercy and peace from God our Father and Jesus Christ our Lord.
       |
 0.714 | The grace of our Lord Jesus Christ be with you all. Amen.
       | The Lord Jesus Christ be with thy spirit. Grace be with you. Amen.
       |
 0.702 | Now unto God and our Father be glory for ever and ever. Amen.
       | Saying Amen: Blessing and glory and wisdom and thanksgiving and honour and power and might be unto our God for ever and ever. Amen.
       |
(8 rows)
Time: 16.557 ms
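If you want to play with this without a database, here is a minimal sketch in Python of how such a 'similarity' number can behave on formulaic text. I have not said above how the number was computed, so treat everything here as an assumption: the sketch uses a Jaccard measure over per-word character trigrams (similar in spirit to what PostgreSQL's pg_trgm module does), and the names `trigrams`, `similarity` and `similar_pairs`, as well as the verse ids, are made up for illustration. It is not meant to reproduce the exact scores in the table.

```python
import re
from itertools import combinations


def trigrams(text):
    """Return the set of character trigrams of `text`, taken word by word.

    Assumption: each lowercased word is padded with two leading spaces
    and one trailing space before it is cut into trigrams.
    """
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def similarity(a, b):
    """Jaccard similarity of the two trigram sets, from 0.0 to 1.0."""
    ga, gb = trigrams(a), trigrams(b)
    if not ga and not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


def similar_pairs(verses, threshold=0.25):
    """Yield (sim, id1, id2) for every pair scoring above `threshold`.

    `verses` maps a (hypothetical) verse id to its text.
    """
    for (id1, t1), (id2, t2) in combinations(sorted(verses.items()), 2):
        sim = similarity(t1, t2)
        if sim > threshold:
            yield sim, id1, id2


# Three verses taken from the output above; the ids are invented.
verses = {
    1: "The grace of our Lord Jesus Christ be with you all. Amen.",
    2: "The grace of our Lord Jesus Christ be with your spirit. Amen.",
    3: "Now unto God and our Father be glory for ever and ever. Amen.",
}

# Print the surviving pairs, most similar first.
for sim, id1, id2 in sorted(similar_pairs(verses), reverse=True):
    print(f"{sim:.3f}  {id1}  {id2}")
```

The two closing benedictions share almost all of their trigrams, so they score far higher than either does against the third verse, even though the shared part carries little information. Note that comparing every pair this way is quadratic in the number of verses, which is why the resulting table gets so large.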