Re: String Comparison & Equivalence Challenge

PostgreSQL's trigram-comparing doesn't fare too badly, I think. The trigram functionality is inside module pg_trgm (a contrib module).

I compared your example sentences and asked pg_trgm what similarity it thought they had. pg_trgm expresses that similarity as a number between 0 and 1 (='very different' to 'virtually the same'):

           prior            | strict_word_similarity |           initi
+al part                
----------------------------+------------------------+----------------
+--------------------------
 (Identical)                |                      1 | And it came to 
+pass, when king Hezekia... 
 (Only punctuation differs) |                      1 | Thou shalt not 
+take the name of the LO...
 (Similar)                  |              0.8333333 | The fool hath s
+aid in his heart, There...
 (Should rank as similar)   |                    0.8 | Thou shalt not 
+steal.
 (Less similar)             |             0.45614034 | In the beginnin
+g God created the heave... 
(5 rows)

-- (column 3 truncated)
[download]

The SQL I used (the db must have pg_trgm installed, which you can do with CREATE EXTENSION pg_trgm;):


select 
     prior
-- , similarity(txt1, txt2)  
   , strict_word_similarity(txt1, txt2)  
   , substring(txt1, 1, 40) -- || chr(10) || txt2

from (values 
    ( '2 Kings 19:1'    , 'And it came to pass, when king Hezekiah hea
+rd it, that he rent his clothes, and covered himself with sackcloth, 
+and went into the house of the LORD.',
      'Isaiah 37:1'     , 'And it came to pass, when king Hezekiah hea
+rd it, that he rent his clothes, and covered himself with sackcloth, 
+and went into the house of the LORD.' ,
      '(Identical)')
  , ( 'Exodus 20:7', 'Thou shalt not take the name of the LORD thy God
+ in vain; for the LORD will not hold him guiltless that taketh his na
+me in vain.',
      'Deuteronomy 5:11', 'Thou shalt not take the name of the LORD th
+y God in vain: for the LORD will not hold him guiltless that taketh h
+is name in vain.',
      '(Only punctuation differs)' )
  ,
    ( 'Psalm 14:1'      , 'The fool hath said in his heart, There is n
+o God. They are corrupt, they have done abominable works, there is no
+ne that doeth good.',
      'Psalm 53:1'      , 'The fool hath said in his heart, There is n
+o God. Corrupt are they, and have done abominable iniquity: there is 
+none that doeth good.',
      '(Similar)' )
  ,
    ( 'Exodus 20:15'    , 'Thou shalt not steal.',
      'Deuteronomy 5:19', 'Neither shalt thou steal.',
      '(Should rank as similar)' )
  ,
    ( 'Genesis 1:1'     , 'In the beginning God created the heaven and
+ the earth.',
      'John 1:1'        , 'In the beginning was the Word, and the Word
+ was with God, and the Word was God.',
       '(Less similar)' )
) 
           as f(verse1, txt1, verse2, txt2, prior)
[download]

Comment on Re: String Comparison & Equivalence Challenge Select or Download Code

Replies are listed 'Best First'.
Re^2: String Comparison & Equivalence Challenge by Polyglot (Chaplain) on Mar 14, 2021 at 09:36 UTC
That does appear like an interesting function with relevant output. How would one find a similar function for MariaDB? Blessings, ~Polyglot~	[reply]
Re^3: String Comparison & Equivalence Challenge by erix (Prior) on Mar 14, 2021 at 10:33 UTC
Sorry, no idea. I haven't tried. (Most likely it doesn't exist.) Loading your file into postgres wasn't hard, I just tried it. YMMV, of course, especially when you don't have postgresql installed yet. In case it helps, here is a quick-'n-dirty load into a (postgres!) table, of your file. `#!/bin/bash file=KJV_fixed.csv schema=public table=kjv t=$schema.$table echo " drop table if exists $t ; create table if not exists $t ( recordnum serial PRIMARY KEY , book int , chapter int , verse int , text text ) " \| psql -X < $file perl -ne 'chomp; my @arr = split(/[,]/, $_, 4); print join("\t", @arr), "\n"; ' \| psql -c " copy $t(book, chapter, verse, text) from stdin (format csv, header false, delimiter E'\t'); analyze $t; "` [download] Output from that is: `DROP TABLE CREATE TABLE Timing is on. ANALYZE Time: 326.648 ms` [download]	[reply] [d/l] [select]
Re^4: String Comparison & Equivalence Challenge by Polyglot (Chaplain) on Mar 14, 2021 at 10:51 UTC
Thank you for taking your time to answer and for your suggestions. I did a little searching and found nGram, but that doesn't seem quite the same thing. The math on LanX's "tf-idf" option goes well over my head--to the point where it's not an option for me. (I got an "A" in college Calculus only by studying four hours for it every day, and asking peers lots of questions--but never really understood it, and it's all long gone from my memory.) I'm sorely tempted to arrange for some access to a PostgreSQL DB just to try this. But how would one go about checking the similarity index for each verse? by iterating through the table 31,102-squared times? Does the DB generate the index on the fly? How could the results be efficiently stored? (I've never used indexes before, so I'm entirely unfamiliar with the process.) Blessings, ~Polyglot~	[reply]
Re^5: String Comparison & Equivalence Challenge by erix (Prior) on Mar 14, 2021 at 11:11 UTC
Re^5: String Comparison & Equivalence Challenge (tf-idf) by LanX (Saint) on Mar 14, 2021 at 16:10 UTC
Re^5: String Comparison & Equivalence Challenge by erix (Prior) on Mar 15, 2021 at 13:00 UTC
Re^6: String Comparison & Equivalence Challenge by Polyglot (Chaplain) on Mar 15, 2021 at 13:40 UTC
Some notes below your chosen depth have not been shown here


Welcome to the Monastery
	PerlMonks