Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery
 
PerlMonks  

Re: String Comparison & Equivalence Challenge

by erix (Prior)
on Mar 14, 2021 at 08:51 UTC ( [id://11129604]=note: print w/replies, xml ) Need Help??


in reply to String Comparison & Equivalence Challenge

PostgreSQL's trigram-comparing doesn't fare too badly, I think. The trigram functionality is inside module pg_trgm (a contrib module).

I compared your example sentences and asked pg_trgm what similarity it thought they had. pg_trgm expresses that similarity as a number between 0 and 1 (='very different' to 'virtually the same'):

prior | strict_word_similarity | initi +al part ----------------------------+------------------------+---------------- +-------------------------- (Identical) | 1 | And it came to +pass, when king Hezekia... (Only punctuation differs) | 1 | Thou shalt not +take the name of the LO... (Similar) | 0.8333333 | The fool hath s +aid in his heart, There... (Should rank as similar) | 0.8 | Thou shalt not +steal. (Less similar) | 0.45614034 | In the beginnin +g God created the heave... (5 rows) -- (column 3 truncated)

The SQL I used (the db must have pg_trgm installed, which you can do with CREATE EXTENSION pg_trgm;):

select prior -- , similarity(txt1, txt2) , strict_word_similarity(txt1, txt2) , substring(txt1, 1, 40) -- || chr(10) || txt2 from (values ( '2 Kings 19:1' , 'And it came to pass, when king Hezekiah hea +rd it, that he rent his clothes, and covered himself with sackcloth, +and went into the house of the LORD.', 'Isaiah 37:1' , 'And it came to pass, when king Hezekiah hea +rd it, that he rent his clothes, and covered himself with sackcloth, +and went into the house of the LORD.' , '(Identical)') , ( 'Exodus 20:7', 'Thou shalt not take the name of the LORD thy God + in vain; for the LORD will not hold him guiltless that taketh his na +me in vain.', 'Deuteronomy 5:11', 'Thou shalt not take the name of the LORD th +y God in vain: for the LORD will not hold him guiltless that taketh h +is name in vain.', '(Only punctuation differs)' ) , ( 'Psalm 14:1' , 'The fool hath said in his heart, There is n +o God. They are corrupt, they have done abominable works, there is no +ne that doeth good.', 'Psalm 53:1' , 'The fool hath said in his heart, There is n +o God. Corrupt are they, and have done abominable iniquity: there is +none that doeth good.', '(Similar)' ) , ( 'Exodus 20:15' , 'Thou shalt not steal.', 'Deuteronomy 5:19', 'Neither shalt thou steal.', '(Should rank as similar)' ) , ( 'Genesis 1:1' , 'In the beginning God created the heaven and + the earth.', 'John 1:1' , 'In the beginning was the Word, and the Word + was with God, and the Word was God.', '(Less similar)' ) ) as f(verse1, txt1, verse2, txt2, prior)

Replies are listed 'Best First'.
Re^2: String Comparison & Equivalence Challenge
by Polyglot (Chaplain) on Mar 14, 2021 at 09:36 UTC

    That does appear like an interesting function with relevant output. How would one find a similar function for MariaDB?

    Blessings,

    ~Polyglot~

      Sorry, no idea. I haven't tried. (Most likely it doesn't exist.)

      Loading your file into postgres wasn't hard, I just tried it. YMMV, of course, especially when you don't have postgresql installed yet.

      In case it helps, here is a quick-'n-dirty load into a (postgres!) table, of your file.

      #!/bin/bash file=KJV_fixed.csv schema=public table=kjv t=$schema.$table echo " drop table if exists $t ; create table if not exists $t ( recordnum serial PRIMARY KEY , book int , chapter int , verse int , text text ) " | psql -X < $file perl -ne 'chomp; my @arr = split(/[,]/, $_, 4); print join("\t", @arr), "\n"; ' | psql -c " copy $t(book, chapter, verse, text) from stdin (format csv, header false, delimiter E'\t'); analyze $t; "

      Output from that is:

      DROP TABLE CREATE TABLE Timing is on. ANALYZE Time: 326.648 ms

        Thank you for taking your time to answer and for your suggestions.

        I did a little searching and found nGram, but that doesn't seem quite the same thing.

        The math on LanX's "tf-idf" option goes well over my head--to the point where it's not an option for me. (I got an "A" in college Calculus only by studying four hours for it every day, and asking peers lots of questions--but never really understood it, and it's all long gone from my memory.)

        I'm sorely tempted to arrange for some access to a PostgreSQL DB just to try this. But how would one go about checking the similarity index for each verse? by iterating through the table 31,102-squared times? Does the DB generate the index on the fly? How could the results be efficiently stored? (I've never used indexes before, so I'm entirely unfamiliar with the process.)

        Blessings,

        ~Polyglot~

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://11129604]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others musing on the Monastery: (2)
As of 2024-04-19 19:28 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found