String Comparison & Equivalence Challenge

Polyglot has asked for the wisdom of the Perl Monks concerning the following question:

BACKGROUND

Suppose we start with a database of the Bible, which has ~31,102 verses. Some of the verses are similar to each other. Some are textually identical. Suppose we wish to alert readers of one verse to other verses which are very similar or are the same; or perhaps return search results ranked by similarity.

CHALLENGE

1) How would we find and rank the similarity of other verses, such as on a percentage of equivalence/similarity?

2) How would we index the rankings?

EXAMPLE VERSES

(Identical)

2 Kings 19:1 -- And it came to pass, when king Hezekiah heard it, that he rent his clothes, and covered himself with sackcloth, and went into the house of the LORD.

Isaiah 37:1 -- And it came to pass, when king Hezekiah heard it, that he rent his clothes, and covered himself with sackcloth, and went into the house of the LORD.

(Only punctuation differs)

Exodus 20:7 -- Thou shalt not take the name of the LORD thy God in vain; for the LORD will not hold him guiltless that taketh his name in vain.

Deuteronomy 5:11 -- Thou shalt not take the name of the LORD thy God in vain: for the LORD will not hold him guiltless that taketh his name in vain.

(Similar)

Psalm 14:1 -- The fool hath said in his heart, There is no God. They are corrupt, they have done abominable works, there is none that doeth good.

Psalm 53:1 -- The fool hath said in his heart, There is no God. Corrupt are they, and have done abominable iniquity: there is none that doeth good.

(Should rank as similar)

Exodus 20:15 -- Thou shalt not steal.

Deuteronomy 5:19 -- Neither shalt thou steal.

(Less similar)

Genesis 1:1 -- In the beginning God created the heaven and the earth.

John 1:1 -- In the beginning was the Word, and the Word was with God, and the Word was God.

For any given verse there will be a different scale of similarities to other verses. Some verses may be so unique that almost no other verse would approach them in similarity. Other verses may have one or more identical or virtually identical repeats. Finding the nearest equivalent, which may not be very equivalent, might be a challenge. For some verses, perhaps many similar verses could be found, whereas for others the list may be very short--or even none at all, depending on where the line is drawn for threshold of equivalency.

All of this depends on being able, first and foremost, to measure the equivalence of two different strings. Secondarily, how should these rankings be preserved?

As a bonus, finding the most efficient way of indexing the table might be interesting. Does one iterate over each of the 31,102 verses once for every verse?

Enjoy!

RESOURCES

(Claim to have free SQL-format Bibles, with sign-up)

http://biblehub.net

- OR -

(Free CSV download of KJV Bible, easily converted to database [script below])

http://my-bible-study.appspot.com/assets/KJV_fixed.csv

And here's a script to use to push the CSV file into a database:

CREATE BIBLE DATABASE

#!/usr/bin/perl

use DBI;
use strict;
use warnings;


# DEFAULT SCRIPT USER
our $db_user_name = 'root';

# DEFAULT PASSWORD FOR SCRIPT USER
our $db_password = '[mysql_root_password_here]';

our $database = "Bibles";
our $table = "KJV"; 
our $dbfilename = 'KJV_fixed.csv';

our $DEBUG=0;

# Treat all input and output as UTF-8 and set the flags correctly
# Because _everyone_ should be using UTF8 these days!
binmode STDOUT, ":utf8";
binmode STDERR, ":utf8";
binmode STDIN,  ":utf8";


#################
#begin
    
    &initializeDB;
    my @source = &readsource($dbfilename);
    &filltable(@source);
    exit;
    
#end
#################

sub initializeDB {

    print "Creating database: $database\n";
    my $dsn = "DBI:mysql:host=localhost";    
    my $statement = qq| CREATE DATABASE IF NOT EXISTS $database CHARAC
+TER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci; |;
    &connectdb($statement,$dsn);
    
} #END SUB initializeDB

sub filltable {
my @source = @_;
my ($booknum, $chapternum, $verse, $text, $other) = ('', '', '', '', '
+');
my $line = '';
my $count = 0;
my $percent = 0;

    my $statement = qq|create table if not exists $table (RecordNum SM
+ALLINT NOT NULL PRIMARY KEY AUTO_INCREMENT, Book TINYINT, Chapter SMA
+LLINT, Verse SMALLINT, Text text) CHARACTER SET utf8mb4 COLLATE utf8m
+b4_unicode_ci ENGINE=MyISAM DEFAULT CHARSET=utf8mb4;|;               
+              
    &connectdb($statement);        
            
    print "Filling the table with the data...\n\nThis could take awhil
+e.\n";
    print "Total lines: ".scalar @source."\n";
    
    foreach $line (@source) {
        $count++;

        if ($count/3110 == int($count/3110)) { 
            $percent++;
            print "${percent}0\% completed...\n";
        } 
        
        chomp $line;
        $line =~ s/'/\\'/g;
        $line =~ s/(\d+),(\d+),(\d+),"(.*?)"/$1\t$2\t$3\t$4/g;
            
        ($booknum, $chapternum, $verse, $text, $other) = split(/\t/, $
+line);
                
        $text =~ s/^\s+//;
        $text =~ s/\s+$//;
        
        $statement = qq|INSERT INTO $table (RecordNum, Book, Chapter, 
+Verse, Text) VALUES (null, '$booknum', '$chapternum', '$verse', '$tex
+t'); |; 
        &connectdb($statement);    
    }
} #END SUB filltable

sub readsource {
my $filename=shift @_;
my @data = ();
    print "\nReading source: $filename...";

    open SOURCE, "<$filename" or die "Cannot open source file $!\n";
    @data = <SOURCE>;
    close SOURCE;
        
    print "Done.\n\n";
    
    return @data;
} #END SUB readsource        


sub connectdb { if ($DEBUG==9) {print "sub connectdb: \n"}; 
my $statement = shift @_;
my $dsn = shift @_ || "DBI:mysql:$database:localhost";    
    
    my $dbh = DBI->connect($dsn, $db_user_name, $db_password, {
                         mysql_enable_utf8 => 1
                    }) or die "Can't connect to the DB: $DBI::errstr\n
+";  
    my $quest = $dbh->prepare($statement, { RaiseError => 1 }) or die 
+"Cannot prepare statement! $DBI::errstr\n";
    if ($DEBUG>0) {
      $quest->execute() or die qq|CONNECT DATABASE Statement:  \n $sta
+tement \n
        \n Error in first database statement! \n $DBI::errstr \n |;
    } else { $quest->execute() };
      
} # END SUB connectdb
[download]

READ BIBLE DATABASE

#!/usr/bin/perl

use DBI;
use strict;
use warnings;


# DEFAULT SCRIPT USER
our $db_user_name = 'root';

# DEFAULT PASSWORD FOR SCRIPT USER
our $db_password = '[mysql_root_password_here]';

our $database = "Bibles";
our $table = "KJV"; 

our $DEBUG=0;

# Treat all input and output as UTF-8 and set the flags correctly
# Because _everyone_ should be using UTF8 these days!
binmode STDOUT, ":utf8";
binmode STDERR, ":utf8";
binmode STDIN,  ":utf8";


#################
#begin
    
    my @results = &read_table_from_DB;
    &processResults(@results);
    exit;
    
#end
#################

sub processResults {
my @data = @_;
my ($record, $booknum, $chapternum, $versenum, $text) = ('', '', '', '
+', '');

#OPTIONAL FORM OF ITERATION: 
#  JUST KEEP IN MIND THAT ARE FIVE (5)
#  RECORDS FOR EACH "ROW" 

    while (@data) {
        $record = shift @data;
        $booknum = shift @data;
        $chapternum = shift @data;
        $versenum = shift @data;
        $text = shift @data;
        
        #CODE TO ANALYZE AND COMPARE AMONG TEXTS
        #print $record."\n"; #WILL GENERATE CONSIDERABLE TEXT TO SCREE
+N        
    
    }


} #END SUB processResults

sub read_table_from_DB {
my @results = ();
my ($record, $booknum, $chapternum, $verse, $text, $other) = ('', '', 
+'', '', '', '');
my $line = '';
my $count = 0;

    my $statement = qq|SELECT * FROM $table;|;                        
+     
    @results = &connectdb($statement);        

} #END SUB read_table_from_DB


sub connectdb { if ($DEBUG==9) {print "sub connectdb: \n"}; 
my $statement = shift @_;
my $dsn = shift @_ || "DBI:mysql:$database:localhost";    
my @results = ();
   
    my $dbh = DBI->connect($dsn, $db_user_name, $db_password, {
                         mysql_enable_utf8 => 1
                    }) or die "Can't connect to the DB: $DBI::errstr\n
+";  
    my $quest = $dbh->prepare($statement, { RaiseError => 1 }) or die 
+"Cannot prepare statement! $DBI::errstr\n";
    if ($DEBUG>0) {
      $quest->execute() or die qq|CONNECT DATABASE Statement:  \n $sta
+tement \n
        \n Error in first database statement! \n $DBI::errstr \n |;
    } else { $quest->execute() };
    
    
    while(my @row = $quest->fetchrow_array()) {
        foreach my $item (@row) {
            push @results, $item;
        }
    }

    return @results;
     
} # END SUB connectdb
[download]

Blessings,

~Polyglot~

Back to Seekers of Perl Wisdom